AI & Agent Evaluation


Vercel — AGENTS.md outperforms skills in our agent evals

engineering blog · source date 2026-01-27 · added 2026-05-17 19:43:41 · updated 2026-05-30 17:20:13

coding agents · harnesses · benchmarks

Problems / challenges / motivations

Vercel wanted coding agents to use version-matched Next.js 16 documentation, but optional knowledge packages only help if the agent actually invokes them.
A support system can look good in theory while failing at the trigger layer: the agent may not know when to load a skill, may load it too late, or may be sensitive to wording.
Coding-agent evals therefore need to measure harness and UX behavior, not just the base model's coding ability.

Key ideas

In Vercel's evals, a compressed documentation index placed in an always-present AGENTS.md file reached a 100% pass rate.
Skills underperformed because the agent often failed to invoke them; in 56% of cases, the skill was never used.
Explicit instructions improved skill usage but made results fragile and prompt-sensitive.
Persistent context can beat optional tools when the main bottleneck is retrieval or triggering rather than knowledge quality.

Why it matters for evals

The experiment shows that evals can expose harness failures that would otherwise be misattributed to model weakness.
For agent design, the question is not only “does the right information exist?” but “does the agent reliably receive and use it at the right time?”
The reusable eval pattern is to test support-system usage directly: trigger rate, timing, context quality, pass rate, and sensitivity to instruction wording.
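
As a rough illustration of that pattern, the sketch below computes trigger rate, trigger timing, pass rate, and wording sensitivity from a list of eval runs. It is not Vercel's harness: the EvalRun record, its field names, and the choice to measure sensitivity as the spread in pass rate across prompt variants are all assumptions made for this example.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class EvalRun:
    """One agent run on one task (hypothetical record shape)."""
    task_id: str
    prompt_variant: str                    # which instruction wording was used
    passed: bool                           # did the task's checks pass?
    skill_invoked_at_step: Optional[int]   # None if the skill was never invoked
    total_steps: int                       # number of agent steps in the run


def trigger_rate(runs: list[EvalRun]) -> float:
    """Fraction of runs in which the skill was invoked at all."""
    return sum(r.skill_invoked_at_step is not None for r in runs) / len(runs)


def pass_rate(runs: list[EvalRun]) -> float:
    """Fraction of runs that passed."""
    return sum(r.passed for r in runs) / len(runs)


def mean_trigger_position(runs: list[EvalRun]) -> Optional[float]:
    """Average point in the run (0 = start, 1 = end) at which the skill loaded.

    Returns None when no run invoked the skill.
    """
    positions = [
        r.skill_invoked_at_step / r.total_steps
        for r in runs
        if r.skill_invoked_at_step is not None
    ]
    return mean(positions) if positions else None


def wording_sensitivity(runs: list[EvalRun]) -> float:
    """Spread (max minus min) of pass rate across prompt variants."""
    by_variant: dict[str, list[EvalRun]] = {}
    for r in runs:
        by_variant.setdefault(r.prompt_variant, []).append(r)
    rates = [pass_rate(group) for group in by_variant.values()]
    return max(rates) - min(rates)
```

Reporting trigger rate next to pass rate is what separates "the agent had the knowledge and still failed" from "the agent never loaded the knowledge". A large wording-sensitivity value suggests the setup depends on prompt phrasing rather than on a reliable trigger.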
