Vercel — AGENTS.md outperforms skills in our agent evals
engineering blog · source date 2026-01-27 · added 2026-05-17 19:43:41 · updated 2026-05-30 17:20:13
1. Problems / challenges / motivations
- Vercel wanted coding agents to use version-matched Next.js 16 documentation, but optional knowledge packages only help if the agent actually invokes them.
- A support system can look good in theory while failing at the trigger layer: the agent may not know when to load a skill, may load it too late, or may be sensitive to wording.
- Coding-agent evals therefore need to measure harness and UX behavior, not just the base model's coding ability.
2. Key ideas
- In Vercel's evals, a compressed documentation index in always-present AGENTS.md reached a 100% pass rate.
- Skills underperformed because the agent often failed to invoke them; in 56% of cases, the skill was never used.
- Explicit instructions improved skill usage but made results fragile and prompt-sensitive.
- Persistent context can beat optional tools when the main bottleneck is retrieval or triggering rather than knowledge quality.
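To make the "compressed documentation index" idea concrete, here is a minimal sketch of how a list of documentation files might be collapsed into a single compact line that can sit in AGENTS.md. The function name `build_index`, the `dir:{file,file}` layout, the `|` separator, and the example file names are all assumptions for illustration; the summary above does not specify the format Vercel actually used.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def build_index(doc_paths: list[str], root_label: str = "docs") -> str:
    """Collapse doc file paths into one compact, pipe-separated line.

    Hypothetical format: "[docs index]|dir:{file1,file2}|dir2:{file3}".
    The aim is to tell the agent which files exist and where, not to
    inline their contents.
    """
    by_dir: dict[str, list[str]] = defaultdict(list)
    for raw in doc_paths:
        path = PurePosixPath(raw)
        by_dir[str(path.parent)].append(path.name)

    parts = [f"[{root_label} index]"]
    for directory in sorted(by_dir):
        files = ",".join(sorted(by_dir[directory]))
        parts.append(f"{directory}:{{{files}}}")
    return "|".join(parts)


# Example with made-up file names.
example_index = build_index(
    ["routing/layouts.mdx", "routing/pages.mdx", "caching/overview.mdx"]
)
print(example_index)
```

Because the resulting line is always present in the agent's context, there is no trigger step that can fail; the cost is a fixed number of tokens on every turn, which is why the index lists file locations instead of file contents.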
3. Why it matters for evals
- The experiment shows that evals can expose harness failures that would otherwise be misattributed to model weakness.
- For agent design, the question is not only “does the right information exist?” but “does the agent reliably receive and use it at the right time?”
- The reusable eval pattern is to test support-system usage directly: trigger rate, timing, context quality, pass rate, and sensitivity to instruction wording.
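The eval pattern in the last bullet can be sketched as a small metrics summary over recorded agent runs. This is an illustrative sketch, not Vercel's harness: the `Run` record, its field names, and the definition of wording sensitivity (spread between the best and worst pass rate across instruction variants) are assumptions. Context quality is left out because it needs a separate grader.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Run:
    variant: str                 # which instruction wording was used (hypothetical label)
    passed: bool                 # did the task's checks pass?
    trigger_step: Optional[int]  # step at which the skill/docs were first loaded; None if never
    total_steps: int             # number of agent steps in the run (assumed > 0)


def summarize(runs: list[Run]) -> dict:
    """Compute pass rate, trigger rate, trigger timing, and wording sensitivity."""
    if not runs:
        raise ValueError("need at least one run")

    triggered = [r for r in runs if r.trigger_step is not None]

    # Group pass/fail outcomes by instruction wording.
    outcomes: dict[str, list[bool]] = {}
    for r in runs:
        outcomes.setdefault(r.variant, []).append(r.passed)
    by_variant = {v: sum(p) / len(p) for v, p in outcomes.items()}

    # Timing: how far into the run the support system was first used (0 = start).
    mean_position = (
        sum(r.trigger_step / r.total_steps for r in triggered) / len(triggered)
        if triggered
        else None
    )

    return {
        "pass_rate": sum(r.passed for r in runs) / len(runs),
        "trigger_rate": len(triggered) / len(runs),
        "mean_trigger_position": mean_position,
        "pass_rate_by_variant": by_variant,
        "wording_sensitivity": max(by_variant.values()) - min(by_variant.values()),
    }


# Made-up runs, for illustration only.
demo_runs = [
    Run("explicit", True, 1, 10),
    Run("explicit", True, 2, 10),
    Run("implicit", False, None, 10),
    Run("implicit", True, 5, 10),
]
demo_summary = summarize(demo_runs)
```

Reporting the trigger rate next to the pass rate is the point of the design: a low pass rate paired with a low trigger rate suggests a harness problem, while a low pass rate with a high trigger rate suggests the knowledge or the model is the limit.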