AI & Agent Evaluation

Vercel — AGENTS.md outperforms skills in our agent evals

engineering blog · source date 2026-01-27 · added 2026-05-17 19:43:41 · updated 2026-05-30 17:20:13

Problems / challenges / motivations

  • Vercel wanted coding agents to use version-matched Next.js 16 documentation, but optional knowledge packages only help if the agent actually invokes them.
  • A support system such as a skill can look good in theory while failing at the trigger layer: the agent may not know when to load it, may load it too late, or may be sensitive to how the instruction to use it is worded.
  • Coding-agent evals therefore need to measure harness and UX behavior, not just the base model's coding ability.

Key ideas

  • In Vercel's evals, a compressed documentation index placed in an always-present AGENTS.md file reached a 100% pass rate.
  • Skills underperformed because the agent often failed to invoke them; in 56% of cases, the skill was never used.
  • Explicit instructions improved skill usage but made results fragile and prompt-sensitive.
  • Persistent context can beat optional tools when the main bottleneck is retrieval or triggering rather than knowledge quality.
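To make the "compressed documentation index" idea concrete, here is a minimal sketch of how such an index might be generated from a list of documentation file paths. The output layout, the `.docs` root, and the function name `build_docs_index` are all assumptions made for illustration; the source does not describe the format Vercel actually used.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def build_docs_index(doc_paths, root=".docs"):
    """Build a compact, one-line-per-directory index of documentation files.

    Hypothetical format: the idea is that the text is small enough to live
    permanently in AGENTS.md, so the agent always sees where docs are
    without having to decide to invoke anything.
    """
    files_by_dir = defaultdict(list)
    for raw in doc_paths:
        path = PurePosixPath(raw)
        files_by_dir[str(path.parent)].append(path.name)

    lines = [f"Docs root: {root}"]
    for directory in sorted(files_by_dir):
        names = ", ".join(sorted(files_by_dir[directory]))
        lines.append(f"{directory}: {names}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Example paths are invented for the demo.
    print(build_docs_index([
        "routing/pages.md",
        "routing/layouts.md",
        "caching/overview.md",
    ]))
```

The index only lists where documents are, not their contents, so the always-present cost stays small while the agent can still open the right file when a task calls for it.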

Why it matters for evals

  • The experiment shows evals can expose harness failures that would be misattributed to model weakness.
  • For agent design, the question is not only “does the right information exist?” but “does the agent reliably receive and use it at the right time?”
  • The reusable eval pattern is to test support-system usage directly: trigger rate, timing, context quality, pass rate, and sensitivity to instruction wording.
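The eval pattern above can be sketched as a small summary over per-run records. This is an illustrative sketch, not Vercel's harness: the `RunRecord` fields, the `summarize` function, and the use of "first step at which the support system was used" as a timing proxy are all assumptions. Context quality is left out because it usually needs human or model grading rather than simple counting.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunRecord:
    """One eval run (hypothetical schema)."""
    variant: str                 # label for the instruction wording used
    passed: bool                 # did the task pass its checks?
    trigger_step: Optional[int]  # step of first support-system use; None if never used


def summarize(runs):
    """Compute pass rate, trigger rate, timing, and wording sensitivity."""
    if not runs:
        raise ValueError("need at least one run")

    total = len(runs)
    triggered = [r for r in runs if r.trigger_step is not None]

    # Group pass/fail outcomes by instruction wording variant.
    outcomes_by_variant = {}
    for r in runs:
        outcomes_by_variant.setdefault(r.variant, []).append(r.passed)
    pass_rate_by_variant = {
        v: sum(outcomes) / len(outcomes)
        for v, outcomes in outcomes_by_variant.items()
    }

    return {
        "pass_rate": sum(r.passed for r in runs) / total,
        "trigger_rate": len(triggered) / total,
        # Timing proxy: average step at which the support system was first used.
        "mean_trigger_step": (
            sum(r.trigger_step for r in triggered) / len(triggered)
            if triggered else None
        ),
        "pass_rate_by_variant": pass_rate_by_variant,
        # Wording sensitivity: gap between best and worst variant.
        "variant_spread": max(pass_rate_by_variant.values())
        - min(pass_rate_by_variant.values()),
    }


if __name__ == "__main__":
    # Demo data is invented, not taken from the blog post.
    demo = [
        RunRecord("explicit", True, 1),
        RunRecord("explicit", True, 3),
        RunRecord("implicit", False, None),
        RunRecord("implicit", True, 2),
    ]
    print(summarize(demo))
```

Reporting trigger rate next to pass rate is what separates "the agent never loaded the skill" from "the agent loaded it and still failed", which is the distinction the experiment turns on.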
