Anthropic — Demystifying evals for AI agents
engineering blog · source date 2026-01-09 · added 2026-05-17 19:33:04 · updated 2026-05-30 17:20:13
1. Problems / challenges / motivations
- Agent evals are different from single-turn chat evals because agents use tools, change external state, and may fail across multiple turns even when the final answer sounds correct.
- Final-message grading misses the most important question: did the task actually succeed in the environment, database, browser, files, or product workflow?
- Aggregate scores can hide broken graders, ambiguous tasks, flaky environments, and valid solutions that do not match the expected path.
2. Key ideas
- Anthropic breaks an agent eval into task, trial, agent harness, eval harness, transcript or trace, outcome, grader, and suite.
- Outcome and state checks should be preferred where possible: tests pass, records changed correctly, reservations exist, browser state is right, or tool side effects match expectations.
- Grading should combine deterministic checks, model-based judges, and human calibration instead of relying on one judgment channel.
- Capability evals and regression evals serve different purposes: one measures frontier ability, the other protects behavior that already works.
- Multiple trials matter because agent behavior is stochastic; pass@k asks whether at least one of k trials succeeds, while pass^k asks whether all k trials succeed, so the two answer different reliability questions.
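
The outcome-check and multiple-trial ideas above can be sketched in a few lines of Python. This is a minimal illustration, not Anthropic's harness: the `Trial` class, `grade_outcome`, `pass_at_k`, `pass_hat_k`, and the reservation state are all hypothetical names invented here, and the two formulas assume trials are independent with the same per-trial success rate p.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """One attempt at a task (hypothetical shape, for illustration only)."""
    final_message: str  # what the agent said at the end
    final_state: dict   # what the environment actually looks like afterwards


def grade_outcome(trial: Trial, expected_state: dict) -> bool:
    """Deterministic state check: every expected key must hold the expected value.

    The final message is deliberately ignored; only the environment counts.
    """
    return all(trial.final_state.get(k) == v for k, v in expected_state.items())


def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent trials succeeds."""
    return 1.0 - (1.0 - p) ** k


def pass_hat_k(p: float, k: int) -> float:
    """Probability that all k independent trials succeed (pass^k)."""
    return p ** k


# Made-up trials for one task: "book a table".
trials = [
    Trial("Booked your table.", {"reservation": "confirmed"}),
    Trial("Booked your table.", {"reservation": None}),  # sounds right, state is wrong
    Trial("Done.", {"reservation": "confirmed"}),
    Trial("All set!", {"reservation": "confirmed"}),
]
expected = {"reservation": "confirmed"}

results = [grade_outcome(t, expected) for t in trials]
p = sum(results) / len(results)  # observed per-trial success rate

print(f"per-trial success rate: {p:.2f}")   # 0.75
print(f"pass@3: {pass_at_k(p, 3):.3f}")     # 0.984
print(f"pass^3: {pass_hat_k(p, 3):.3f}")    # 0.422
```

The second trial shows why final-message grading is not enough: the message is identical to a passing trial, but the state check fails. With the same 75% per-trial rate, pass@3 looks nearly perfect while pass^3 falls below half, which is the gap between "can it ever do this" and "can users rely on it every time".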
3. Why it matters for evals
- This is the baseline reading for building agent eval systems as engineering infrastructure rather than as one-off benchmark scores.
- The reusable lesson is to grade real outcomes first, preserve traces, inspect failures, and maintain eval suites like product tests.
- It gives teams a practical vocabulary for separating model weakness, harness weakness, task ambiguity, grader error, and environment noise.