Anthropic — Demystifying evals for AI agents

engineering blog · source date 2026-01-09 · added 2026-05-17 19:33:04 · updated 2026-05-30 17:20:13 · Open original blog

Problems / challenges / motivations

Agent evals are different from single-turn chat evals because agents use tools, change external state, and may fail across multiple turns even when the final answer sounds correct.
Final-message grading misses the most important question: did the task actually succeed in the environment, database, browser, files, or product workflow?
Aggregate scores can hide broken graders, ambiguous tasks, flaky environments, and valid solutions that do not match the expected path.

Key ideas

Anthropic breaks an agent eval into task, trial, agent harness, eval harness, transcript or trace, outcome, grader, and suite.
Outcome and state checks should be preferred where possible: tests pass, records changed correctly, reservations exist, browser state is right, or tool side effects match expectations.
Grading should combine deterministic checks, model-based judges, and human calibration instead of relying on one judgment channel.
Capability evals and regression evals serve different purposes: one measures frontier ability, the other protects behavior that already works.
Multiple trials matter because agent behavior is stochastic; pass@k and pass^k answer different reliability questions.

Why it matters for evals

This is the baseline reading for building agent eval systems as engineering infrastructure rather than as one-off benchmark scores.
The reusable lesson is to grade real outcomes first, preserve traces, inspect failures, and maintain eval suites like product tests.
It gives teams a practical vocabulary for separating model weakness, harness weakness, task ambiguity, grader error, and environment noise.

read deep dive

Comments

commented 2026-05-17 19:58:48

1. Task A single evaluation scenario. Example: “Ask the agent to book a meeting next Tuesday with Alice and Bob.” It includes: - input prompt - starting environment/state - expected success criteria 2. Trial One run of the agent on one task. Because agents are non-deterministic, the same task may be run multiple times. Example: Run the same booking task 5 times and compare success rate. 3. Grader The mechanism that decides whether the trial succeeded. Types: - code grader: checks database/file/API state - LLM grader: judges natural-language quality - human grader: expert/user manually scores it Example: Check whether the calendar event was actually created with the right attendees and time. 4. Transcript / Trace The full record of what the agent did during the trial. Includes: - user messages - model responses - tool calls - tool outputs - intermediate reasoning if available - errors/retries Useful for debugging why a task passed or failed. 5. Outcome The final result or final environment state after the agent finishes. This is usually more important than the final text. Example: The agent says “meeting booked” is not enough. The actual calendar must contain the correct meeting. 6. Eval harness The system that runs and manages evaluations. It handles: - loading tasks - resetting environments - running trials - collecting transcripts - invoking graders - aggregating metrics Think of it as the test runner for agent evals. 7. Agent harness The wrapper/scaffold that turns a model into an agent. It defines: - available tools - message loop - memory/context handling - stopping rules - tool-calling protocol - system prompt - environment interface

Anthropic — Demystifying evals for AI agents

Problems / challenges / motivations

Key ideas

Why it matters for evals

Related posts

Comments