AI & Agent Evaluation
475total visitsadmin

Anthropic — Demystifying evals for AI agents

engineering blog · source date 2026-01-09 · added 2026-05-17 19:33:04 · updated 2026-05-30 17:20:13 · Open original blog

Problems / challenges / motivations

  • Agent evals are different from single-turn chat evals because agents use tools, change external state, and may fail across multiple turns even when the final answer sounds correct.
  • Final-message grading misses the most important question: did the task actually succeed in the environment, database, browser, files, or product workflow?
  • Aggregate scores can hide broken graders, ambiguous tasks, flaky environments, and valid solutions that do not match the expected path.

Key ideas

  • Anthropic breaks an agent eval into task, trial, agent harness, eval harness, transcript or trace, outcome, grader, and suite.
  • Outcome and state checks should be preferred where possible: tests pass, records changed correctly, reservations exist, browser state is right, or tool side effects match expectations.
  • Grading should combine deterministic checks, model-based judges, and human calibration instead of relying on one judgment channel.
  • Capability evals and regression evals serve different purposes: one measures frontier ability, the other protects behavior that already works.
  • Multiple trials matter because agent behavior is stochastic; pass@k and pass^k answer different reliability questions.

Why it matters for evals

  • This is the baseline reading for building agent eval systems as engineering infrastructure rather than as one-off benchmark scores.
  • The reusable lesson is to grade real outcomes first, preserve traces, inspect failures, and maintain eval suites like product tests.
  • It gives teams a practical vocabulary for separating model weakness, harness weakness, task ambiguity, grader error, and environment noise.

read deep dive

Comments

commented 2026-05-17 19:58:48
1. Task A single evaluation scenario. Example: “Ask the agent to book a meeting next Tuesday with Alice and Bob.” It includes: - input prompt - starting environment/state - expected success criteria 2. Trial One run of the agent on one task. Because agents are non-deterministic, the same task may be run multiple times. Example: Run the same booking task 5 times and compare success rate. 3. Grader The mechanism that decides whether the trial succeeded. Types: - code grader: checks database/file/API state - LLM grader: judges natural-language quality - human grader: expert/user manually scores it Example: Check whether the calendar event was actually created with the right attendees and time. 4. Transcript / Trace The full record of what the agent did during the trial. Includes: - user messages - model responses - tool calls - tool outputs - intermediate reasoning if available - errors/retries Useful for debugging why a task passed or failed. 5. Outcome The final result or final environment state after the agent finishes. This is usually more important than the final text. Example: The agent says “meeting booked” is not enough. The actual calendar must contain the correct meeting. 6. Eval harness The system that runs and manages evaluations. It handles: - loading tasks - resetting environments - running trials - collecting transcripts - invoking graders - aggregating metrics Think of it as the test runner for agent evals. 7. Agent harness The wrapper/scaffold that turns a model into an agent. It defines: - available tools - message loop - memory/context handling - stopping rules - tool-calling protocol - system prompt - environment interface