AI & Agent Evaluation

engineering blog · source date 2026-01-09 · added 2026-05-18 02:01:48 · updated 2026-06-13 12:56:30

Deep dive — Anthropic: Demystifying evals for AI agents

Source: Anthropic Engineering, published 2026-01-09. Original: https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents

Why this matters

Anthropic's core claim is simple but important: evaluating an agent is not the same as evaluating a chat response. Agents act over many turns, use tools, mutate external state, and can fail in ways that are invisible if you only grade the final message.

The practical implication: a good agent eval should grade the task outcome and the environment state, then use transcript review to understand why the result happened.

Anthropic article header illustration. Source: Anthropic.

1. Agent evals have more moving parts than prompt evals

A simple LLM eval is usually: prompt → model response → grader. That works when the target behavior is a single answer.

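An agent eval has more structure: the agent acts over many turns, calls tools, and changes an environment, so each run leaves both a transcript of what happened and a final state that can be graded.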

The most useful distinction is transcript vs outcome. A flight agent can say “your flight is booked” while the database shows no reservation. For agent products, the database state is usually more important than the final sentence.
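
The transcript/outcome split can be made concrete with a small record type. This is a minimal sketch, not the article's data model: the `Trial` class, its field names, and the `reservations` key are all hypothetical, chosen only to mirror the flight example above.

```python
from dataclasses import dataclass, field


@dataclass
class Trial:
    """One attempt at a task: what the agent said, and what actually changed."""
    # Messages and tool calls, in order (hypothetical format).
    transcript: list[str] = field(default_factory=list)
    # Environment state after the run (hypothetical format).
    outcome: dict[str, list[str]] = field(default_factory=dict)


def claims_booking(trial: Trial) -> bool:
    """Transcript check: does the final message say the flight was booked?"""
    return bool(trial.transcript) and "booked" in trial.transcript[-1].lower()


def has_reservation(trial: Trial, passenger: str) -> bool:
    """Outcome check: is there actually a reservation for this passenger?"""
    return passenger in trial.outcome.get("reservations", [])


# The agent reports success, but the reservations list is empty.
trial = Trial(
    transcript=["user: book me on the 9am flight", "agent: Your flight is booked."],
    outcome={"reservations": []},
)
print(claims_booking(trial), has_reservation(trial, "alice"))
```

A grader that only ran `claims_booking` would pass this trial; one that ran `has_reservation` would fail it, which is the behavior you would likely want.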

Simple vs multi-turn/agent evaluation structure. Source: original article.
Components of evaluations for agents. Source: original article.

2. The right grading target is usually the outcome, not the text

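Anthropic's strongest engineering advice is to prefer outcome/state checks wherever possible. For a flight agent, that means checking that the reservation exists in the database rather than trusting the agent's confirmation message.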

Final-answer checks are still useful, but they are weaker for agents because agents can hallucinate completion or solve the task through an unexpected path.
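
A state check of this kind might look like the sketch below. It assumes an in-memory SQLite database stands in for the product backend; the `reservations` table, its columns, and the flight code are hypothetical. The grader inspects only the end state, so any path that produces the right reservation passes.

```python
import sqlite3


def make_env() -> sqlite3.Connection:
    """Create a fresh in-memory database standing in for the product backend."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE reservations (passenger TEXT, flight TEXT)")
    return conn


def grade_outcome(conn: sqlite3.Connection, passenger: str, flight: str) -> bool:
    """Pass only if exactly one matching reservation exists, however it got there."""
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM reservations WHERE passenger = ? AND flight = ?",
        (passenger, flight),
    ).fetchone()
    return count == 1


# Agent A really books the flight; agent B only claims to, so its table stays empty.
env_a, env_b = make_env(), make_env()
env_a.execute("INSERT INTO reservations VALUES (?, ?)", ("alice", "XY100"))

print(grade_outcome(env_a, "alice", "XY100"))
print(grade_outcome(env_b, "alice", "XY100"))
```

Requiring exactly one row also catches a double booking, a failure a final-message check would not see.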

Judgment: this is the difference between a demo eval and a production eval. Demo evals often ask “did the answer sound right?” Production evals ask “did the system do the job safely and correctly?”

3. Use multiple grader types because each fails differently

Deterministic graders are best when the desired state is exactly checkable. They are fast, stable, cheap, and regression-test friendly. But they can reject valid creative solutions if the test is too narrow.

Model-based graders help with open-ended outputs, qualitative rubrics, and transcript triage. But they need calibration because judge models can drift, be biased, or reward plausible but wrong reasoning.

Human graders are expensive but important for subjective quality and for calibrating LLM judges.
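
One common way to calibrate a judge against human labels is to measure agreement on a shared sample. The sketch below computes raw agreement and Cohen's kappa, which corrects for chance agreement. The article does not prescribe this particular statistic, and the labels are made-up illustration data.

```python
def agreement(judge: list[bool], human: list[bool]) -> float:
    """Fraction of items where the judge and the human gave the same label."""
    if len(judge) != len(human) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(j == h for j, h in zip(judge, human)) / len(human)


def cohens_kappa(judge: list[bool], human: list[bool]) -> float:
    """Agreement corrected for the agreement expected by chance."""
    observed = agreement(judge, human)
    n = len(human)
    p_judge, p_human = sum(judge) / n, sum(human) / n
    # Chance agreement: both say pass, or both say fail, independently.
    expected = p_judge * p_human + (1 - p_judge) * (1 - p_human)
    if expected == 1.0:
        return 1.0  # both raters gave one identical constant label
    return (observed - expected) / (1 - expected)


# Illustrative labels only: True means "pass".
judge_labels = [True, True, True, False, True, False, True, True]
human_labels = [True, True, False, False, True, False, True, False]
print(agreement(judge_labels, human_labels), cohens_kappa(judge_labels, human_labels))
```

Here the judge disagrees with the human only by passing items the human failed, which is the kind of pattern worth reading transcripts for before trusting the judge's scores.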

Practical pattern:
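
One possible arrangement of the three grader types, sketched under stated assumptions: use a deterministic state check when one exists, fall back to a model-based judge otherwise, and route uncertain scores to a human. The `Judge` callable, the score range, and the thresholds are hypothetical and would need tuning in practice.

```python
from typing import Callable, Optional

# Hypothetical judge interface: transcript text in, score in [0, 1] out.
Judge = Callable[[str], float]


def grade(
    state_ok: Optional[bool],
    transcript: str,
    judge: Judge,
    low: float = 0.3,
    high: float = 0.7,
) -> str:
    """Return 'pass', 'fail', or 'needs_human'."""
    # A deterministic check, when available, decides the result on its own.
    if state_ok is not None:
        return "pass" if state_ok else "fail"
    # Otherwise ask the judge and escalate anything in the uncertain band.
    score = judge(transcript)
    if score >= high:
        return "pass"
    if score <= low:
        return "fail"
    return "needs_human"


def unsure_judge(transcript: str) -> float:
    """Stand-in judge that is always on the fence."""
    return 0.5


print(grade(True, "agent: done", unsure_judge))
print(grade(None, "agent: here is a summary", unsure_judge))
```

The judge is never consulted when the state check is available, which keeps the regression-friendly path fast and stable.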

4. Non-determinism means one run is not enough

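Agent results vary from run to run. The article highlights two metrics: pass@k, the probability that at least one of k trials succeeds, and pass^k, the probability that all k trials succeed.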

These metrics answer different product questions. pass@k is useful when one successful attempt is enough, such as a research assistant where retries are acceptable. pass^k is stricter and matters for customer-facing agents where consistency is part of the product.

Example from the article: with a 75% per-trial success rate and 3 trials, pass^3 is 0.75³ ≈ 42%. A model can look capable but still be unreliable.
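
The arithmetic can be checked with a short sketch. It assumes trials are independent and share the same per-trial success rate `p`, which is a simplification. The function names are illustrative.

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent trials succeeds."""
    return 1.0 - (1.0 - p) ** k


def pass_hat_k(p: float, k: int) -> float:
    """Probability that all k independent trials succeed (pass^k)."""
    return p ** k


# Same 75% per-trial rate, increasing number of trials.
for k in (1, 3, 10):
    print(k, round(pass_at_k(0.75, k), 3), round(pass_hat_k(0.75, k), 3))
```

At k = 3 the two metrics already tell opposite stories for the same agent: pass@3 is about 98% while pass^3 is about 42%.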

pass@k and pass^k diverge as trials increase. Source: original article.

5. Split capability evals from regression evals

Capability evals ask: “Can the agent do this kind of task yet?” They should include hard tasks and should not all be near 100%. Low pass rates are useful because they leave headroom to measure future model improvements.

Regression evals ask: “Did we break behavior that used to work?” They should be stable, fast enough to run often, and tied to real product failures or important user journeys.

This distinction prevents a common mistake: turning every eval into a release gate. Some evals are for learning; others are for blocking bad releases.
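
A release gate that respects this split could be sketched as follows. The `SuiteResult` type, the suite names, and the thresholds are hypothetical. The point is only that capability suites are reported but never block.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SuiteResult:
    name: str
    kind: str               # "capability" or "regression"
    pass_rate: float
    threshold: float = 1.0  # minimum pass rate; only enforced for regression suites


def release_blockers(results: list[SuiteResult]) -> list[str]:
    """Names of regression suites that fell below their threshold."""
    return [
        r.name
        for r in results
        if r.kind == "regression" and r.pass_rate < r.threshold
    ]


results = [
    SuiteResult("hard-multi-step-tasks", "capability", pass_rate=0.35),
    SuiteResult("checkout-journey", "regression", pass_rate=0.98, threshold=0.99),
    SuiteResult("refund-journey", "regression", pass_rate=1.0, threshold=0.99),
]
print(release_blockers(results))
```

The 35% capability suite does not block the release; the regression suite that slipped below its threshold does.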

The process of creating an effective evaluation. Source: original article.

6. Evals are one layer, not the whole safety net

Anthropic argues for a layered view: automated evals, production monitoring, A/B tests, user feedback, transcript review, and human studies each catch different failure modes.

This is similar to the Swiss cheese model: every layer has holes, but stacked layers reduce the chance that a failure reaches users unnoticed.
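
The stacking effect can be illustrated with simple arithmetic. The sketch below assumes each layer misses a given failure independently of the others, which real layers rarely do, and the miss rates are invented for illustration rather than taken from the article.

```python
import math


def slip_through(miss_rates: dict[str, float]) -> float:
    """Probability a failure is missed by every layer, assuming independence."""
    return math.prod(miss_rates.values())


# Illustrative miss rates only: each layer alone misses a large share of failures.
layers = {
    "automated evals": 0.3,
    "production monitoring": 0.5,
    "transcript review": 0.6,
}
print(round(slip_through(layers), 3))
```

Under these assumed numbers, no single layer catches more than 70% of failures, yet only about 9% slip past all three. Correlated blind spots would make the real figure worse.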

Evaluation methods as layered defenses. Source: original article.

What I would reuse from this article

For any agent project, I would turn this into a checklist:

1. Define the task in terms of user-visible success.
2. Make the environment resettable and isolated.
3. Grade state/outcome first; grade final text second.
4. Store transcripts for every trial.
5. Run multiple trials and report both capability and reliability metrics.
6. Keep a separate regression suite from exploratory capability evals.
7. Sample transcripts weekly, especially after score changes.
8. Treat evals like product infrastructure with owners, maintenance, and versioning.
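
Several checklist items (isolated environments, state-first grading, stored transcripts, multiple trials) can be combined in a small harness. This is a minimal sketch under assumptions: the `Agent` and `Grader` callables, the workspace-as-directory model, and the JSON log layout are all hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable

# Hypothetical interfaces: the agent acts on a workspace and returns its transcript;
# the grader inspects the workspace state afterwards.
Agent = Callable[[Path], list[str]]
Grader = Callable[[Path], bool]


def run_trials(agent: Agent, grader: Grader, k: int, log_dir: Path) -> list[bool]:
    """Run k isolated trials, grade the workspace state, and save each transcript."""
    log_dir.mkdir(parents=True, exist_ok=True)
    results = []
    for i in range(k):
        # A fresh temporary directory per trial keeps runs from affecting each other.
        with tempfile.TemporaryDirectory() as tmp:
            workspace = Path(tmp)
            transcript = agent(workspace)
            passed = grader(workspace)
        record = {"trial": i, "passed": passed, "transcript": transcript}
        (log_dir / f"trial_{i}.json").write_text(json.dumps(record))
        results.append(passed)
    return results


def toy_agent(workspace: Path) -> list[str]:
    """Stand-in agent that writes the expected file."""
    (workspace / "report.txt").write_text("done")
    return ["tool: write report.txt", "agent: Report written."]


def report_exists(workspace: Path) -> bool:
    """State check: the report file exists with the expected content."""
    target = workspace / "report.txt"
    return target.exists() and target.read_text() == "done"


with tempfile.TemporaryDirectory() as logs:
    results = run_trials(toy_agent, report_exists, k=3, log_dir=Path(logs))
    saved = sorted(p.name for p in Path(logs).iterdir())
print(results, saved)
```

The per-trial results feed directly into reliability metrics, and the saved transcripts are what a weekly review would sample from.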

Main caveat

The article is strongest on methodology, not on a universal implementation recipe. Teams still need to decide what state matters, how much nondeterminism is acceptable, and where LLM judges are reliable enough. The hard part is not knowing that outcome checks are better; it is designing outcome checks that match real user value without overfitting to a brittle path.

Bottom line

The article reframes agent evals as systems tests. The unit under test is not just the model; it is the model plus harness, tools, memory, environment, and product assumptions. That is the right mental model for modern agent engineering.
