AI Eval Reading Room — LLM and Agent Evaluation Research

Filtering by production.

OpenReview — Agent Harness Engineering: A Survey

OpenReview survey · source date 2026-05-14

agent evals harnesses production monitoring deep dive

1. Problems / challenges / motivations
- The paper argues that real-world LLM-agent reliability is often constrained less by the base model than by the execution harness around it: environment, tools, context, orchestration, observability, evaluation, and governance.
- Prompt engineering and context engineering are no longer enough for production agents....

Adaline — Evaluating AI Agents In 2026: Benchmarks For Teams

industry blog · source date 2026-05-07

agent evals benchmarks production

1. Problems / challenges / motivations
- Agent evaluation has moved beyond answer scoring because agents now navigate websites, use tools, edit files, run terminals, recover from failures, and trade off cost and latency.
- Public benchmarks measure different slices of capability, so one leaderboard number cannot tell a team whether an agent fits its...

Anthropic — An update on recent Claude Code quality reports

engineering postmortem · source date 2026-04-23

coding agents production monitoring

1. Problems / challenges / motivations
- Anthropic describes Claude Code quality regressions caused by product-layer changes rather than a simple base-model failure.
- Changes to reasoning effort, caching, and prompt instructions affected user experience in ways internal evals did not initially reproduce.
- This exposes a common production-eval gap: offline...

AWS — Evaluating AI agents: real-world lessons from Amazon

engineering blog · source date 2026-02-18

agent evals production monitoring governance

1. Problems / challenges / motivations
- Production agents fail in ways that final-answer evals do not explain: wrong tool choice, weak memory retrieval, multi-step drift, brittle recovery, or incomplete task execution.
- Black-box LLM scoring is insufficient when agent behavior depends on orchestration, tools, business rules, and runtime context.
- Large...

Microsoft — Introducing the Evals for Agent Interop starter kit

engineering blog · source date 2026-01-26

agent evals governance production

1. Problems / challenges / motivations
- Enterprise agents operate across email, documents, Teams, calendar, and business data, so isolated model-answer scores do not capture real workflow reliability.
- Organizations need evals that reflect local policies, schemas, permissions, and business constraints rather than generic public leaderboard tasks.
- ...
