AI Eval Reading Room — LLM and Agent Evaluation Research

Sort by source dateLatest first Earliest first

OpenAI — A shared playbook for trustworthy third-party evaluations

evaluation playbook · source date 2026-06-05 · 0 comments · original

agent evals benchmarks governance reliability

1. Problems / challenges / motivations - Independent third-party evaluations are increasingly important for frontier AI trust, but old chatbot-style tests under-measure systems that now use tools, preserve state, and act through agent harnesses. - OpenAI argues that evaluation reports should not only publish a score; they should explain what claim the setup...

Anthropic — Dynamic workflows in Claude Code

Claude Code docs · source date 2026-06-02 · 0 comments · original

coding agents harnesses agent evals reliability

1. Problems / challenges / motivations - Large coding-agent tasks often exceed what one linear chat can manage. Audits, migrations, and cross-checks need many independent passes, shared structure, and reproducible coordination. - Static hand-written harnesses can become a bottleneck: the right decomposition depends on the repository, task, files, risks, and...

ResearchGate — From Holistic Evaluation to Structured Criteria: A Survey of Rubrics Across the Evolving LLM Landscape

preprint · source date 2026-05-31 · 0 comments · original

rubrics benchmarks reliability governance

1. Problems / challenges / motivations - As LLMs move from task-specific systems toward open-ended agents, one scalar score is often too opaque. A medical answer, deep-research report, tool-using trajectory, or multimodal output may need separate checks for factuality, completeness, reasoning soundness, evidence use, safety, format compliance, and practical...

arXiv — ProofAgent Harness: Open Infrastructure for Adversarial Evaluation of AI Agents

arXiv paper · source date 2026-05-22 · 0 comments · original

agent evals harnesses security reliability

1. Problems / challenges / motivations - Agent products increasingly use tools, remember context, handle private data, and interact across many turns, so isolated-output grading misses failures that emerge only through trajectory and pressure. - Static benchmarks can hide selective weakness: an agent may look strong on a headline score while failing through...

arXiv — AgentAtlas: Beyond Outcome Leaderboards for LLM Agents

arXiv paper · source date 2026-05-19 · 0 comments · original

agent evals benchmarks reliability

1. Problems / challenges / motivations - Outcome leaderboards are too flat: one pass/fail score hides whether an agent chose the right action, used tools safely, or recovered after an error. - Agent benchmarks reward different behaviors: final success, tool-call validity, repeated-pass consistency, trajectory safety, or attack robustness. That makes...

arXiv — Open-World Evaluations / CRUX for Measuring Frontier AI Capabilities

academic paper / CRUX · source date 2026-05-19 · 0 comments · original

agent evals benchmarks reliability governance deep dive

1. Problems / challenges / motivations - Standard benchmarks favor tasks that are short, fixed, cheap, and automatically graded. That is useful for scale, but it misses messy deployed work: coordinating tools, resolving unclear requirements, waiting on external systems, and finishing multi-step projects. - Benchmarks can overstate and understate capability....

arXiv — Code as Agent Harness: Toward Executable, Verifiable, and Stateful Agent Systems

arXiv survey · source date 2026-05-18 · 0 comments · original

agent evals harnesses coding agents reliability deep dive

1. Problems / challenges / motivations - Modern LLM agents increasingly succeed or fail because of the runtime around the model: tools, code execution, memory, sandboxes, repositories, validators, permissions, traces, and feedback loops. - Final task success is too flat for this world. It can hide whether the model reasoned well, the harness supplied useful...

OpenReview — Agent Harness Engineering: A Survey

OpenReview survey · source date 2026-05-14 · 0 comments · original

agent evals harnesses production monitoring deep dive

1. Problems / challenges / motivations - The paper argues that real-world LLM-agent reliability is often constrained less by the base model than by the execution harness around it: environment, tools, context, orchestration, observability, evaluation, and governance. - Prompt engineering and context engineering are no longer enough for production agents....

Anthropic — Teaching Claude why

research blog · source date 2026-05-08 · 1 comments · original

agent evals reliability security governance

1. Problems / challenges / motivations - Anthropic studies “agentic misalignment,” where an AI agent in fictional ethical dilemmas may take goal-preserving or self-serving actions such as blackmail to avoid shutdown. - Passing a narrow honeypot eval is not enough if the training only teaches surface avoidance rather than transferable reasons for aligned...

Adaline — Evaluating AI Agents In 2026: Benchmarks For Teams

industry blog · source date 2026-05-07 · 0 comments · original

agent evals benchmarks production

1. Problems / challenges / motivations - Agent evaluation has moved beyond answer scoring because agents now navigate websites, use tools, edit files, run terminals, recover from failures, and trade off cost and latency. - Public benchmarks measure different slices of capability, so one leaderboard number cannot tell a team whether an agent fits its...

OpenAI — GPT-5.5 System Card

system card · source date 2026-04-23 · 0 comments · original

1. Problems / challenges / motivations - OpenAI's GPT-5.5 System Card evaluates a model expected to do real work: coding, research, document creation, tool use, and multi-step tasks. - The safety question is broader than chat quality because deployed agentic systems can take actions, interact with tools, and create operational risks. - Offline benchmark...

Anthropic — An update on recent Claude Code quality reports

engineering postmortem · source date 2026-04-23 · 0 comments · original

coding agents production monitoring

1. Problems / challenges / motivations - Anthropic describes Claude Code quality regressions caused by product-layer changes rather than a simple base-model failure. - Changes to reasoning effort, caching, and prompt instructions affected user experience in ways internal evals did not initially reproduce. - This exposes a common production-eval gap: offline...

Google Research — Evaluating alignment of behavioral dispositions in LLMs

research blog + paper · source date 2026-04-03 · 0 comments · original

1. Problems / challenges / motivations - Google Research studies how to evaluate behavioral dispositions such as empathy, assertiveness, composure, and conflict handling in LLMs. - Asking a model to self-report traits is weak evidence because the model can state a preference without showing how it behaves in context. - Alignment on social behavior is...

Google Research — Building better AI benchmarks: How many raters are enough?

research blog + paper · source date 2026-03-31 · 0 comments · original

benchmarks reliability

1. Problems / challenges / motivations - Human-backed AI benchmarks often collapse disagreement into a single label even when the task is subjective. - Benchmark builders face an annotation-budget tradeoff: rate more items with fewer raters each, or fewer items with more raters each. - Too few raters can make model comparisons fragile, especially for...

arXiv — Meta-Harness: End-to-End Optimization of Model Harnesses

arXiv paper · source date 2026-03-30 · 0 comments · original

harnesses agent evals coding agents

1. Problems / challenges / motivations - Meta-Harness starts from a harness-engineering problem: the same frozen model can perform very differently depending on surrounding code for retrieval, memory, prompt construction, tool loops, and completion logic. - Existing text optimizers often compress experience into scalar scores, short summaries, fixed...

Anthropic — Harness design for long-running application development

engineering blog · source date 2026-03-24 · 0 comments · original

coding agents harnesses reliability

1. Problems / challenges / motivations - Long-running coding and frontend-generation agents degrade as context fills, coherence drops, and models develop “context anxiety.” - A single agent may be too generous when judging its own work, especially on subjective outputs such as design quality. - For long tasks, the surrounding harness can matter as much as...

Anthropic — Eval awareness in Claude Opus 4.6’s BrowseComp performance

engineering blog · source date 2026-03-06 · 0 comments · original

benchmarks security reliability

1. Problems / challenges / motivations - Anthropic reports cases where Claude Opus 4.6 inferred it might be inside BrowseComp, searched for benchmark materials, and found or decrypted answer keys. - Web-enabled evaluations are vulnerable to public contamination from papers, blog posts, GitHub repositories, answer keys, and benchmark discussions. - The...

OpenAI Developers — Run long horizon tasks with Codex

developer blog · source date 2026-02-23 · 1 comments · original

coding agents reliability agent evals

1. Problems / challenges / motivations - OpenAI's developer post frames long-horizon reliability as a major shift for coding agents: real work requires maintaining intent across extended tasks, not just solving isolated snippets. - Longer tasks create failure modes that short benchmarks miss: requirement drift, context loss, weak recovery, unreviewable...

AWS — Evaluating AI agents: real-world lessons from Amazon

engineering blog · source date 2026-02-18 · 0 comments · original

agent evals production monitoring governance

1. Problems / challenges / motivations - Production agents fail in ways that final-answer evals do not explain: wrong tool choice, weak memory retrieval, multi-step drift, brittle recovery, or incomplete task execution. - Black-box LLM scoring is insufficient when agent behavior depends on orchestration, tools, business rules, and runtime context. - Large...

Anthropic — Quantifying infrastructure noise in agentic coding evals

engineering blog · source date 2026-02-05 · 0 comments · original

coding agents benchmarks reliability

1. Problems / challenges / motivations - Agentic coding benchmarks are sensitive to infrastructure: CPU, RAM, timeouts, container limits, filesystem behavior, and sandbox configuration. - Infrastructure differences can move scores by several percentage points, sometimes more than the reported gap between leaderboard models. - Strict resource ceilings can...

Vercel — AGENTS.md outperforms skills in our agent evals

engineering blog · source date 2026-01-27 · 0 comments · original

coding agents harnesses benchmarks

1. Problems / challenges / motivations - Vercel wanted coding agents to use version-matched Next.js 16 documentation, but optional knowledge packages only help if the agent actually invokes them. - A support system can look good in theory while failing at the trigger layer: the agent may not know when to load a skill, may load it too late, or may be...

Microsoft — Introducing the Evals for Agent Interop starter kit

engineering blog · source date 2026-01-26 · 0 comments · original

agent evals governance production

1. Problems / challenges / motivations - Enterprise agents operate across email, documents, Teams, calendar, and business data, so isolated model-answer scores do not capture real workflow reliability. - Organizations need evals that reflect local policies, schemas, permissions, and business constraints rather than generic public leaderboard tasks. -...

Anthropic — Designing AI-resistant technical evaluations

engineering blog · source date 2026-01-21 · 0 comments · original

benchmarks reliability security

1. Problems / challenges / motivations - Anthropic's performance-engineering take-home interview lost signal as Claude became strong enough to solve earlier versions of the task. - Static technical evaluations decay when AI assistance improves; a task that once measured human skill can become a test of whether the candidate uses a strong enough model. -...

Anthropic — Demystifying evals for AI agents

engineering blog · source date 2026-01-09 · 1 comments · original

agent evals benchmarks reliability deep dive

1. Problems / challenges / motivations - Agent evals are different from single-turn chat evals because agents use tools, change external state, and may fail across multiple turns even when the final answer sounds correct. - Final-message grading misses the most important question: did the task actually succeed in the environment, database, browser, files,...

Reading room