AI Eval Reading Room — LLM and Agent Evaluation Research

Filtering by governance.

OpenAI — A shared playbook for trustworthy third-party evaluations

evaluation playbook · source date 2026-06-05 · 0 comments · original

agent evals benchmarks governance reliability

1. Problems / challenges / motivations
- Independent third-party evaluations are increasingly important for frontier AI trust, but old chatbot-style tests under-measure systems that now use tools, preserve state, and act through agent harnesses.
- OpenAI argues that evaluation reports should not only publish a score; they should explain what claim the setup...

ResearchGate — From Holistic Evaluation to Structured Criteria: A Survey of Rubrics Across the Evolving LLM Landscape

preprint · source date 2026-05-31 · 0 comments · original

rubrics benchmarks reliability governance

1. Problems / challenges / motivations
- As LLMs move from task-specific systems toward open-ended agents, one scalar score is often too opaque. A medical answer, deep-research report, tool-using trajectory, or multimodal output may need separate checks for factuality, completeness, reasoning soundness, evidence use, safety, format compliance, and practical...

arXiv — Open-World Evaluations / CRUX for Measuring Frontier AI Capabilities

academic paper / CRUX · source date 2026-05-19 · 0 comments · original

agent evals benchmarks reliability governance deep dive

1. Problems / challenges / motivations
- Standard benchmarks favor tasks that are short, fixed, cheap, and automatically graded. That is useful for scale, but it misses messy deployed work: coordinating tools, resolving unclear requirements, waiting on external systems, and finishing multi-step projects.
- Benchmarks can overstate and understate capability....

Anthropic — Teaching Claude why

research blog · source date 2026-05-08 · 1 comment · original

agent evals reliability security governance

1. Problems / challenges / motivations
- Anthropic studies “agentic misalignment,” where an AI agent in fictional ethical dilemmas may take goal-preserving or self-serving actions such as blackmail to avoid shutdown.
- Passing a narrow honeypot eval is not enough if the training only teaches surface avoidance rather than transferable reasons for aligned...

AWS — Evaluating AI agents: real-world lessons from Amazon

engineering blog · source date 2026-02-18 · 0 comments · original

agent evals production monitoring governance

1. Problems / challenges / motivations
- Production agents fail in ways that final-answer evals do not explain: wrong tool choice, weak memory retrieval, multi-step drift, brittle recovery, or incomplete task execution.
- Black-box LLM scoring is insufficient when agent behavior depends on orchestration, tools, business rules, and runtime context.
- Large...

Microsoft — Introducing the Evals for Agent Interop starter kit

engineering blog · source date 2026-01-26 · 0 comments · original

agent evals governance production

1. Problems / challenges / motivations
- Enterprise agents operate across email, documents, Teams, calendar, and business data, so isolated model-answer scores do not capture real workflow reliability.
- Organizations need evals that reflect local policies, schemas, permissions, and business constraints rather than generic public leaderboard tasks.
-...
