AI Eval Deep Dives — LLM and Agent Evaluation Analysis

deep dive

Deep dive — Open-World Evaluations for Measuring Frontier AI Capabilities

academic paper / CRUX · source date 2026-05-19 · original

## Why this matters

deep dive

Deep dive — Anthropic: Demystifying evals for AI agents

engineering blog · source date 2026-01-09 · original

## Why this matters

deep dive

Agent Harness Engineering: A Survey

OpenReview survey · source date 2026-05-14 · original

The paper's core claim is simple: for long-horizon LLM agents, the binding constraint is often not the base model alone. It is the execution harness around the model. A capable model can still fail if the harness gives it poor context, weak tools, no durable state, a brittle sandbox, inadequate observability, shallow verification, or unsafe permissions. Conversely, a fixed model can look substantially better when the harness improves planning, middleware, self-verification, environment setup, or failure recovery.
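The point about a fixed model looking better under a better harness can be illustrated with a toy sketch. Everything here is hypothetical and not taken from the survey: `stub_model` stands in for a frozen model whose first answer is wrong, `verify` is an assumed task-specific checker, and `Harness` is a minimal loop that optionally adds self-verification and retry.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


def stub_model(task: str, attempt: int) -> str:
    """Hypothetical frozen model: wrong on the first attempt, right afterwards."""
    return "5" if attempt == 0 else "4"


def verify(answer: str) -> bool:
    """Assumed task-specific verifier for the toy task 'What is 2 + 2?'."""
    return answer == str(2 + 2)


@dataclass
class Harness:
    """Minimal harness: calls the model, optionally verifies, and retries."""
    max_attempts: int = 1
    verifier: Optional[Callable[[str], bool]] = None
    trace: List[str] = field(default_factory=list)  # simple observability log

    def run(self, model: Callable[[str, int], str], task: str) -> str:
        answer = ""
        for attempt in range(self.max_attempts):
            answer = model(task, attempt)
            self.trace.append(f"attempt={attempt} answer={answer}")
            # Without a verifier the harness accepts the first answer as-is.
            if self.verifier is None or self.verifier(answer):
                break
        return answer


task = "What is 2 + 2?"
bare = Harness()                                    # no verification, one shot
checked = Harness(max_attempts=3, verifier=verify)  # verification plus retry

bare_answer = bare.run(stub_model, task)
checked_answer = checked.run(stub_model, task)
print(bare_answer, checked_answer)
```

The model is identical in both runs; only the harness differs. An evaluation that reports a single score without recording the harness configuration would attribute that difference to the model.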

deep dive

Code as Agent Harness: agent harness engineering as an eval problem

arXiv survey · source date 2026-05-18 · original

The paper's central move is to stop treating code as merely something an agent writes at the end of a task. In modern agent systems, code is also the medium through which the agent thinks, acts, remembers, observes, verifies, and coordinates. A coding agent does not just emit a patch; it navigates a repository, writes scripts, runs tests, calls tools, reads traces, updates state, negotiates permissions, and leaves artifacts that future agents or humans may reuse. That runtime layer is the harness.
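As a rough illustration of code serving as the medium for acting, observing, and remembering, the sketch below runs an agent-written snippet in a subprocess and keeps the result as a reusable record. The names `StepRecord` and `run_step` are hypothetical and do not come from the paper. A plain subprocess is also not a real sandbox; a production harness would add isolation and permission checks.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import List


@dataclass
class StepRecord:
    """One executed action plus what the harness observed from it."""
    code: str
    stdout: str
    stderr: str
    returncode: int


def run_step(code: str, log: List[StepRecord], timeout_s: float = 10.0) -> StepRecord:
    """Execute a snippet with the current interpreter and append the result to the log."""
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,   # collect stdout and stderr as the observation
        text=True,
        timeout=timeout_s,     # bound runaway snippets
    )
    record = StepRecord(code, proc.stdout, proc.stderr, proc.returncode)
    log.append(record)         # durable trace that later steps or humans can read
    return record


log: List[StepRecord] = []
ok = run_step("print(2 + 2)", log)
bad = run_step("raise SystemExit(3)", log)
print(ok.stdout.strip(), bad.returncode, len(log))
```

The log is the part that matters for evaluation: it captures not only the final output but each action and its observed effect, which is what a harness-level eval would inspect.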