Deep dive — Open-World Evaluations for Measuring Frontier AI Capabilities
## Why this matters
The paper's core claim is simple: for long-horizon LLM agents, the binding constraint is often not the base model alone. It is the execution harness around the model. A capable model can still fail if the harness gives it poor context, weak tools, no durable state, a brittle sandbox, inadequate observability, shallow verification, or unsafe permissions. Conversely, a fixed model can look substantially better when the harness improves planning, middleware, self-verification, environment setup, or failure recovery.
The paper's central move is to stop treating code as merely something an agent writes at the end of a task. In modern agent systems, code is also the medium through which the agent thinks, acts, remembers, observes, verifies, and coordinates. A coding agent does not just emit a patch; it navigates a repository, writes scripts, runs tests, calls tools, reads traces, updates state, negotiates permissions, and leaves artifacts that future agents or humans may reuse. That runtime layer is the harness.
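To make the harness idea concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation: `Harness`, `stub_model`, `verify_add`, and `CANDIDATES` are all hypothetical names, and the "model" is a fixed stub rather than a real LLM. The sketch shows how the same model can fail or succeed depending only on whether the harness verifies its output, records a trace, and allows a retry.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical candidate "patches" the stub model can propose.
CANDIDATES: Dict[str, Callable[[int, int], int]] = {
    "a - b": lambda a, b: a - b,  # buggy patch
    "a + b": lambda a, b: a + b,  # correct patch
}


def stub_model(task: str, history: List[str]) -> str:
    """Stand-in for a fixed model (assumption): it proposes the buggy patch
    first and the correct one only after seeing a failure in the trace."""
    saw_failure = any("fail" in line for line in history)
    return "a + b" if saw_failure else "a - b"


def verify_add(patch: str) -> bool:
    """Stand-in verifier (assumption): runs the patch on one known test case."""
    return CANDIDATES[patch](2, 3) == 5


@dataclass
class Harness:
    """Toy harness: verification, a durable trace, and bounded retries."""
    model: Callable[[str, List[str]], str]
    verify: Callable[[str], bool]
    max_attempts: int = 1
    trace: List[str] = field(default_factory=list)  # durable state and observability

    def run(self, task: str) -> bool:
        for attempt in range(1, self.max_attempts + 1):
            patch = self.model(task, self.trace)   # model acts on task plus trace
            ok = self.verify(patch)                # harness-side verification
            status = "pass" if ok else "fail"
            self.trace.append(f"attempt {attempt}: {patch} -> {status}")
            if ok:
                return True
        return False  # attempts exhausted without a verified result


if __name__ == "__main__":
    weak = Harness(model=stub_model, verify=verify_add, max_attempts=1)
    strong = Harness(model=stub_model, verify=verify_add, max_attempts=3)
    print(weak.run("implement add"))    # False: no chance to recover
    print(strong.run("implement add"))  # True: same model, recovers on retry
    print(strong.trace)
```

The model is identical in both runs; only the retry budget differs. The trace doubles as the feedback channel and as the artifact a human or later agent could inspect, which is the sense in which the runtime layer, not the model alone, determines the outcome.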