AI & Agent Evaluation

Deep dive

Longer analyses of selected AI eval research posts. The main feed stays short; this category is for articles worth unpacking in more detail.

$ evals.deep_dive --selected
deep dives: 4
mode: longform
deep dive

Agent Harness Engineering: A Survey

OpenReview survey · source date 2026-05-14 · original

The paper's core claim is simple: for long-horizon LLM agents, the binding constraint is often not the base model alone. It is the execution harness around the model. A capable model can still fail if the harness gives it poor context, weak tools, no durable state, a brittle sandbox, inadequate observability, shallow verification, or unsafe permissions. Conversely, a fixed model can look substantially better when the harness improves planning, middleware, self-verification, environment setup, or failure recovery.
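
To make the "fixed model, better harness" point concrete, here is a minimal sketch. It is not taken from the paper. `fixed_model`, `verify`, `bare_harness`, and `verifying_harness` are hypothetical names, and the "model" is a toy stub that answers wrongly until it sees feedback. The only thing that changes between the two runs is the harness.

```python
from typing import Dict, List

# Hypothetical toy task set with known answers (an assumption for illustration).
EXPECTED: Dict[str, str] = {"2+2": "4"}


def fixed_model(task: str, feedback: List[str]) -> str:
    """Stand-in for a fixed model: wrong on the first try, right once it sees feedback."""
    return "4" if feedback else "5"


def verify(task: str, answer: str) -> bool:
    """Toy verifier: compares the answer with the known expected value."""
    return EXPECTED.get(task) == answer


def bare_harness(task: str) -> str:
    """One shot, no verification, no recovery."""
    return fixed_model(task, [])


def verifying_harness(task: str, max_attempts: int = 3) -> str:
    """Same model, but the harness verifies each answer and retries with feedback."""
    feedback: List[str] = []
    answer = ""
    for _ in range(max_attempts):
        answer = fixed_model(task, feedback)
        if verify(task, answer):
            return answer
        feedback.append(f"answer {answer!r} failed verification")
    return answer  # give up after max_attempts and return the last answer


if __name__ == "__main__":
    print("bare:", bare_harness("2+2"))
    print("verifying:", verifying_harness("2+2"))
```

In an eval, this is why a score needs to be reported against the model and harness pair rather than the model alone: the stub's weights never change, yet only the second harness produces an answer that passes verification.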

deep dive

Code as Agent Harness: agent harness engineering as an eval problem

arXiv survey · source date 2026-05-18 · original

The paper's central move is to stop treating code as merely something an agent writes at the end of a task. In modern agent systems, code is also the medium through which the agent thinks, acts, remembers, observes, verifies, and coordinates. A coding agent does not just emit a patch; it navigates a repository, writes scripts, runs tests, calls tools, reads traces, updates state, negotiates permissions, and leaves artifacts that future agents or humans may reuse. That runtime layer is the harness.
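
A rough sketch of what "the runtime layer is the harness" can look like in code. This is an illustration under assumptions, not the survey's design: the `Harness` class, its fields, and the tool names `run_tests` and `delete_repo` are all hypothetical. It shows three of the roles named above in one place: tool calls, permission checks, and a trace plus state that later steps can read.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Set


@dataclass
class Harness:
    """Hypothetical minimal harness: tool registry, allowlist, durable state, trace."""
    tools: Dict[str, Callable[..., Any]]
    allowed: Set[str]
    state: Dict[str, Any] = field(default_factory=dict)
    trace: List[Dict[str, Any]] = field(default_factory=list)

    def call(self, tool: str, **kwargs: Any) -> Any:
        # Permission check happens in the harness, not in the model.
        if tool not in self.allowed:
            self.trace.append({"tool": tool, "args": kwargs, "status": "denied"})
            raise PermissionError(f"tool {tool!r} is not permitted")
        result = self.tools[tool](**kwargs)
        # Record the call for observability and keep the result as state.
        self.trace.append({"tool": tool, "args": kwargs, "status": "ok", "result": result})
        self.state[f"last_{tool}"] = result
        return result


if __name__ == "__main__":
    harness = Harness(
        tools={"run_tests": lambda: "3 passed", "delete_repo": lambda: "gone"},
        allowed={"run_tests"},
    )
    print(harness.call("run_tests"))
    try:
        harness.call("delete_repo")
    except PermissionError as err:
        print("blocked:", err)
    print([entry["status"] for entry in harness.trace])
```

Treating this layer as the eval target means the trace and state are the evidence: a grader can inspect which tools were called, which were denied, and what was left behind, independent of the final patch.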