
arXiv survey · source date 2026-05-18 · added 2026-05-31 17:04:35 · updated 2026-05-31 22:24:31

Code as Agent Harness: agent harness engineering as an eval problem

Source: arXiv, published 2026-05-18. Original: https://arxiv.org/abs/2605.18747

The paper's central move is to stop treating code as merely something an agent writes at the end of a task. In modern agent systems, code is also the medium through which the agent thinks, acts, remembers, observes, verifies, and coordinates. A coding agent does not just emit a patch; it navigates a repository, writes scripts, runs tests, calls tools, reads traces, updates state, negotiates permissions, and leaves artifacts that future agents or humans may reuse. That runtime layer is the harness.

This matters for AI evaluation because many benchmark reports still collapse the whole process into one final score. A final pass/fail label tells us whether the attempt appeared to work, but it does not tell us whether the harness exposed the right context, whether the tools were reliable, whether the verification oracle was adequate, whether the agent recovered from errors, or whether the run is replayable enough to debug. The paper is best read as a survey of why harness engineering has become an evaluation problem.

Taxonomy of code as agent harness. Source: original article.

1. The core thesis: code is becoming the harness

The survey defines an agent harness as the software layer around a language model: tools, APIs, memory, sandboxes, validators, permissions, execution loops, and feedback channels. Its sharper claim is that code is a natural substrate for this layer because code is executable, inspectable, and stateful.

Executable means that an agent's intermediate output can become an operation, not just a string. A generated script can query a repository, call an API, run a migration, test a hypothesis, or control a robot. Inspectable means the harness can read the intermediate artifact: source code, command logs, compiler errors, stack traces, unit-test failures, diffs, or coverage reports. Stateful means progress can live outside the model context window: in files, repositories, logs, databases, test suites, memory stores, and structured workspaces.
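As a minimal sketch of those three properties, the snippet below runs a generated script, captures what it produced, and appends the outcome to a log file in the workspace. The helper names `run_step` and `load_steps` and the file name `steps.jsonl` are assumptions made for illustration, not anything the survey defines, and a plain subprocess is not a sandbox.

```python
import json
import subprocess
import sys
from pathlib import Path


def run_step(script: str, workspace: Path) -> dict:
    """Run one generated script and append the outcome to a step log.

    `run_step`, `load_steps`, and `steps.jsonl` are hypothetical names.
    """
    # Executable: the generated text becomes an operation.
    proc = subprocess.run(
        [sys.executable, "-c", script],
        capture_output=True,
        text=True,
        cwd=workspace,
        timeout=30,
    )
    # Inspectable: exit code, stdout, and stderr are plain data.
    record = {
        "script": script,
        "returncode": proc.returncode,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
    }
    # Stateful: the record is appended to a file in the workspace,
    # so it survives outside any model context window.
    with (workspace / "steps.jsonl").open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return record


def load_steps(workspace: Path) -> list[dict]:
    """Read back every recorded step, oldest first."""
    path = workspace / "steps.jsonl"
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

A failing script is as informative as a passing one here: the non-zero return code and the traceback in `stderr` are exactly the intermediate artifacts the harness can read.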

That is the key evaluation shift. If code is the harness, then the object being evaluated is not only the model. It is the coupled system formed by model, prompt, memory, tools, execution environment, validators, permissions, logs, and human review.

2. Three layers of the survey

The paper organizes the field into three connected layers.

First, the harness interface asks how code connects the agent to the world. Code supports reasoning by externalizing intermediate computation into executable steps. It supports acting by turning model intent into commands, API calls, GUI actions, robot policies, or task-specific tools. It supports environment modeling by representing state through repositories, simulations, test fixtures, traces, and executable world models.

Second, harness mechanisms explain how long-horizon agents keep working after the first action. Planning decomposes intent into state transitions. Memory and context engineering decide what evidence remains active, what gets compacted, and what becomes durable state. Tool use turns the harness into a controlled action surface. Feedback-driven control loops decide whether the agent should continue, repair, escalate, or stop.

Third, scaling the harness moves from one agent to multi-agent systems. Managers, planners, coders, reviewers, testers, and critics need a shared substrate. The paper argues that code repositories, tests, execution traces, and shared state can become that substrate, but only if the harness can keep state consistent and make conflicts explicit.

Code as the harness interface. Source: original article.

3. Why final task success is not enough

A final success metric mixes together too many causes. A task may pass because the model was capable, because the scaffold gave away the solution, because the tests were weak, because the environment was forgiving, or because the grader checked only a narrow proxy. A task may fail because the model was weak, because context retrieval starved it of relevant files, because a sandbox blocked necessary commands, because the verifier was flaky, or because the harness lost state across turns.

The survey's eval implication is that an agent run should be treated like an auditable episode. The useful artifact is not just the answer; it is the sequence of plans, tool calls, file edits, state transitions, test results, error messages, retries, permission decisions, and final evidence. Without that episode package, failure attribution is mostly guesswork.
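One way to picture such an episode package is an append-only list of typed events that serializes to one JSON object per line. This is a sketch under assumptions: the `Event` and `Episode` classes and the event kind names are hypothetical and are not a format proposed by the paper.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Event:
    step: int
    kind: str    # e.g. "plan", "tool_call", "file_edit", "test_result", "retry"
    detail: str


@dataclass
class Episode:
    task_id: str
    events: list[Event] = field(default_factory=list)

    def record(self, kind: str, detail: str) -> Event:
        """Append an event; step numbers follow insertion order."""
        event = Event(step=len(self.events), kind=kind, detail=detail)
        self.events.append(event)
        return event

    def kinds(self) -> list[str]:
        """Event kinds in the order they happened."""
        return [e.kind for e in self.events]

    def to_jsonl(self) -> str:
        """One JSON object per line, so episodes can be stored and diffed."""
        return "\n".join(
            json.dumps({"task_id": self.task_id, **asdict(e)}) for e in self.events
        )


# Illustrative episode; the task and commands are made up.
episode = Episode(task_id="demo-1")
episode.record("plan", "fix off-by-one in pagination")
episode.record("tool_call", "pytest tests/test_pages.py")
episode.record("test_result", "1 failed")
episode.record("retry", "adjust slice end index")
episode.record("test_result", "all passed")
episode.record("final_evidence", "diff plus passing test log")
```

With the sequence on disk, a reviewer can see that the first test run failed and a retry followed, which a single final score would hide.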

For ai-eval.org, this connects directly to the difference between leaderboard evaluation and production evaluation. Leaderboards usually ask whether the task was solved. Production teams also need to know why, at what cost, under which assumptions, and whether the behavior will survive a harness or product change.

4. Plan–Execute–Verify as the control loop

The most eval-relevant mechanism in the paper is the Plan–Execute–Verify loop. The agent first externalizes an intended state transition. The harness then executes the action in a bounded environment. Finally, verification sensors inspect the resulting state and decide whether to accept, repair, retry, or escalate.

This is broader than ordinary debugging. Planning is contract formation: what should change, what should remain invariant, and what evidence will count as success. Execution is permissioned state transition: what the agent is allowed to touch, where it runs, and what side effects are reversible. Verification is sensing: unit tests, integration tests, linters, static analyzers, fuzzers, runtime monitors, CI, human review, and domain-specific validators.
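The loop can be sketched in a few lines if the environment is reduced to a dictionary. Everything here is an illustrative assumption: the `Plan` fields, the `plan_execute_verify` function, and the decision labels are hypothetical, and the sketch folds "retry" into "repair" by simply moving on to the next candidate plan.

```python
from dataclasses import dataclass
from typing import Callable

State = dict[str, int]


@dataclass
class Plan:
    description: str
    action: Callable[[State], State]            # proposed state transition
    success: Callable[[State], bool]            # evidence that counts as success
    invariant: Callable[[State, State], bool]   # what must stay true (before, after)


def plan_execute_verify(state: State, plans: list[Plan], max_attempts: int = 3):
    """Try candidate plans in order and return (decision, state, log)."""
    log: list[tuple[int, str, str]] = []
    for attempt, plan in enumerate(plans[:max_attempts], start=1):
        # Execute on a copy so a rejected transition leaves no side effects.
        candidate = plan.action(dict(state))
        # Verify: invariants first, then the success condition.
        if not plan.invariant(state, candidate):
            log.append((attempt, plan.description, "reject"))
            continue
        if plan.success(candidate):
            log.append((attempt, plan.description, "accept"))
            return "accept", candidate, log
        log.append((attempt, plan.description, "repair"))
    # No plan was accepted within the attempt budget: hand off to a human.
    return "escalate", state, log
```

Checking the invariant before the success condition is a deliberate choice in this sketch: a plan that reaches the goal by breaking something it promised to preserve is rejected rather than accepted.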

Harness control through the Plan–Execute–Verify loop. Source: original article.

The important point is that verification is not one tool. It is a stack. A unit test can catch one behavioral regression but miss a security issue. A linter can catch syntax or style but miss semantic wrongness. A browser-state check can confirm that a button exists but not that the user journey is safe. A human reviewer can interpret intent but may be slow, inconsistent, or unavailable. Good harness evaluation should therefore ask what the verifier stack can and cannot see.
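To make "what the verifier stack can and cannot see" concrete, each verifier can declare the properties it observes, and the harness can subtract those from the properties the task requires. The sketch below assumes hypothetical names throughout (`Verifier`, `run_stack`, the property labels) and uses two deliberately small checks built on Python's standard `ast` module.

```python
import ast
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verifier:
    name: str
    covers: frozenset[str]          # properties this verifier can observe
    check: Callable[[str], bool]    # True means the artifact passed


def parses(source: str) -> bool:
    """Syntax check only; says nothing about behavior."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


def avoids_eval(source: str) -> bool:
    """Fail on direct calls to eval(); unparseable source also fails."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False
    return not any(
        isinstance(n, ast.Call) and isinstance(n.func, ast.Name) and n.func.id == "eval"
        for n in ast.walk(tree)
    )


def run_stack(artifact: str, stack: list[Verifier], required: set[str]) -> dict:
    """Run every verifier and list required properties that no verifier covers."""
    results = {v.name: v.check(artifact) for v in stack}
    covered = set().union(*(v.covers for v in stack))
    return {
        "passed": all(results.values()),
        "results": results,
        "unchecked": sorted(required - covered),
    }


STACK = [
    Verifier("syntax", frozenset({"parses"}), parses),
    Verifier("no-eval", frozenset({"no_eval_call"}), avoids_eval),
]
REQUIRED = {"parses", "no_eval_call", "handles_empty_input"}
```

A patch such as `def head(xs): return xs[0]` passes both verifiers, yet the report still lists `handles_empty_input` as unchecked, which is the gap a pass/fail label would hide.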

5. Emphasis: Observability and Operations (O)

The Observability and Operations layer is where agent eval becomes practical engineering. An agent harness cannot be improved if it does not expose what happened. The paper repeatedly points to traces, logs, execution feedback, repository state, diagnostics, and monitors as the raw material for harness control.

Observability means the harness records enough information to reconstruct a run: prompts, retrieved context, tool calls, command outputs, edited files, diffs, tests run, failures observed, retries attempted, permission decisions, human interventions, and final artifacts. Operations means those signals are not passive logs. They drive debugging, incident triage, rollback, escalation, cost control, safety gates, and regression analysis.

For evaluation, this suggests several concrete metrics beyond pass rate:

Trace completeness: can a reviewer reconstruct the trajectory without asking the agent what happened?
Failure attribution: does the trace distinguish model error, retrieval error, tool failure, sandbox limitation, flaky test, weak oracle, or ambiguous task?
Operational cost: how many tokens, tool calls, commands, wall-clock minutes, retries, and human interventions were required?
State visibility: can the harness show the relevant repository state, memory state, environment state, and verifier state at each decision point?
Alertability: did the harness surface risky actions, policy violations, destructive edits, or repeated failures before they reached users?
Replayability: can the run be replayed or reduced to a minimal failing case?
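Three of these metrics can be computed directly from a recorded trace. The sketch below assumes a trace is a list of dictionaries; the field names (`step`, `kind`, `input`, `output`, `tokens`, `failed`, `cause`) and the sample values are hypothetical and chosen only to show the arithmetic.

```python
from collections import Counter

# Assumed minimum fields for an event to count as reconstructable.
REQUIRED_FIELDS = ("step", "kind", "input", "output")


def trace_completeness(trace: list[dict]) -> float:
    """Fraction of events that carry every required field."""
    if not trace:
        return 0.0
    complete = sum(all(f in e for f in REQUIRED_FIELDS) for e in trace)
    return complete / len(trace)


def operational_cost(trace: list[dict]) -> dict:
    """Totals for a few of the cost dimensions named above."""
    return {
        "tokens": sum(e.get("tokens", 0) for e in trace),
        "tool_calls": sum(e.get("kind") == "tool_call" for e in trace),
        "retries": sum(e.get("kind") == "retry" for e in trace),
        "human_interventions": sum(e.get("kind") == "human" for e in trace),
    }


def failure_attribution(trace: list[dict]) -> Counter:
    """Count failures per labeled cause; unlabeled failures stay visible."""
    return Counter(e.get("cause", "unattributed") for e in trace if e.get("failed"))


# Made-up trace: one flaky test, one unlabeled failure, one event missing its output.
TRACE = [
    {"step": 0, "kind": "plan", "input": "task", "output": "edit parser", "tokens": 400},
    {"step": 1, "kind": "tool_call", "input": "pytest", "output": "timeout",
     "tokens": 50, "failed": True, "cause": "flaky_test"},
    {"step": 2, "kind": "retry", "input": "pytest", "output": "1 failed",
     "tokens": 50, "failed": True},
    {"step": 3, "kind": "tool_call", "input": "pytest", "tokens": 50},
]
```

Keeping an explicit `unattributed` bucket matters: a large count there signals that the trace does not yet support failure attribution, regardless of the pass rate.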

This is the part of the survey that matters most for real product teams. If observability is weak, every eval result is a black box. A score moves, but no one knows whether the change came from the model, prompt, context pipeline, tool layer, environment, verifier, or product default. Strong observability turns evals into diagnosis.

6. Emphasis: Verification and Evaluation (V)

Verification and Evaluation is the second critical layer. The paper's strongest warning concerns oracle adequacy: executable feedback can create false confidence. Code can run, tests can pass, traces can look clean, and the agent can still be wrong because the oracle checked the wrong thing.

This matters especially for agent benchmarks. A benchmark often defines success through tests, scripts, or grader rules. But those rules are a proxy for the real task. If the proxy is narrow, an agent can overfit to it, bypass it, or satisfy it while violating user intent. If the proxy is too broad or subjective, results become noisy and hard to reproduce. The harness therefore needs explicit verification scope: what was checked, what was not checked, and how much confidence the evidence supports.

The survey implies a more useful reporting format for agent evals:

Outcome: did the task succeed under the benchmark's stated criteria?
Oracle scope: which properties were actually verified?
Oracle gaps: what important properties were not checked?
Evidence strength: were there deterministic checks, model judges, human review, or production-like state checks?
Regression coverage: would the suite catch common ways the harness or agent could get worse?
Safety coverage: were destructive actions, permission boundaries, and policy constraints tested?
Semantic coverage: did verification check user intent, not only executable behavior?
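The fields above map naturally onto a small record type. This is one possible shape, not a format the survey specifies: the `EvalReport` class, the evidence vocabulary, and the "pass (partial oracle)" label are all assumptions introduced for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical vocabulary for the "evidence strength" line of the report.
EVIDENCE_KINDS = {"deterministic", "model_judge", "human_review", "state_check"}


@dataclass
class EvalReport:
    task_id: str
    outcome: bool              # passed the benchmark's stated criteria
    oracle_scope: list[str]    # properties actually verified
    oracle_gaps: list[str]     # important properties not checked
    evidence: list[str] = field(default_factory=list)
    # Keys such as "regression", "safety", "semantic"; True means covered.
    coverage: dict[str, bool] = field(default_factory=dict)

    def __post_init__(self) -> None:
        unknown = set(self.evidence) - EVIDENCE_KINDS
        if unknown:
            raise ValueError(f"unknown evidence kinds: {sorted(unknown)}")

    def verdict(self) -> str:
        """Qualify a pass when the oracle left known gaps or coverage holes."""
        if not self.outcome:
            return "fail"
        holes = [name for name, ok in self.coverage.items() if not ok]
        return "pass (partial oracle)" if self.oracle_gaps or holes else "pass"

    def render(self) -> str:
        """Plain-text report, one labeled line per field."""
        rows = [
            ("Task", self.task_id),
            ("Outcome", self.verdict()),
            ("Oracle scope", ", ".join(self.oracle_scope) or "none"),
            ("Oracle gaps", ", ".join(self.oracle_gaps) or "none"),
            ("Evidence strength", ", ".join(self.evidence) or "none"),
        ]
        rows += [
            (f"{name.capitalize()} coverage", "yes" if ok else "no")
            for name, ok in self.coverage.items()
        ]
        return "\n".join(f"{label}: {value}" for label, value in rows)
```

The design choice worth noting is that a pass is never reported bare when gaps are known; the qualifier travels with the outcome so it cannot be dropped when results are aggregated.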

For coding agents, that means passing tests should not be the whole story. The eval should also record whether tests were added or weakened, whether unrelated files changed, whether the fix generalizes, whether the agent preserved project conventions, whether security or performance regressed, and whether the final patch is maintainable. For browser or OS agents, it means checking final UI state, side effects, user data, and policy compliance. For scientific agents, it means checking whether executable analysis is valid, whether assumptions are recorded, and whether conclusions follow from the evidence.

7. Self-evolving harnesses raise the stakes

The survey also discusses adaptive harness optimization: agents that modify the harness itself, not just the task artifact. They may change prompts, retrieval, memory, tool selection, validators, workflows, or orchestration patterns. This is powerful but dangerous. A self-improving harness can also self-overfit: it can tune its own prompts, retrieval, or validators to the tasks it is scored on without improving behavior on anything else.

Agentic harness engineering for adaptive harness optimization. Source: original article.

The evaluation rule should be strict: harness changes need evidence-carrying commits. Each change should state what failure it is meant to fix, what behavior it predicts will improve, what invariants must remain true, and which held-out or regression tests guard against overfitting. Otherwise, harness evolution becomes prompt tinkering with no causal discipline.
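An evidence-carrying commit can be approximated as a record with the four required statements plus a gate that compares guard-test scores before and after the change. The names `HarnessChange` and `admit`, the score dictionaries, and the zero default tolerance are assumptions for this sketch; the survey does not prescribe a mechanism.

```python
from dataclasses import dataclass


@dataclass
class HarnessChange:
    summary: str
    target_failure: str          # what failure the change is meant to fix
    predicted_improvement: str   # what behavior it predicts will improve
    invariants: list[str]        # what must remain true
    guard_tests: list[str]       # held-out or regression suites guarding the change


def admit(
    change: HarnessChange,
    before: dict[str, float],
    after: dict[str, float],
    tolerance: float = 0.0,
) -> tuple[bool, list[str]]:
    """Return (accepted, reasons); any reason blocks the change."""
    reasons: list[str] = []
    if not change.target_failure.strip():
        reasons.append("missing target failure")
    if not change.predicted_improvement.strip():
        reasons.append("missing predicted improvement")
    if not change.invariants:
        reasons.append("no invariants stated")
    if not change.guard_tests:
        reasons.append("no guard tests named")
    for name in change.guard_tests:
        if name not in before or name not in after:
            reasons.append(f"no score for guard test {name}")
        elif after[name] < before[name] - tolerance:
            reasons.append(f"regression on {name}")
    return (not reasons, reasons)
```

Returning the list of reasons, rather than a bare boolean, keeps the rejection itself auditable: the record shows why a harness change was blocked.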

This is where Observability and Verification meet. Observability supplies the evidence for why a harness change was made. Verification checks whether the change helped without breaking something else. Operations supplies rollback, canaries, audits, and human approval when the change affects risky behavior.

8. Multi-agent systems need shared-state evaluation

The paper's multi-agent discussion is important because many agent systems now split work across role-specialized agents: manager, planner, coder, reviewer, tester, critic, security analyst, or domain expert. The hard part is not only making agents talk. It is keeping their shared state consistent.

Multi-agent orchestration over code. Source: original article.

A shared code harness gives agents something concrete to coordinate around: repository state, diffs, tests, logs, and execution results. But this creates new evaluation questions. Did agents work from the same assumptions? Did one agent overwrite another's useful change? Did the reviewer inspect the actual final diff or an outdated summary? Did the tester verify the intended behavior or only run a shallow check? Did the manager preserve the user's original requirement across the workflow?

Good multi-agent evals should therefore measure semantic merge quality, conflict recurrence, rollback frequency, reviewer catch rate, shared-state freshness, and human-intervention burden. A multi-agent system that produces more messages but weaker state consistency is not an improvement.
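Some of these measurements reduce to simple checks over shared state. The sketch below covers three of them under stated assumptions: freshness is approximated by comparing a SHA-256 digest of the final diff with the digest the reviewer recorded, catch rate assumes defects were seeded deliberately, and multi-agent file edits are treated only as candidate conflicts. All function names are hypothetical.

```python
import hashlib


def diff_digest(diff_text: str) -> str:
    """Stable fingerprint of a diff, used to tell versions apart."""
    return hashlib.sha256(diff_text.encode("utf-8")).hexdigest()


def review_is_fresh(final_diff: str, reviewed_digest: str) -> bool:
    """True only if the reviewer inspected exactly the diff that shipped."""
    return diff_digest(final_diff) == reviewed_digest


def reviewer_catch_rate(seeded: set[str], flagged: set[str]) -> float:
    """Share of deliberately seeded defects that the reviewer flagged."""
    if not seeded:
        return 0.0
    return len(seeded & flagged) / len(seeded)


def contested_files(edits: list[tuple[str, str]]) -> list[str]:
    """Paths edited by more than one agent, sorted; candidates for overwrite checks.

    Each edit is an (agent, path) pair.
    """
    owners: dict[str, set[str]] = {}
    for agent, path in edits:
        owners.setdefault(path, set()).add(agent)
    return sorted(path for path, agents in owners.items() if len(agents) > 1)
```

For example, a reviewer who approved an earlier draft of the diff fails the freshness check even if the approval message looks identical, which is the "outdated summary" failure described above.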

9. Application domains broaden the problem

The survey extends code-as-harness beyond coding assistants. GUI and OS agents use code-like interfaces, accessibility trees, scripts, and state checks. Scientific agents use notebooks, simulations, analysis scripts, lab protocols, and data pipelines. Embodied agents use policies, controllers, simulators, and robot skills. Personalization agents use preference state, feedback loops, and adaptive policies.

Code as harness across application domains. Source: original article.

The common pattern is that agent reliability depends on the quality of the external substrate. If the substrate is executable, inspectable, and stateful, the agent can be evaluated through its interactions with that substrate. If the substrate is opaque, changes without leaving a record, or is weakly verified, the eval becomes less trustworthy.

10. Main caveat

This paper is a survey and taxonomy, not an empirical benchmark result. It does not prove that code-as-harness systems outperform alternatives. Its value is conceptual: it names the infrastructure that increasingly determines agent behavior and gives a map for measuring it.

The caveat matters because harness language can become too broad. If every tool, file, log, memory, and protocol is called “the harness,” the term risks losing precision. The useful version is operational: identify the concrete components that change what the agent can observe, do, remember, verify, and recover from. Then measure those components.

Bottom line

The paper's best contribution to AI eval is the claim that harnesses must become explicit units of measurement. A serious agent eval should report more than final success. It should report observability quality, operational cost, verifier scope, oracle gaps, state consistency, recovery behavior, safety gating, replayability, and regression risk.

For production agent systems, Observability and Operations (O) and Verification and Evaluation (V) are not add-ons. They are the difference between a demo and an engineering system. Observability tells you what happened. Operations lets you respond safely. Verification tells you whether the result should be trusted. Evaluation tells you whether the whole model–harness–environment system is getting better or merely getting better at passing the current test.
