arXiv — Code as Agent Harness: Toward Executable, Verifiable, and Stateful Agent Systems

arXiv survey · source date 2026-05-18 · added 2026-05-31 17:04:35 · updated 2026-05-31 17:04:35

Modern LLM agents increasingly succeed or fail because of the runtime around the model: tools, code execution, memory, sandboxes, repositories, validators, permissions, traces, and feedback loops.
A single final-success score is too coarse for systems like this. It can hide whether the model reasoned well, the harness supplied useful context, tests were adequate, state stayed synchronized, or verification merely accepted a narrow proxy.
The paper argues that code is no longer only an output generated by agents. Code increasingly becomes the executable, inspectable, stateful substrate through which agents reason, act, observe, verify, and coordinate.

The survey organizes “code as agent harness” into three layers: harness interface, harness mechanisms, and harness scaling.
The harness interface connects agents to reasoning, action, and environment modeling through executable programs, scripts, DSLs, tests, traces, repositories, and tool APIs.
Harness mechanisms include planning, memory and context engineering, tool use, and Plan–Execute–Verify control loops that turn model intent into bounded state transitions.
Multi-agent harnesses use shared code artifacts, repositories, tests, logs, and execution feedback as the substrate for planners, coders, reviewers, testers, critics, and security agents.
The parts most relevant to evaluation are Observability and Operations: traces, sensors, logs, state snapshots, CI, monitors, coverage, resource limits, permission gates, and incident-style debugging.
Verification and Evaluation are treated as harness responsibilities: deterministic checks, semantic validators, oracle adequacy, replayability, regression control, and evidence packages matter as much as headline pass rates.
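
As a rough illustration of the Plan–Execute–Verify loop and the trace it should leave behind, here is a minimal Python sketch. It is not taken from the survey: the names `TraceEvent`, `RunResult`, and `plan_execute_verify`, the step budget, and the toy planner, executor, and verifier are all assumptions chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class TraceEvent:
    step: int
    phase: str   # "plan", "execute", or "verify"
    detail: str


@dataclass
class RunResult:
    success: bool
    steps_used: int
    trace: List[TraceEvent] = field(default_factory=list)


def plan_execute_verify(
    plan: Callable[[int, str], str],
    execute: Callable[[str], str],
    verify: Callable[[str], bool],
    max_steps: int = 3,
) -> RunResult:
    """Run a bounded loop; every phase appends one trace event."""
    trace: List[TraceEvent] = []
    feedback = ""
    for step in range(1, max_steps + 1):
        action = plan(step, feedback)
        trace.append(TraceEvent(step, "plan", action))
        observation = execute(action)
        trace.append(TraceEvent(step, "execute", observation))
        ok = verify(observation)
        trace.append(TraceEvent(step, "verify", "pass" if ok else "fail"))
        if ok:
            return RunResult(True, step, trace)
        feedback = observation  # feed the failure back into the next plan
    return RunResult(False, max_steps, trace)


# Hypothetical stand-ins for a model, a sandbox, and a checker.
def toy_plan(step: int, feedback: str) -> str:
    return f"attempt-{step}"


def toy_execute(action: str) -> str:
    return "ok" if action == "attempt-2" else "error"


def toy_verify(observation: str) -> bool:
    return observation == "ok"


result = plan_execute_verify(toy_plan, toy_execute, toy_verify)
print(result.success, result.steps_used, len(result.trace))
```

The step budget is what makes the state transitions bounded, and the trace is what lets a reviewer later reconstruct which attempt failed and why, independent of the final pass/fail flag.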

The paper reframes agent evaluation as evaluation of a model–harness–environment system, not a model response alone.
Good evals should measure harness-level observability: whether a run leaves enough evidence to reconstruct actions, diagnose failures, attribute causes, and compare versions.
Good evals should also measure verification strength: what the oracle actually checks, where executable feedback is incomplete, whether regressions are caught, and whether safety-critical actions are gated and audited.
The practical takeaway is to report not only success, but also trace quality, verifier coverage, recovery behavior, state consistency, human-intervention burden, replayability, and harness-change regressions.
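
One way to picture such a report is a per-run record that is aggregated into several rates instead of a single pass rate. The sketch below is only an assumption about how this could look: the `RunRecord` fields, the `summarize` function, and the metric definitions (for example, verifier coverage as checks run over checks defined) are hypothetical and are not the survey's own definitions.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RunRecord:
    success: bool             # headline task outcome
    replayable: bool          # could the run be reproduced from its trace?
    human_interventions: int  # times a person had to step in
    checks_run: int           # verifier checks actually executed
    checks_defined: int       # verifier checks that exist for the task


def summarize(runs: List[RunRecord]) -> Dict[str, float]:
    """Aggregate harness-level metrics alongside the success rate."""
    n = len(runs)
    if n == 0:
        raise ValueError("need at least one run")
    total_defined = sum(r.checks_defined for r in runs)
    coverage = (
        sum(r.checks_run for r in runs) / total_defined if total_defined else 0.0
    )
    return {
        "success_rate": sum(r.success for r in runs) / n,
        "replay_rate": sum(r.replayable for r in runs) / n,
        "mean_interventions": sum(r.human_interventions for r in runs) / n,
        "verifier_coverage": coverage,
    }


# Illustrative data only; these are not results from the paper.
runs = [
    RunRecord(True, True, 0, 4, 4),
    RunRecord(True, False, 1, 2, 4),
    RunRecord(False, True, 0, 4, 4),
    RunRecord(True, True, 1, 2, 4),
]
report = summarize(runs)
print(report)
```

In this toy data the success rate is 0.75, but the same report also shows that a quarter of runs cannot be replayed and that only three quarters of the defined checks actually ran, which is the kind of gap a headline number would hide.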
