OpenReview — Agent Harness Engineering: A Survey

OpenReview survey · source date 2026-05-14 · added 2026-05-31 17:14:10 · updated 2026-05-31 17:14:10

The paper argues that real-world LLM-agent reliability is often constrained less by the base model than by the execution harness around it: environment, tools, context, orchestration, observability, evaluation, and governance.
Prompt engineering and context engineering are no longer enough for production agents. Long-horizon agents need infrastructure that controls execution, preserves state, captures traces, validates outcomes, and limits unsafe actions.
Model-only leaderboards can be misleading because scores are properties of a model–harness pair. Tool access, context policy, sandbox setup, verifier strength, retry logic, and monitoring can all change measured capability.
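
One way to read the model–harness point above is that a score should be recorded against both halves of the pair rather than against the model alone. The sketch below is only an illustration: the `ScoreKey` type, the model and harness labels, and every number are hypothetical and do not come from the survey.

```python
from collections import defaultdict
from typing import NamedTuple


class ScoreKey(NamedTuple):
    """Hypothetical key: a score belongs to a (model, harness) pair."""
    model: str
    harness: str


# Illustrative numbers only, not results reported by the paper.
scores = {
    ScoreKey("model-a", "harness-basic"): 0.41,
    ScoreKey("model-a", "harness-with-verifier"): 0.58,
    ScoreKey("model-b", "harness-basic"): 0.47,
}


def spread_by_model(scores: dict[ScoreKey, float]) -> dict[str, float]:
    """Return, per model, the gap between its best and worst harness score."""
    by_model: dict[str, list[float]] = defaultdict(list)
    for key, value in scores.items():
        by_model[key.model].append(value)
    return {m: round(max(v) - min(v), 4) for m, v in by_model.items()}


print(spread_by_model(scores))
```

A non-zero spread for a single model is the kind of signal that a model-only leaderboard would hide, since the model is the same and only the harness differs.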

The survey proposes ETCLOVG, a seven-layer taxonomy: Execution environment, Tool interface, Context management, Lifecycle/orchestration, Observability/operations, Verification/evaluation, and Governance/security.
It frames agent engineering as a historical shift from prompt engineering to context engineering to harness engineering, where the main engineering object becomes the runtime control system.
The paper maps a large public corpus of agent-harness projects and uses production lessons from OpenAI, Anthropic, LangChain, and related systems to identify recurring design patterns.
Section 7, Observability and Operations, treats tracing, monitoring, cost tracking, reliability engineering, failure analysis, and production debugging as first-class harness concerns.
Section 8, Verification and Evaluation, organizes evaluation as a task-to-feedback lifecycle: grounding, readiness validation, controlled execution and trace capture, judgement and failure attribution, and continuous regression feedback.
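
The taxonomy and the evaluation lifecycle listed above can be written down as plain data, which makes the acronym and the stage order easy to check. This is a minimal sketch: the layer and stage names are taken from the summary, while the identifiers `HarnessLayer`, `EVAL_LIFECYCLE`, `acronym`, and `next_stage` are hypothetical names chosen for the example.

```python
from enum import Enum
from typing import Optional


class HarnessLayer(Enum):
    """The seven ETCLOVG layers, keyed by their acronym letter."""
    E = "Execution environment"
    T = "Tool interface"
    C = "Context management"
    L = "Lifecycle/orchestration"
    O = "Observability/operations"
    V = "Verification/evaluation"
    G = "Governance/security"


# Evaluation lifecycle stages as listed for Section 8, in order.
EVAL_LIFECYCLE = (
    "grounding",
    "readiness validation",
    "controlled execution and trace capture",
    "judgement and failure attribution",
    "continuous regression feedback",
)


def acronym() -> str:
    """Join the layer letters in definition order."""
    return "".join(layer.name for layer in HarnessLayer)


def next_stage(stage: str) -> Optional[str]:
    """Return the stage that follows `stage`, or None after the last one."""
    i = EVAL_LIFECYCLE.index(stage)
    return EVAL_LIFECYCLE[i + 1] if i + 1 < len(EVAL_LIFECYCLE) else None


print(acronym())
print(next_stage("grounding"))
```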

The paper is directly useful for AI evaluation because it treats the agent episode, not just the final answer, as the unit of evaluation.
Eval reports should disclose harness configuration: model, tools, context policy, memory, permissions, sandbox, budgets, timeout, verifier, trace schema, and evaluator version.
Observability turns evals into diagnosis: traces should explain whether failure came from the model, retrieval, tool interface, sandbox, orchestration, verifier, benchmark spec, or human handoff.
Verification and Evaluation turn traces into quality control: agent evals should measure final outcome, trajectory quality, failure attribution, evaluator reliability, regression risk, cost, latency, and deployment feedback.
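
The reporting points above could be captured in a small record format. The sketch below is an assumption-laden illustration, not a schema from the paper: the field lists follow the summary (harness configuration items and failure sources), but the class names, field types, the `summarize` helper, and all example values are hypothetical.

```python
from collections import Counter
from dataclasses import asdict, dataclass
from enum import Enum
from typing import Optional


class FailureSource(Enum):
    """Failure sources a trace should help distinguish (from the summary)."""
    MODEL = "model"
    RETRIEVAL = "retrieval"
    TOOL_INTERFACE = "tool interface"
    SANDBOX = "sandbox"
    ORCHESTRATION = "orchestration"
    VERIFIER = "verifier"
    BENCHMARK_SPEC = "benchmark spec"
    HUMAN_HANDOFF = "human handoff"


@dataclass(frozen=True)
class HarnessConfig:
    """Harness settings to disclose with an eval report (types are assumed)."""
    model: str
    tools: tuple[str, ...]
    context_policy: str
    memory: str
    permissions: str
    sandbox: str
    budgets: str
    timeout_s: float
    verifier: str
    trace_schema: str
    evaluator_version: str


@dataclass
class Episode:
    """One agent episode: outcome, cost, latency, and attributed failure."""
    task_id: str
    passed: bool
    cost_usd: float
    latency_s: float
    failure_source: Optional[FailureSource] = None


def summarize(config: HarnessConfig, episodes: list[Episode]) -> dict:
    """Aggregate episodes into a report that carries the harness config."""
    if not episodes:
        raise ValueError("need at least one episode")
    n = len(episodes)
    failures = Counter(
        e.failure_source.value for e in episodes if e.failure_source is not None
    )
    return {
        "harness": asdict(config),
        "pass_rate": sum(e.passed for e in episodes) / n,
        "mean_cost_usd": sum(e.cost_usd for e in episodes) / n,
        "mean_latency_s": sum(e.latency_s for e in episodes) / n,
        "failures_by_source": dict(failures),
    }


# Hypothetical values, for illustration only.
config = HarnessConfig(
    model="model-a", tools=("shell", "browser"), context_policy="sliding-window",
    memory="none", permissions="read-only", sandbox="container",
    budgets="50 tool calls", timeout_s=600.0, verifier="unit-tests",
    trace_schema="v1", evaluator_version="0.1",
)
episodes = [
    Episode("t1", True, 0.5, 30.0),
    Episode("t2", False, 0.25, 10.0, FailureSource.SANDBOX),
    Episode("t3", False, 0.75, 20.0, FailureSource.SANDBOX),
    Episode("t4", True, 0.5, 20.0),
]
report = summarize(config, episodes)
print(report["pass_rate"], report["failures_by_source"])
```

Keeping the harness configuration inside the report means two runs can be compared only after checking that they were measured under the same settings. Trajectory quality, evaluator reliability, and regression risk are left out of this sketch because the summary does not say how they would be scored.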
