AI & Agent Evaluation

OpenReview — Agent Harness Engineering: A Survey

OpenReview survey · source date 2026-05-14 · added 2026-05-31 17:14:10 · updated 2026-05-31 17:14:10

Problems / challenges / motivations

  • The paper argues that real-world LLM-agent reliability is often constrained less by the base model than by the execution harness around it: environment, tools, context, orchestration, observability, evaluation, and governance.
  • Prompt engineering and context engineering are no longer enough for production agents. Long-horizon agents need infrastructure that controls execution, preserves state, captures traces, validates outcomes, and limits unsafe actions.
  • Model-only leaderboards can be misleading because scores are properties of a model–harness pair. Tool access, context policy, sandbox setup, verifier strength, retry logic, and monitoring can all change measured capability.
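
To make the model–harness pairing concrete, here is a minimal sketch of a leaderboard keyed on (model, harness) rather than on the model alone. The model names, harness names, and scores are placeholders invented for illustration; they are not results from the survey.

```python
from collections import defaultdict

# Hypothetical result records: (model, harness, score).
# All names and numbers are placeholders, not data from the survey.
RESULTS = [
    ("model-a", "harness-minimal", 0.40),
    ("model-a", "harness-full", 0.60),
    ("model-b", "harness-minimal", 0.50),
]


def spread_by_model(results):
    """Return {model: (min_score, max_score)} across every harness it ran in."""
    scores = defaultdict(list)
    for model, _harness, score in results:
        scores[model].append(score)
    return {model: (min(s), max(s)) for model, s in scores.items()}


def rank_pairs(results):
    """Rank (model, harness) pairs, not bare models, from best to worst."""
    return sorted(results, key=lambda record: record[2], reverse=True)
```

With these placeholder records, a model-only ranking would depend entirely on which harness row is reported for `model-a`, which is the ambiguity the bullet above describes.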

Key ideas

  • The survey proposes ETCLOVG, a seven-layer taxonomy: Execution environment, Tool interface, Context management, Lifecycle/orchestration, Observability/operations, Verification/evaluation, and Governance/security.
  • It frames agent engineering as a historical shift from prompt engineering to context engineering to harness engineering, where the main engineering object becomes the runtime control system.
  • The paper maps a large public corpus of agent-harness projects and uses production lessons from OpenAI, Anthropic, LangChain, and related systems to identify recurring design patterns.
  • Section 7, Observability and Operations, treats tracing, monitoring, cost tracking, reliability engineering, failure analysis, and production debugging as first-class harness concerns.
  • Section 8, Verification and Evaluation, organizes evaluation as a task-to-feedback lifecycle: grounding, readiness validation, controlled execution and trace capture, judgement and failure attribution, and continuous regression feedback.
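
The taxonomy and the evaluation lifecycle listed above can be written down as plain data structures. The sketch below is one possible encoding, not code from the survey: the identifiers `HarnessLayer`, `LIFECYCLE_STAGES`, and `run_lifecycle` are hypothetical, and the stage handlers are assumed to be caller-supplied functions.

```python
from enum import Enum


class HarnessLayer(Enum):
    """The seven ETCLOVG layers, in the order the summary lists them."""
    EXECUTION_ENVIRONMENT = "E"
    TOOL_INTERFACE = "T"
    CONTEXT_MANAGEMENT = "C"
    LIFECYCLE_ORCHESTRATION = "L"
    OBSERVABILITY_OPERATIONS = "O"
    VERIFICATION_EVALUATION = "V"
    GOVERNANCE_SECURITY = "G"


# Evaluation lifecycle stages, in the order given for Section 8.
LIFECYCLE_STAGES = (
    "grounding",
    "readiness_validation",
    "controlled_execution_and_trace_capture",
    "judgement_and_failure_attribution",
    "continuous_regression_feedback",
)


def acronym():
    """Join the layer initials in definition order."""
    return "".join(layer.value for layer in HarnessLayer)


def run_lifecycle(task, handlers):
    """Run one handler per stage, in order, threading a state dict through.

    `handlers` maps stage name -> callable(state) -> state. A missing
    stage raises KeyError instead of being silently skipped.
    """
    state = {"task": task, "completed": []}
    for stage in LIFECYCLE_STAGES:
        state = handlers[stage](state)
        state["completed"].append(stage)
    return state
```

Raising on a missing handler is a deliberate choice in this sketch: skipping a stage such as readiness validation would change what the final score means.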

Why it matters for evals

  • The paper is directly useful for AI evaluation because it argues that the unit of evaluation should be the full agent episode, not just the final answer.
  • Eval reports should disclose harness configuration: model, tools, context policy, memory, permissions, sandbox, budgets, timeout, verifier, trace schema, and evaluator version.
  • Observability turns evals into diagnosis: traces should explain whether failure came from the model, retrieval, tool interface, sandbox, orchestration, verifier, benchmark spec, or human handoff.
  • Verification and Evaluation turn traces into quality control: agent evals should measure final outcome, trajectory quality, failure attribution, evaluator reliability, regression risk, cost, latency, and deployment feedback.
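
The disclosure and attribution points above suggest what an eval report record might contain. The following is a minimal sketch under stated assumptions: the field names mirror the lists in the bullets, but the types, the USD and seconds units, and the `summarize` helper are hypothetical choices, not a schema defined by the survey.

```python
from collections import Counter
from dataclasses import asdict, dataclass
from enum import Enum
from typing import Optional


class FailureSource(Enum):
    """Failure origins named in the observability bullet."""
    MODEL = "model"
    RETRIEVAL = "retrieval"
    TOOL_INTERFACE = "tool_interface"
    SANDBOX = "sandbox"
    ORCHESTRATION = "orchestration"
    VERIFIER = "verifier"
    BENCHMARK_SPEC = "benchmark_spec"
    HUMAN_HANDOFF = "human_handoff"


@dataclass(frozen=True)
class HarnessConfig:
    """Harness settings to disclose alongside every score."""
    model: str
    tools: tuple
    context_policy: str
    memory: str
    permissions: tuple
    sandbox: str
    budget_usd: float  # assumption: budget expressed in USD
    timeout_s: int     # assumption: timeout expressed in seconds
    verifier: str
    trace_schema: str
    evaluator_version: str


@dataclass
class EpisodeRecord:
    """One agent episode: outcome, cost, latency, and attributed failure."""
    task_id: str
    passed: bool
    cost_usd: float
    latency_s: float
    failure_source: Optional[FailureSource] = None


def summarize(config, episodes):
    """Aggregate episodes into a report that carries its harness config."""
    n = len(episodes)
    failures = Counter(
        e.failure_source.value
        for e in episodes
        if not e.passed and e.failure_source is not None
    )
    return {
        "harness": asdict(config),
        "episodes": n,
        "pass_rate": sum(e.passed for e in episodes) / n if n else 0.0,
        "mean_cost_usd": sum(e.cost_usd for e in episodes) / n if n else 0.0,
        "mean_latency_s": sum(e.latency_s for e in episodes) / n if n else 0.0,
        "failures_by_source": dict(failures),
    }
```

Embedding the full `HarnessConfig` in the report keeps each score tied to the harness that produced it, and the per-source failure counts separate model errors from sandbox, verifier, or benchmark problems. Trajectory quality and evaluator reliability are not modeled here and would need their own fields.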
