AI & Agent Evaluation

arXiv — Meta-Harness: End-to-End Optimization of Model Harnesses

arXiv paper · source date 2026-03-30 · added 2026-05-28 16:00:27 · updated 2026-05-30 17:20:13

Problems / challenges / motivations

  • Meta-Harness starts from a harness-engineering problem: the same frozen model can perform very differently depending on surrounding code for retrieval, memory, prompt construction, tool loops, and completion logic.
  • Existing text optimizers often compress experience into scalar scores, short summaries, fixed mutation rules, or recent-window histories, and that compression is too lossy for stateful harnesses.
  • Manual harness tuning is slow because engineers must inspect failures, infer causal changes, and iterate across code, prompts, traces, and metrics.

Figure: Meta-Harness method overview. Source: original article.

Key ideas

  • Meta-Harness runs a propose → evaluate → log loop where a coding-agent proposer reads prior candidates, source code, scores, prompts, tool calls, model outputs, state updates, and traces, then writes new harness code.
  • Full diagnostic experience is the critical ingredient. In one text-classification ablation, scores-only and scores-plus-summary variants performed much worse than full trace access.
  • The search evolves by causal diagnosis rather than random mutation; on TerminalBench-2, the proposer isolated harmful prompt-template edits and pivoted toward a safer environment-bootstrap snapshot.
  • The required eval data includes a hard search set, held-out tests where possible, valid baseline harnesses, machine-readable logs, source snapshots, per-task traces, validation tests, and diffable run history.
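
The propose → evaluate → log loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `RunRecord`, `search`, `best`, and the toy proposer and evaluator are all hypothetical names, and a real proposer would be a coding agent writing harness code rather than a function that bumps a counter. The point it tries to show is that the proposer receives the full logged history (source, scores, and traces), not only the latest score.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class RunRecord:
    """One logged candidate (hypothetical schema, not from the paper)."""
    candidate_id: int
    source: str        # snapshot of the harness source code
    score: float       # aggregate score on the search set
    traces: List[str]  # per-task traces: prompts, tool calls, outputs


# Assumed interfaces: the proposer maps the full history to new harness
# source; the evaluator maps harness source to (score, traces).
Proposer = Callable[[List[RunRecord]], str]
Evaluator = Callable[[str], Tuple[float, List[str]]]


def search(proposer: Proposer, evaluator: Evaluator,
           baseline_source: str, iterations: int) -> List[RunRecord]:
    """Run propose -> evaluate -> log, starting from a baseline harness."""
    history: List[RunRecord] = []
    score, traces = evaluator(baseline_source)
    history.append(RunRecord(0, baseline_source, score, traces))
    for i in range(1, iterations + 1):
        source = proposer(history)          # propose: reads everything logged so far
        score, traces = evaluator(source)   # evaluate: run the candidate
        history.append(RunRecord(i, source, score, traces))  # log
    return history


def best(history: List[RunRecord]) -> RunRecord:
    """Return the highest-scoring record (earliest one on ties)."""
    return max(history, key=lambda r: r.score)


# Toy stand-ins so the sketch runs end to end. The "harness" is just a
# string like "retries=2"; more retries help up to a cap of 3.
def toy_evaluator(source: str) -> Tuple[float, List[str]]:
    retries = int(source.split("=")[1])
    score = min(retries, 3) / 3
    return score, [f"task-0: ran with retries={retries}"]


def toy_proposer(history: List[RunRecord]) -> str:
    last_retries = int(history[-1].source.split("=")[1])
    return f"retries={last_retries + 1}"


history = search(toy_proposer, toy_evaluator, "retries=0", iterations=4)
winner = best(history)
print(winner.candidate_id, winner.source, winner.score)
```

Keeping every `RunRecord` in an append-only list keeps the run history diffable, and the search can still be replayed or audited after the fact.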

Why it matters for evals

  • Eval results are partly measurements of the harness, not only the base model.
  • If an eval saves only pass/fail scores, it cannot distinguish model failure, context starvation, tool-loop waste, prompt pathology, or environment issues.
  • The paper reframes eval suites as optimization surfaces: strong harnesses generate diagnostic data that can drive systematic improvement while guarding against leakage and overfitting.
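
To make the contrast with pass/fail-only logging concrete, here is a small sketch of a per-task eval record that keeps diagnostic fields alongside the outcome. The schema, the `FailureCategory` labels, and the helper names are assumptions for illustration; the category names simply mirror the failure modes listed above, and in practice the label would come from a human or agent reading the trace, not from the record itself.

```python
import json
from collections import Counter
from dataclasses import dataclass, asdict
from enum import Enum
from typing import List, Optional


class FailureCategory(Enum):
    """Hypothetical labels mirroring the failure modes named in the text."""
    MODEL_FAILURE = "model_failure"
    CONTEXT_STARVATION = "context_starvation"
    TOOL_LOOP_WASTE = "tool_loop_waste"
    PROMPT_PATHOLOGY = "prompt_pathology"
    ENVIRONMENT_ISSUE = "environment_issue"


@dataclass
class TaskRecord:
    """One machine-readable log row per task (illustrative schema)."""
    task_id: str
    passed: bool
    prompt: str
    tool_calls: List[str]
    output: str
    failure_category: Optional[FailureCategory] = None  # assigned during triage


def to_jsonl(records: List[TaskRecord]) -> str:
    """Serialize records as JSON Lines, one task per line."""
    lines = []
    for r in records:
        row = asdict(r)
        # Store the enum as its string value so the log stays plain JSON.
        row["failure_category"] = (
            r.failure_category.value if r.failure_category else None
        )
        lines.append(json.dumps(row))
    return "\n".join(lines)


def failure_breakdown(records: List[TaskRecord]) -> Counter:
    """Count failed tasks per triaged category."""
    return Counter(
        r.failure_category.value
        for r in records
        if not r.passed and r.failure_category is not None
    )


records = [
    TaskRecord("t1", True, "list files", ["ls"], "ok"),
    TaskRecord("t2", False, "list files", ["ls"] * 40, "",
               FailureCategory.TOOL_LOOP_WASTE),
    TaskRecord("t3", False, "run tests", [], "bash: command not found",
               FailureCategory.ENVIRONMENT_ISSUE),
]

pass_rate = sum(r.passed for r in records) / len(records)
print(f"pass rate: {pass_rate:.2f}")
print(failure_breakdown(records))
print(to_jsonl(records))
```

The pass rate alone treats `t2` and `t3` as the same kind of failure. The logged tool calls and output are what let a reader, or a proposer agent, tell a wasted tool loop from a broken environment.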
