Meta-Harness starts from a harness-engineering problem: the same frozen model can perform very differently depending on the surrounding code for retrieval, memory, prompt construction, tool loops, and completion logic.
Existing text optimizers often compress experience into scalar scores, short summaries, fixed mutation rules, or recent-window histories, and that compression is too lossy for stateful harnesses.
Manual harness tuning is slow because engineers must inspect failures, infer causal changes, and iterate across code, prompts, traces, and metrics.
Meta-Harness method overview. Source: original article.
2
Key ideas
Meta-Harness runs a propose → evaluate → log loop in which a coding-agent proposer reads prior candidates (their source code, scores, prompts, tool calls, model outputs, state updates, and traces) and then writes new harness code.
Full diagnostic experience is the critical ingredient. In one text-classification ablation, scores-only and scores-plus-summary variants performed much worse than full trace access.
The search evolves by causal diagnosis rather than random mutation; on TerminalBench-2, the proposer isolated harmful prompt-template edits and pivoted toward a safer environment-bootstrap snapshot.
The required eval data includes a hard search set, held-out tests where possible, valid baseline harnesses, machine-readable logs, source snapshots, per-task traces, validation tests, and diffable run history.
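The loop and the logging requirements above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: `RunRecord`, `search`, `toy_evaluate`, and `toy_propose` are hypothetical names, the "harness" is just a string, and the proposer is a trivial function standing in for a coding agent. What it does show is the shape of the idea: every candidate is evaluated, its source snapshot and per-task traces are appended to a machine-readable log, and the proposer sees the full history rather than only a score.

```python
import json
import tempfile
from dataclasses import dataclass, asdict, field
from pathlib import Path


@dataclass
class RunRecord:
    """One evaluated candidate: source snapshot, score, per-task traces."""
    candidate_id: int
    source: str
    score: float
    traces: list = field(default_factory=list)


def search(propose, evaluate, log_path, baseline_source, steps):
    """Propose -> evaluate -> log for `steps` candidates; return the best one."""
    history = []
    source = baseline_source
    for candidate_id in range(steps):
        if history:
            # The proposer receives the full history, traces included.
            source = propose(history)
        score, traces = evaluate(source)
        record = RunRecord(candidate_id, source, score, traces)
        history.append(record)
        # Append-only JSONL keeps the run history diffable.
        with log_path.open("a") as f:
            f.write(json.dumps(asdict(record)) + "\n")
    return max(history, key=lambda r: r.score)


def toy_evaluate(source):
    """Hypothetical evaluator: a task passes if the harness text names its feature."""
    tasks = {"t1": "retrieval", "t2": "memory", "t3": "retry"}
    traces = [
        {"task": t, "passed": need in source,
         "missing": None if need in source else need}
        for t, need in tasks.items()
    ]
    score = sum(tr["passed"] for tr in traces) / len(traces)
    return score, traces


def toy_propose(history):
    """Hypothetical proposer: read the best run's traces, fix the first failure."""
    best = max(history, key=lambda r: r.score)
    for tr in best.traces:
        if not tr["passed"]:
            return best.source + " " + tr["missing"]
    return best.source


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        best = search(toy_propose, toy_evaluate,
                      Path(tmp) / "runs.jsonl", "retrieval", steps=3)
        print(best.candidate_id, best.score)
```

A scores-only proposer could not implement `toy_propose` at all, because the `missing` field lives in the trace. That is the toy analogue of the ablation result described above.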
3
Why it matters for evals
Eval results are partly measurements of the harness, not only the base model.
If an eval saves only pass/fail scores, it cannot distinguish model failure, context starvation, tool-loop waste, prompt pathology, or environment issues.
The paper reframes eval suites as optimization surfaces: strong harnesses generate diagnostic data that can drive systematic improvement while guarding against leakage and overfitting.
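To make the pass/fail limitation concrete, here is a small sketch of failure triage over per-task traces. Everything in it is an assumption for illustration: the trace fields (`env_error`, `tool_calls`, `context_had_answer`, `prompt_truncated`), the threshold, and the rule order are hypothetical and do not come from the paper. The point is only that these labels are computable when traces are saved and impossible to recover from a bare score.

```python
from collections import Counter


def triage(trace, max_tool_calls=20):
    """Assign a coarse failure label to one trace.

    Field names and the tool-call threshold are hypothetical.
    """
    if trace["passed"]:
        return "pass"
    if trace.get("env_error"):
        return "environment issue"
    if len(trace.get("tool_calls", [])) > max_tool_calls:
        return "tool-loop waste"
    if not trace.get("context_had_answer", True):
        return "context starvation"
    if trace.get("prompt_truncated"):
        return "prompt pathology"
    # Nothing in the harness looks wrong, so attribute it to the model.
    return "model failure"


def summarize(traces):
    """Count labels across a run so failures can be compared between harnesses."""
    return Counter(triage(t) for t in traces)


example_traces = [
    {"passed": True},
    {"passed": False, "env_error": "sandbox timeout"},
    {"passed": False, "tool_calls": ["ls"] * 25},
    {"passed": False, "context_had_answer": False},
    {"passed": False, "prompt_truncated": True},
    {"passed": False},
]

if __name__ == "__main__":
    for label, count in summarize(example_traces).items():
        print(f"{label}: {count}")
```

With only the scores, the six example runs collapse to "1 of 6 passed". With traces, five distinct failure causes are visible, and four of them point at the harness or environment rather than at the base model.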