Anthropic — Harness design for long-running application development

engineering blog · source date 2026-03-24 · added 2026-05-17 19:43:41 · updated 2026-05-30 17:20:13

Long-running coding and frontend-generation agents degrade as the context window fills: coherence drops and models develop “context anxiety.”
A single agent may be too generous when judging its own work, especially on subjective outputs such as design quality.
For long tasks, the surrounding harness can matter as much as the base model because it controls context, handoffs, critique, and iteration.

Anthropic discusses planner, generator, and evaluator structures for long-running application-development tasks.
Context resets plus structured handoff artifacts can outperform simple compaction because they preserve intent without carrying every stale token forward.
Separating generator and evaluator roles improves iteration by making critique more independent.
Subjective tasks require concrete rubrics that turn taste into gradable criteria.
Harness design can materially change frontier-agent performance even when the underlying model stays the same.
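
The context-reset idea above can be illustrated with a minimal sketch. It assumes a handoff artifact is a small structured record (goal, finished work, remaining work, key decisions) that seeds a fresh session in place of the old transcript. The `Handoff` class, its field names, and `reset_context` are hypothetical names chosen for illustration, not anything the article specifies.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Handoff:
    """Hypothetical handoff artifact: what a fresh session needs, nothing more."""
    goal: str
    done: list[str] = field(default_factory=list)
    todo: list[str] = field(default_factory=list)
    decisions: list[str] = field(default_factory=list)

    def to_prompt(self) -> str:
        # Serialize to JSON so the next session can read it unambiguously.
        return "Resume from this handoff:\n" + json.dumps(asdict(self), indent=2)


def reset_context(history: list[str], handoff: Handoff) -> list[str]:
    """Start a new session that carries only the handoff forward."""
    # `history` is intentionally discarded here; compaction would instead
    # summarize it and keep the summary in the same session.
    return [handoff.to_prompt()]


# Example data is made up for illustration.
old_history = [f"turn {i}: ..." for i in range(500)]
handoff = Handoff(
    goal="Build a kanban board web app",
    done=["project scaffold", "card CRUD API"],
    todo=["drag-and-drop", "persistence tests"],
    decisions=["use SQLite for storage"],
)
new_history = reset_context(old_history, handoff)
```

In this sketch the new session starts with a single message, so the intent, the remaining work, and the decisions survive while the 500 stale turns do not.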

The article shows why evals must measure harness architecture, not only model capability.
For long-horizon agent evals, traces, handoff artifacts, evaluator rubrics, and iteration loops are part of the measured system.
The reusable lesson is that better agent performance may come from better orchestration: clearer roles, cleaner context resets, stronger review, and rubrics tied to user-visible quality.
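
One way to picture the orchestration described here is a generator/evaluator loop scored against a weighted rubric, with a per-round trace kept as part of the measured system. The sketch below is an assumption-laden illustration: the criterion names, weights, threshold, and the stub generator and evaluator are all hypothetical stand-ins for model calls, not the article's actual setup.

```python
from dataclasses import dataclass
from typing import Callable

Scores = dict[str, float]


@dataclass(frozen=True)
class Criterion:
    name: str
    weight: float


# Hypothetical rubric: turns "design quality" into gradable criteria.
RUBRIC = [
    Criterion("visual hierarchy", 0.4),
    Criterion("originality", 0.3),
    Criterion("functionality", 0.3),
]


def weighted_score(scores: Scores, rubric: list[Criterion]) -> float:
    """Weighted mean of per-criterion scores, each assumed to lie in [0, 1]."""
    total = sum(c.weight for c in rubric)
    return sum(c.weight * scores[c.name] for c in rubric) / total


def iterate(
    generate: Callable[[str], str],
    evaluate: Callable[[str], Scores],
    rubric: list[Criterion],
    threshold: float = 0.8,
    max_rounds: int = 5,
) -> tuple[str, list[tuple[int, float]]]:
    """Alternate separate generator and evaluator roles until the rubric passes."""
    feedback, draft = "", ""
    trace: list[tuple[int, float]] = []
    for round_no in range(1, max_rounds + 1):
        draft = generate(feedback)
        scores = evaluate(draft)  # the evaluator never sees the generator's reasoning
        score = weighted_score(scores, rubric)
        trace.append((round_no, score))
        if score >= threshold:
            break
        # Send back only the weakest criterion as targeted critique.
        weakest = min(rubric, key=lambda c: scores[c.name])
        feedback = f"Improve {weakest.name}"
    return draft, trace


# Deterministic stubs standing in for model calls (assumptions for the demo).
def stub_generate(feedback: str) -> str:
    return "draft" + (f" [{feedback}]" if feedback else "")


def stub_evaluate(draft: str) -> Scores:
    improved = "[" in draft
    return {
        "visual hierarchy": 0.9,
        "originality": 0.9 if improved else 0.4,
        "functionality": 0.8,
    }


final_draft, trace = iterate(stub_generate, stub_evaluate, RUBRIC)
```

With these stubs the first draft falls short of the threshold because of its originality score, and the second round passes after the targeted feedback. The returned `trace` is the kind of artifact a long-horizon eval could record alongside the model name and harness configuration.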
