AI & Agent Evaluation

Anthropic — Harness design for long-running application development

engineering blog · source date 2026-03-24 · added 2026-05-17 19:43:41 · updated 2026-05-30 17:20:13

Problems / challenges / motivations

  • Long-running coding and frontend-generation agents degrade as the context window fills: coherence drops, and models can develop “context anxiety.”
  • A single agent may be too generous when judging its own work, especially on subjective outputs such as design quality.
  • For long tasks, the surrounding harness can matter as much as the base model because it controls context, handoffs, critique, and iteration.

Key ideas

  • Anthropic discusses planner, generator, and evaluator structures for long-running application-development tasks.
  • Context resets plus structured handoff artifacts can outperform simple compaction because they preserve intent without carrying every stale token forward.
  • Separating generator and evaluator roles improves iteration by making critique more independent.
  • Subjective tasks require concrete rubrics that turn taste into gradable criteria.
  • Harness design can materially change frontier-agent performance even when the underlying model stays the same.
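
The generator/evaluator split, the rubric, and the reset-with-handoff idea can be sketched together in a few lines. This is a minimal illustration under stated assumptions, not the harness described in the article: the `Handoff` fields, the rubric criteria and weights, the threshold, and the stub generator and evaluator are all hypothetical stand-ins for real model calls.

```python
from dataclasses import dataclass, field


@dataclass
class Handoff:
    """Structured artifact carried across a context reset (hypothetical schema)."""
    goal: str
    done: list[str] = field(default_factory=list)
    open_issues: list[str] = field(default_factory=list)


# Rubric: criterion name -> weight. Criteria and weights are illustrative only.
RUBRIC = {"layout_consistency": 0.4, "readable_contrast": 0.3, "working_navigation": 0.3}


def weighted_score(marks: dict[str, float], rubric: dict[str, float]) -> float:
    """Combine per-criterion marks in [0, 1] into one weighted score."""
    return sum(rubric[name] * marks.get(name, 0.0) for name in rubric)


def run_harness(goal, generate, evaluate, threshold=0.8, max_rounds=5):
    """Iterate generator -> evaluator until the rubric score clears the threshold."""
    handoff = Handoff(goal=goal)
    draft, score = None, 0.0
    for round_no in range(1, max_rounds + 1):
        # Context reset: the generator sees only the handoff, never the old transcript.
        draft = generate(handoff)
        # Separate role: the evaluator grades the draft against the rubric.
        marks = evaluate(draft)
        score = weighted_score(marks, RUBRIC)
        if score >= threshold:
            return draft, score, round_no
        # Record what happened and what is still failing for the next fresh context.
        handoff.done.append(f"round {round_no}: scored {score:.2f}")
        handoff.open_issues = [name for name in RUBRIC if marks.get(name, 0.0) < 1.0]
    return draft, score, max_rounds


def stub_generate(handoff: Handoff) -> set[str]:
    """Stand-in for a model call: a draft is the set of criteria it satisfies."""
    if not handoff.done:
        return {"layout_consistency"}
    satisfied = set(RUBRIC) - set(handoff.open_issues)
    # Fix one open issue per round, using only what the handoff says.
    return satisfied | set(handoff.open_issues[:1])


def stub_evaluate(draft: set[str]) -> dict[str, float]:
    """Stand-in for an evaluator model: full marks for each satisfied criterion."""
    return {name: 1.0 if name in draft else 0.0 for name in RUBRIC}
```

Calling `run_harness("build a landing page", stub_generate, stub_evaluate)` walks through three rounds with these stubs, because each round repairs one failing criterion. The point of the design is that the only state crossing a round boundary is the small `Handoff` object, so the intent survives while the stale transcript does not.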

Why it matters for evals

  • The article shows why evals must measure harness architecture, not only model capability.
  • For long-horizon agent evals, traces, handoff artifacts, evaluator rubrics, and iteration loops are part of the measured system.
  • The reusable lesson is that better agent performance may come from better orchestration: clearer roles, cleaner context resets, stronger review, and rubrics tied to user-visible quality.
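
One way to treat the harness as part of the measured system is to record it alongside the model in every eval result. The sketch below assumes a hypothetical `EvalRun` record and a hypothetical `harness_spread` helper; the model and harness labels are placeholders, and the scores are made-up values for illustration, not measurements from the article.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalRun:
    model: str    # placeholder model label
    harness: str  # placeholder harness label
    score: float  # rubric score in [0, 1]; made-up values in this sketch


def harness_spread(runs: list[EvalRun]) -> dict[str, float]:
    """For each model, return the max minus min score across harness variants."""
    by_model: dict[str, list[float]] = defaultdict(list)
    for run in runs:
        by_model[run.model].append(run.score)
    return {model: max(scores) - min(scores) for model, scores in by_model.items()}


# Illustrative records only: same model, two harness designs.
RUNS = [
    EvalRun(model="model-a", harness="single-agent", score=0.5),
    EvalRun(model="model-a", harness="planner-generator-evaluator", score=0.75),
]
```

A large spread for a fixed model would indicate that the harness, not the model, accounts for the difference, which is only visible if the harness label is stored with each run.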
