Anthropic — Quantifying infrastructure noise in agentic coding evals

engineering blog · source date 2026-02-05 · added 2026-05-17 19:43:41 · updated 2026-05-30 17:20:13

Agentic coding benchmarks are sensitive to infrastructure: CPU, RAM, timeouts, container limits, filesystem behavior, and sandbox configuration.
Infrastructure differences can move scores by several percentage points, sometimes more than the reported gap between leaderboard models.
Strict resource ceilings can create failures unrelated to model capability, while generous resources can change the task by enabling approaches that would fail elsewhere.
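
To make the comparison concrete, here is a minimal sketch of checking whether a leaderboard gap is larger than the spread that infrastructure alone produces. All numbers, configuration names, and function names are hypothetical and chosen for illustration; they are not taken from the article.

```python
def infra_spread(scores_by_config: dict[str, float]) -> float:
    """Max minus min score (in percentage points) for one model across configs."""
    values = list(scores_by_config.values())
    return max(values) - min(values)


def gap_is_within_noise(score_a: float, score_b: float, spread: float) -> bool:
    """True when the gap between two models is no larger than the infra spread."""
    return abs(score_a - score_b) <= spread


# Hypothetical pass rates (%) for the same model under three runtime setups.
same_model_runs = {"strict_limits": 52.0, "default": 55.5, "generous": 58.0}
spread = infra_spread(same_model_runs)

# Hypothetical leaderboard entries for two different models.
print(spread)
print(gap_is_within_noise(57.0, 55.0, spread))
```

A gap that falls inside the spread does not show that the models are equal; it only shows that the comparison cannot separate them unless both were run under the same configuration.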

Anthropic reports that Terminal-Bench 2.0 scores shifted by up to about 6 percentage points under different infrastructure configurations.
Resource limits should be treated as part of the benchmark definition, not as incidental deployment details.
Eval reports should document runtime settings, sandbox constraints, timeout policies, and infrastructure error rates alongside model and prompt settings.
Teams should distinguish model failures from infra failures through logs, retries, and explicit failure categories.
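
One way these recommendations could look in practice is a run report that stores the runtime settings next to the results and counts infrastructure failures separately from model failures. This is a sketch under stated assumptions: the field names, the log markers, and the `classify` heuristic are hypothetical, and a real harness would use its own error signals rather than substring matching.

```python
from collections import Counter
from dataclasses import asdict, dataclass
from enum import Enum


class Outcome(str, Enum):
    PASS = "pass"
    MODEL_FAILURE = "model_failure"
    INFRA_FAILURE = "infra_failure"


@dataclass(frozen=True)
class RuntimeConfig:
    # Hypothetical fields; record whatever the harness actually enforces.
    cpu_cores: float
    memory_gb: float
    task_timeout_s: int
    sandbox: str
    max_infra_retries: int


# Assumed log markers that indicate the sandbox failed, not the model.
INFRA_MARKERS = ("oomkilled", "container failed to start", "no space left on device")


def classify(tests_passed: bool, log: str) -> Outcome:
    """Assign an explicit failure category to one trial."""
    if tests_passed:
        return Outcome.PASS
    lowered = log.lower()
    if any(marker in lowered for marker in INFRA_MARKERS):
        return Outcome.INFRA_FAILURE
    return Outcome.MODEL_FAILURE


def summarize(config: RuntimeConfig, outcomes: list[Outcome]) -> dict:
    """Build a report that keeps runtime settings and infra error rate together."""
    if not outcomes:
        raise ValueError("no trials to summarize")
    counts = Counter(outcomes)
    total = len(outcomes)
    return {
        "runtime": asdict(config),
        "trials": total,
        "pass_rate": counts[Outcome.PASS] / total,
        "infra_error_rate": counts[Outcome.INFRA_FAILURE] / total,
    }


config = RuntimeConfig(cpu_cores=2.0, memory_gb=4.0, task_timeout_s=1800,
                       sandbox="docker", max_infra_retries=1)
outcomes = [
    classify(True, ""),
    classify(False, "Process OOMKilled by runtime"),
    classify(False, "AssertionError: expected 3, got 4"),
    classify(True, ""),
]
report = summarize(config, outcomes)
print(report)
```

Keeping the infrastructure error rate as its own field lets a reader see how much of a score difference could come from the sandbox before comparing models.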

The article matters for interpreting coding-agent leaderboards: if the runtime differs between two submissions, the agents are not taking the same test.
It pushes eval practice toward reproducibility and measurement hygiene, especially for tool-using agents.
The operational lesson is to standardize infrastructure where possible and report it clearly where standardization is impossible.
