Anthropic — Quantifying infrastructure noise in agentic coding evals
engineering blog · source date 2026-02-05 · added 2026-05-17 19:43:41 · updated 2026-05-30 17:20:13
1. Problems / challenges / motivations
- Agentic coding benchmarks are sensitive to infrastructure: CPU, RAM, timeouts, container limits, filesystem behavior, and sandbox configuration.
- Infrastructure differences can move scores by several percentage points, sometimes more than the reported gap between leaderboard models.
- Strict resource ceilings can create failures unrelated to model capability, while generous resources can change the task by enabling approaches that would fail elsewhere.
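To make the comparison in the second bullet concrete, here is a minimal Python sketch that sets the gap between two models against the spread one model shows across infrastructure configurations. The function names, configuration labels, and all numbers are hypothetical illustrations, not figures from the article.

```python
from typing import Dict


def infra_spread(pass_rates_pct: Dict[str, float]) -> float:
    """Max-minus-min pass rate (in percentage points) across infra configs."""
    rates = list(pass_rates_pct.values())
    return max(rates) - min(rates)


def gap_within_infra_noise(score_a_pct: float, score_b_pct: float,
                           spread_pp: float) -> bool:
    """True when the gap between two models is no larger than the infra spread."""
    return abs(score_a_pct - score_b_pct) <= spread_pp


# Hypothetical pass rates for one model under two sandbox configurations.
same_model_runs = {"strict_limits": 44.0, "generous_limits": 50.0}
spread = infra_spread(same_model_runs)

# Hypothetical leaderboard scores for two different models.
print(spread, gap_within_infra_noise(48.0, 51.0, spread))
```

When the function returns True, the leaderboard gap alone does not show which model is stronger unless both were run on matching infrastructure.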
2. Key ideas
- Anthropic reports that Terminal-Bench 2.0 scores shifted by up to about 6 percentage points under different infrastructure configurations.
- Resource limits should be treated as part of the benchmark definition, not as incidental deployment details.
- Eval reports should document runtime settings, sandbox constraints, timeout policies, and infrastructure error rates alongside model and prompt settings.
- Teams should distinguish model failures from infra failures through logs, retries, and explicit failure categories.
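The last bullet can be sketched as follows: a small Python example that assigns each attempt an explicit failure category, retries only when the failure looks infrastructure-related, and computes an infra error rate for reporting. The `Attempt` fields, the `INFRA_TAGS` values, and the retry policy are assumptions made for illustration; a real harness would supply its own signals, and whether a timeout or out-of-memory kill counts as an infra failure is a policy choice the benchmark has to state.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional, Tuple


class Outcome(Enum):
    PASS = "pass"
    MODEL_FAILURE = "model_failure"  # agent ran to completion but tests failed
    INFRA_FAILURE = "infra_failure"  # runtime or sandbox problem, not the agent


# Hypothetical tags a harness might attach to an attempt.
INFRA_TAGS = ("oom_killed", "container_start_failed", "sandbox_timeout")


@dataclass
class Attempt:
    tests_passed: bool
    error_tag: Optional[str] = None  # hypothetical field set by the harness


def classify(attempt: Attempt) -> Outcome:
    """Map one attempt to an explicit failure category."""
    if attempt.error_tag in INFRA_TAGS:
        return Outcome.INFRA_FAILURE
    return Outcome.PASS if attempt.tests_passed else Outcome.MODEL_FAILURE


def run_with_retries(run_task: Callable[[], Attempt],
                     max_retries: int = 2) -> Tuple[Outcome, int]:
    """Retry only on infra failures; return the final outcome and the
    number of infra failures observed along the way."""
    infra_failures = 0
    for _ in range(max_retries + 1):
        outcome = classify(run_task())
        if outcome is not Outcome.INFRA_FAILURE:
            return outcome, infra_failures
        infra_failures += 1
    return Outcome.INFRA_FAILURE, infra_failures


def infra_error_rate(outcomes: List[Outcome]) -> float:
    """Fraction of final outcomes that are infra failures."""
    if not outcomes:
        return 0.0
    return sum(o is Outcome.INFRA_FAILURE for o in outcomes) / len(outcomes)
```

Retrying only infra failures keeps model failures from being masked by extra attempts, and the returned count lets the report include how often the infrastructure itself failed.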
3. Why it matters for evals
- Coding-agent leaderboard scores are only comparable when the runtime matches: if the infrastructure differs, agents are not taking the same test.
- It pushes eval practice toward reproducibility and measurement hygiene, especially for tool-using agents.
- The operational lesson is to standardize infrastructure where possible and report it clearly where standardization is impossible.
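One way to act on the reporting advice is to attach a small manifest to every eval run and compare manifests before comparing scores. The Python sketch below is illustrative only: the `RunManifest` schema, its field names, and the example values are assumptions, not a format defined by the article or by any benchmark.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class RunManifest:
    # Illustrative schema; field names are hypothetical.
    model: str
    prompt_version: str
    cpu_cores: float
    memory_gb: float
    task_timeout_s: int
    sandbox: str
    infra_error_rate: float


# Fields that define the runtime; a mismatch means the runs took different tests.
RUNTIME_FIELDS = ("cpu_cores", "memory_gb", "task_timeout_s", "sandbox")


def runtime_differences(a: RunManifest, b: RunManifest) -> Dict[str, Tuple]:
    """Return {field: (a_value, b_value)} for every runtime field that differs."""
    return {
        name: (getattr(a, name), getattr(b, name))
        for name in RUNTIME_FIELDS
        if getattr(a, name) != getattr(b, name)
    }


def to_report_json(manifest: RunManifest) -> str:
    """Serialize the manifest so it can be published next to the score."""
    return json.dumps(asdict(manifest), indent=2, sort_keys=True)


# Hypothetical runs that differ only in memory limit.
run_a = RunManifest("model-a", "v1", 2.0, 4.0, 900, "docker", 0.01)
run_b = RunManifest("model-b", "v1", 2.0, 8.0, 900, "docker", 0.02)
print(runtime_differences(run_a, run_b))
```

An empty result from `runtime_differences` suggests the two scores were produced under the same declared runtime. A non-empty result lists exactly which settings to disclose alongside the comparison.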