AI Eval Reading Room — LLM and Agent Evaluation Research


Filtering by coding agents.

Anthropic — Dynamic workflows in Claude Code

Claude Code docs · source date 2026-06-02 · 0 comments · original

coding agents harnesses agent evals reliability

1. Problems / challenges / motivations
- Large coding-agent tasks often exceed what one linear chat can manage. Audits, migrations, and cross-checks need many independent passes, shared structure, and reproducible coordination.
- Static hand-written harnesses can become a bottleneck: the right decomposition depends on the repository, task, files, risks, and...

arXiv — Code as Agent Harness: Toward Executable, Verifiable, and Stateful Agent Systems

arXiv survey · source date 2026-05-18 · 0 comments · original

agent evals harnesses coding agents reliability deep dive

1. Problems / challenges / motivations
- Modern LLM agents increasingly succeed or fail because of the runtime around the model: tools, code execution, memory, sandboxes, repositories, validators, permissions, traces, and feedback loops.
- Final task success is too flat for this world. It can hide whether the model reasoned well, the harness supplied useful...

Anthropic — An update on recent Claude Code quality reports

engineering postmortem · source date 2026-04-23 · 0 comments · original

coding agents production monitoring

1. Problems / challenges / motivations
- Anthropic describes Claude Code quality regressions caused by product-layer changes rather than a simple base-model failure.
- Changes to reasoning effort, caching, and prompt instructions affected user experience in ways internal evals did not initially reproduce.
- This exposes a common production-eval gap: offline...

arXiv — Meta-Harness: End-to-End Optimization of Model Harnesses

arXiv paper · source date 2026-03-30 · 0 comments · original

harnesses agent evals coding agents

1. Problems / challenges / motivations
- Meta-Harness starts from a harness-engineering problem: the same frozen model can perform very differently depending on surrounding code for retrieval, memory, prompt construction, tool loops, and completion logic.
- Existing text optimizers often compress experience into scalar scores, short summaries, fixed...

Anthropic — Harness design for long-running application development

engineering blog · source date 2026-03-24 · 0 comments · original

coding agents harnesses reliability

1. Problems / challenges / motivations
- Long-running coding and frontend-generation agents degrade as context fills, coherence drops, and models develop “context anxiety.”
- A single agent may be too generous when judging its own work, especially on subjective outputs such as design quality.
- For long tasks, the surrounding harness can matter as much as...

OpenAI Developers — Run long horizon tasks with Codex

developer blog · source date 2026-02-23 · 1 comment · original

coding agents reliability agent evals

1. Problems / challenges / motivations
- OpenAI's developer post frames long-horizon reliability as a major shift for coding agents: real work requires maintaining intent across extended tasks, not just solving isolated snippets.
- Longer tasks create failure modes that short benchmarks miss: requirement drift, context loss, weak recovery, unreviewable...

Anthropic — Quantifying infrastructure noise in agentic coding evals

engineering blog · source date 2026-02-05 · 0 comments · original

coding agents benchmarks reliability

1. Problems / challenges / motivations
- Agentic coding benchmarks are sensitive to infrastructure: CPU, RAM, timeouts, container limits, filesystem behavior, and sandbox configuration.
- Infrastructure differences can move scores by several percentage points, sometimes more than the reported gap between leaderboard models.
- Strict resource ceilings can...

Vercel — AGENTS.md outperforms skills in our agent evals

engineering blog · source date 2026-01-27 · 0 comments · original

coding agents harnesses benchmarks

1. Problems / challenges / motivations
- Vercel wanted coding agents to use version-matched Next.js 16 documentation, but optional knowledge packages only help if the agent actually invokes them.
- A support system can look good in theory while failing at the trigger layer: the agent may not know when to load a skill, may load it too late, or may be...
