arXiv — AgentAtlas: Beyond Outcome Leaderboards for LLM Agents
arXiv paper · source date 2026-05-19 · added 2026-05-22 18:08:41 · updated 2026-05-30 17:20:13
1. Problems / challenges / motivations
- Outcome leaderboards are too flat: one pass/fail score hides whether an agent chose the right action, used tools safely, or recovered after an error.
- Agent benchmarks reward different behaviors (final success, tool-call validity, repeated-pass consistency, trajectory safety, or attack robustness), which makes cross-benchmark comparisons noisy.
- Prompt scaffolding can inflate scores if the model receives label menus, rubrics, or strong hints that would not exist in deployment.
2. Key ideas
- AgentAtlas defines six control states: Act, Ask, Refuse, Stop, Confirm, and Recover. These labels describe decision moments, not just final answers.
- Its failure taxonomy labels trajectories by primary error source and impact, distinguishing bad planning, tool misuse, context loss, unsafe action, and harmless detours.
- The paper tests scaffolding sensitivity: removing explicit label menus dropped trajectory accuracy by 14–40 percentage points in the demonstration.
- It audits 15 agent benchmarks against behavioral axes to show which suites miss clarification, refusal, recovery, or tool-context retention.
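The scaffolding-sensitivity idea can be illustrated with a minimal sketch. It assumes each decision moment carries one gold control state and that the same agent is scored twice, once with an explicit label menu in the prompt and once without. The function names and the toy labels are hypothetical and are not taken from the paper; only the six state names come from the summary above.

```python
from enum import Enum


class ControlState(Enum):
    """The six control states named in the summary."""
    ACT = "Act"
    ASK = "Ask"
    REFUSE = "Refuse"
    STOP = "Stop"
    CONFIRM = "Confirm"
    RECOVER = "Recover"


def accuracy(gold, predicted):
    """Fraction of decision moments where the predicted state matches gold."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equal length")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


def scaffolding_drop_pp(gold, with_menu, without_menu):
    """Accuracy lost, in percentage points, when the label menu is removed."""
    return 100.0 * (accuracy(gold, with_menu) - accuracy(gold, without_menu))


# Toy labels, invented for illustration only (not results from the paper).
S = ControlState
gold = [S.ACT, S.ASK, S.REFUSE, S.RECOVER, S.STOP]
with_menu = [S.ACT, S.ASK, S.REFUSE, S.RECOVER, S.ACT]   # 4 of 5 correct
without_menu = [S.ACT, S.ACT, S.REFUSE, S.RECOVER, S.ACT]  # 3 of 5 correct

drop = scaffolding_drop_pp(gold, with_menu, without_menu)
print(f"drop without label menu: {drop:.1f} pp")
```

Reporting the drop in percentage points, rather than as a relative change, keeps it directly comparable to the 14–40 point range quoted above.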
3. Why it matters for evals
- AgentAtlas moves evals from ranking to diagnosis: teams learn what behavior failed, not only which model won.
- It makes prompt scaffolding visible, which helps prevent benchmark reports from overstating deployable capability.
- The practical regression pattern is to track Ask, Refuse, Recover, trajectory diagnosis, and tool-context retention separately instead of relying on one aggregate pass rate.
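One way to picture the regression pattern is the sketch below. It assumes every eval case is tagged with a single behavioral axis and a pass/fail result. The helper names, the tolerance parameter, and the toy runs are hypothetical, chosen to show how two runs with the same aggregate pass rate can differ on one axis.

```python
from collections import defaultdict


def per_axis_pass_rates(results):
    """Map each axis to its pass rate, given (axis, passed) pairs."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for axis, passed in results:
        totals[axis] += 1
        passes[axis] += int(passed)
    return {axis: passes[axis] / totals[axis] for axis in totals}


def regressed_axes(baseline, candidate, tolerance=0.0):
    """Axes whose pass rate fell by more than `tolerance` versus baseline."""
    return sorted(
        axis
        for axis in baseline
        if axis in candidate and candidate[axis] < baseline[axis] - tolerance
    )


def aggregate_pass_rate(results):
    """Single pass rate over all cases, ignoring the axis tag."""
    results = list(results)
    return sum(int(passed) for _, passed in results) / len(results)


# Toy runs, invented for illustration: both pass 5 of 6 cases overall.
baseline_run = [
    ("Ask", True), ("Ask", False),
    ("Refuse", True), ("Refuse", True),
    ("Recover", True), ("Recover", True),
]
candidate_run = [
    ("Ask", True), ("Ask", True),
    ("Refuse", True), ("Refuse", True),
    ("Recover", True), ("Recover", False),
]

baseline = per_axis_pass_rates(baseline_run)
candidate = per_axis_pass_rates(candidate_run)
print("aggregate equal:",
      aggregate_pass_rate(baseline_run) == aggregate_pass_rate(candidate_run))
print("regressed axes:", regressed_axes(baseline, candidate))
```

In this toy case the aggregate is unchanged while Recover drops from 2 of 2 to 1 of 2, which is the kind of shift a single pass rate would hide.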