arXiv — AgentAtlas: Beyond Outcome Leaderboards for LLM Agents

arXiv paper · source date 2026-05-19 · added 2026-05-22 18:08:41 · updated 2026-05-30 17:20:13

Outcome leaderboards are too flat: one pass/fail score hides whether an agent chose the right action, used tools safely, or recovered after an error.
Agent benchmarks reward different behaviors (final success, tool-call validity, repeated-pass consistency, trajectory safety, or attack robustness), which makes cross-benchmark comparisons noisy.
Prompt scaffolding can inflate scores if the model receives label menus, rubrics, or strong hints that would not exist in deployment.

AgentAtlas defines six control states: Act, Ask, Refuse, Stop, Confirm, and Recover. These labels describe decision moments, not just final answers.
Its failure taxonomy labels trajectories by primary error source and impact, distinguishing bad planning, tool misuse, context loss, unsafe action, and harmless detours.
The paper tests scaffolding sensitivity: removing explicit label menus dropped trajectory accuracy by 14–40 percentage points in the demonstration.
It audits 15 agent benchmarks against behavioral axes to show which suites miss clarification, refusal, recovery, or tool-context retention.
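
As a rough illustration of the control-state labels and the scaffolding-sensitivity check described above, the sketch below encodes the six states as an enum and measures how far label accuracy falls when the label menu is removed. The names `ControlState`, `accuracy`, and `scaffolding_gap`, and the toy labels, are hypothetical; they are not taken from the paper and do not reproduce its results.

```python
from enum import Enum


class ControlState(Enum):
    """The six control states named in the summary."""
    ACT = "Act"
    ASK = "Ask"
    REFUSE = "Refuse"
    STOP = "Stop"
    CONFIRM = "Confirm"
    RECOVER = "Recover"


def accuracy(predicted, gold):
    """Fraction of decision moments where the predicted state matches gold."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("need at least one labeled decision moment")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


def scaffolding_gap(acc_with_menu, acc_without_menu):
    """Accuracy drop, in percentage points, when the label menu is removed."""
    return round((acc_with_menu - acc_without_menu) * 100, 1)


# Hypothetical toy labels (assumption): not data from the paper.
gold = [ControlState.ASK, ControlState.ACT,
        ControlState.REFUSE, ControlState.RECOVER]
with_menu = [ControlState.ASK, ControlState.ACT,
             ControlState.REFUSE, ControlState.RECOVER]
without_menu = [ControlState.ACT, ControlState.ACT,
                ControlState.REFUSE, ControlState.STOP]

gap = scaffolding_gap(accuracy(with_menu, gold), accuracy(without_menu, gold))
print(f"scaffolding gap: {gap} percentage points")
```

Reporting the gap alongside the scaffolded score, rather than the scaffolded score alone, is one way to keep the prompt's contribution visible.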

AgentAtlas moves evals from ranking to diagnosis: teams learn what behavior failed, not only which model won.
It makes prompt scaffolding visible, which helps prevent benchmark reports from overstating deployable capability.
The practical regression pattern is to track Ask, Refuse, Recover, trajectory diagnosis, and tool-context retention separately instead of relying on one aggregate pass rate.
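
The regression pattern above could be sketched as follows, assuming each eval case is tagged with one behavior axis and a pass/fail outcome. The axis keys, the `per_axis_pass_rates` and `regressions` helpers, the 0.05 tolerance, and the toy results are all illustrative assumptions, not part of AgentAtlas.

```python
from collections import defaultdict

# Behavior axes named in the summary; the string keys are illustrative.
AXES = ("ask", "refuse", "recover",
        "trajectory_diagnosis", "tool_context_retention")


def per_axis_pass_rates(results):
    """results: iterable of (axis, passed) pairs. Returns {axis: pass rate}."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for axis, passed in results:
        if axis not in AXES:
            raise ValueError(f"unknown axis: {axis}")
        totals[axis] += 1
        passes[axis] += int(passed)
    return {axis: passes[axis] / totals[axis] for axis in totals}


def regressions(baseline, candidate, tolerance=0.05):
    """Axes whose pass rate dropped by more than `tolerance` vs. baseline."""
    return sorted(
        axis for axis in baseline
        if axis in candidate and baseline[axis] - candidate[axis] > tolerance
    )


# Hypothetical toy runs (assumption): same aggregate pass rate, different behavior.
baseline_run = [("ask", False), ("ask", True),
                ("recover", True), ("recover", True)]
candidate_run = [("ask", True), ("ask", True),
                 ("recover", True), ("recover", False)]

base_rates = per_axis_pass_rates(baseline_run)
cand_rates = per_axis_pass_rates(candidate_run)

aggregate_base = sum(p for _, p in baseline_run) / len(baseline_run)
aggregate_cand = sum(p for _, p in candidate_run) / len(candidate_run)

print("aggregate:", aggregate_base, "->", aggregate_cand)
print("regressed axes:", regressions(base_rates, cand_rates))
```

In this toy case the aggregate pass rate is unchanged between runs while the Recover axis gets worse, which is the kind of shift a single pass rate would hide.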
