AI & Agent Evaluation

arXiv — AgentAtlas: Beyond Outcome Leaderboards for LLM Agents

arXiv paper · source date 2026-05-19 · added 2026-05-22 18:08:41 · updated 2026-05-30 17:20:13

Problems / challenges / motivations

  • Outcome leaderboards are too flat: one pass/fail score hides whether an agent chose the right action, used tools safely, or recovered after an error.
  • Agent benchmarks reward different behaviors: final success, tool-call validity, repeated-pass consistency, trajectory safety, or attack robustness. Because each suite measures something different, scores are hard to compare across benchmarks.
  • Prompt scaffolding can inflate scores if the model receives label menus, rubrics, or strong hints that would not exist in deployment.

Key ideas

  • AgentAtlas defines six control states: Act, Ask, Refuse, Stop, Confirm, and Recover. These labels describe decision moments, not just final answers.
  • Its failure taxonomy labels trajectories by primary error source and impact, distinguishing bad planning, tool misuse, context loss, unsafe action, and harmless detours.
  • The paper tests scaffolding sensitivity: removing explicit label menus from the prompt dropped trajectory accuracy by 14–40 percentage points in the paper's demonstration.
  • It audits 15 agent benchmarks against behavioral axes to show which suites miss clarification, refusal, recovery, or tool-context retention.
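
To make the control states and the scaffolding-sensitivity measurement concrete, here is a minimal Python sketch. The six state names come from the summary above; everything else (the `Decision` record, the function names, and the toy data) is hypothetical and invented for illustration. It scores per-decision label matches as a simple stand-in, which may differ from how the paper computes trajectory accuracy.

```python
from dataclasses import dataclass
from enum import Enum


class ControlState(Enum):
    """The six control states named in the summary."""
    ACT = "Act"
    ASK = "Ask"
    REFUSE = "Refuse"
    STOP = "Stop"
    CONFIRM = "Confirm"
    RECOVER = "Recover"


@dataclass(frozen=True)
class Decision:
    """One labeled decision moment (hypothetical record layout)."""
    expected: ControlState
    predicted: ControlState


def accuracy(decisions):
    """Fraction of decision moments where the predicted state matches."""
    if not decisions:
        raise ValueError("need at least one decision")
    hits = sum(d.expected == d.predicted for d in decisions)
    return hits / len(decisions)


def scaffolding_drop_pp(with_menu, without_menu):
    """Accuracy lost, in percentage points, when the label menu is removed."""
    return round(100 * (accuracy(with_menu) - accuracy(without_menu)), 1)


# Toy data, invented for illustration only (not results from the paper).
S = ControlState
with_menu = [Decision(S.ASK, S.ASK), Decision(S.REFUSE, S.REFUSE),
             Decision(S.RECOVER, S.RECOVER), Decision(S.ACT, S.ACT)]
without_menu = [Decision(S.ASK, S.ACT), Decision(S.REFUSE, S.REFUSE),
                Decision(S.RECOVER, S.ACT), Decision(S.ACT, S.ACT)]

print(scaffolding_drop_pp(with_menu, without_menu))
```

In this toy run the agent defaults to Act once the menu is gone, so the errors concentrate in the Ask and Recover moments. That is the kind of pattern a single pass/fail score would hide.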

Why it matters for evals

  • AgentAtlas moves evals from ranking to diagnosis: teams learn what behavior failed, not only which model won.
  • It makes prompt scaffolding visible, which helps prevent benchmark reports from overstating deployable capability.
  • The practical regression pattern is to track Ask, Refuse, Recover, trajectory diagnosis, and tool-context retention separately instead of relying on one aggregate pass rate.
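
One way the per-behavior regression pattern could look in practice is sketched below. The axis names follow the bullet above; the `(axis, passed)` record layout, the function names, the 0.05 tolerance, and the toy results are all assumptions made for illustration, not something the paper prescribes.

```python
from collections import defaultdict

# Behavior axes taken from the bullet above; the record layout is hypothetical.
AXES = ("ask", "refuse", "recover", "trajectory_diagnosis",
        "tool_context_retention")


def pass_rates(results):
    """Map each axis to its pass rate, given (axis, passed) pairs."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for axis, passed in results:
        if axis not in AXES:
            raise ValueError(f"unknown axis: {axis}")
        totals[axis] += 1
        passes[axis] += int(passed)
    return {axis: passes[axis] / totals[axis] for axis in totals}


def aggregate(results):
    """Single overall pass rate, the number a flat leaderboard would show."""
    return sum(int(passed) for _, passed in results) / len(results)


def regressions(baseline, candidate, tolerance=0.05):
    """Axes whose pass rate fell by more than `tolerance` versus baseline."""
    return sorted(
        axis for axis, base in baseline.items()
        if axis in candidate and base - candidate[axis] > tolerance
    )


# Toy results, invented for illustration only.
baseline = [("ask", True), ("ask", True), ("refuse", True),
            ("refuse", False), ("recover", True), ("recover", True)]
candidate = [("ask", True), ("ask", True), ("refuse", True),
             ("refuse", True), ("recover", True), ("recover", False)]

print(aggregate(baseline) == aggregate(candidate))
print(regressions(pass_rates(baseline), pass_rates(candidate)))
```

Here the aggregate pass rate is identical for both runs, yet the per-axis comparison flags `recover` as a regression: the gain on refusal masks the loss on recovery.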
