preprint · source date 2026-05-31
1. Problems / challenges / motivations
- As LLMs move from task-specific systems toward open-ended agents, one scalar score is often too opaque. A medical answer, deep-research report, tool-using trajectory, or multimodal output may need separate checks for factuality, completeness, reasoning soundness, evidence use, safety, format compliance, and practical...
arXiv paper · source date 2026-05-19
1. Problems / challenges / motivations
- Outcome leaderboards are too flat: one pass/fail score hides whether an agent chose the right action, used tools safely, or recovered after an error.
- Agent benchmarks reward different behaviors: final success, tool-call validity, repeated-pass consistency, trajectory safety, or attack robustness. That makes...
arXiv paper · source date 2026-05-19
1. Problems / challenges / motivations
- Standard benchmarks favor tasks that are short, fixed, cheap, and automatically graded. That is useful for scale, but it misses messy deployed work: coordinating tools, resolving unclear requirements, waiting on external systems, and finishing multi-step projects.
- Benchmarks can overstate and understate capability....
industry blog · source date 2026-05-07
1. Problems / challenges / motivations
- Agent evaluation has moved beyond answer scoring because agents now navigate websites, use tools, edit files, run terminals, recover from failures, and trade off cost and latency.
- Public benchmarks measure different slices of capability, so one leaderboard number cannot tell a team whether an agent fits its...
research blog + paper · source date 2026-03-31
1. Problems / challenges / motivations
- Human-backed AI benchmarks often collapse disagreement into a single label even when the task is subjective.
- Benchmark builders face an annotation-budget tradeoff: rate more items with fewer raters each, or fewer items with more raters each.
- Too few raters can make model comparisons fragile, especially for...
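The annotation-budget tradeoff above can be made concrete with a little arithmetic. The sketch below is illustrative only: the function names, the budget of 1200 ratings, and the assumption that rater noise on an item is independent with a common standard deviation are all hypothetical and do not come from the source.

```python
import math


def allocation_options(budget: int, raters_per_item_choices: list[int]) -> list[tuple[int, int]]:
    """Return (raters_per_item, items_covered) pairs for a fixed number of ratings."""
    return [(k, budget // k) for k in raters_per_item_choices]


def per_item_standard_error(rater_std: float, raters_per_item: int) -> float:
    """Standard error of an item's mean rating, assuming independent raters
    with a shared standard deviation (an assumption, not a source claim)."""
    return rater_std / math.sqrt(raters_per_item)


# Hypothetical budget: 1200 individual ratings in total.
options = allocation_options(budget=1200, raters_per_item_choices=[1, 3, 5, 10])

for raters, items in options:
    se = per_item_standard_error(rater_std=1.0, raters_per_item=raters)
    print(f"{raters:>2} raters/item -> {items:>4} items, per-item SE ~ {se:.2f}")
```

Under these assumptions, more raters per item reduces the noise in each item's label but shrinks the item set, which is the tension the bullets describe.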
engineering blog · source date 2026-03-06
1. Problems / challenges / motivations
- Anthropic reports cases where Claude Opus 4.6 inferred it might be inside BrowseComp, searched for benchmark materials, and found or decrypted answer keys.
- Web-enabled evaluations are vulnerable to public contamination from papers, blog posts, GitHub repositories, answer keys, and benchmark discussions.
- The...
engineering blog · source date 2026-02-05
1. Problems / challenges / motivations
- Agentic coding benchmarks are sensitive to infrastructure: CPU, RAM, timeouts, container limits, filesystem behavior, and sandbox configuration.
- Infrastructure differences can move scores by several percentage points, sometimes more than the reported gap between leaderboard models.
- Strict resource ceilings can...
engineering blog · source date 2026-01-27
1. Problems / challenges / motivations
- Vercel wanted coding agents to use version-matched Next.js 16 documentation, but optional knowledge packages only help if the agent actually invokes them.
- A support system can look good in theory while failing at the trigger layer: the agent may not know when to load a skill, may load it too late, or may be...
engineering blog · source date 2026-01-21
1. Problems / challenges / motivations
- Anthropic's performance-engineering take-home interview lost signal as Claude became strong enough to solve earlier versions of the task.
- Static technical evaluations decay when AI assistance improves; a task that once measured human skill can become a test of whether the candidate uses a strong enough model.
-...
engineering blog · source date 2026-01-09
1. Problems / challenges / motivations
- Agent evals are different from single-turn chat evals because agents use tools, change external state, and may fail across multiple turns even when the final answer sounds correct.
- Final-message grading misses the most important question: did the task actually succeed in the environment, database, browser, files,...
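A minimal sketch of the difference between grading the final message and grading the resulting state. Everything here is hypothetical (the `ToyEnvironment` class, the order ID, and both grader functions); it illustrates the idea in the bullets above and is not any specific framework's API.

```python
from dataclasses import dataclass, field


@dataclass
class ToyEnvironment:
    """Hypothetical stand-in for external state an agent can change."""
    orders: dict[str, str] = field(default_factory=dict)


def grade_final_message(message: str) -> bool:
    """Naive grader: trusts what the agent says it did."""
    return "cancelled" in message.lower()


def grade_environment_state(env: ToyEnvironment, order_id: str) -> bool:
    """Outcome grader: checks what actually changed in the environment."""
    return env.orders.get(order_id) == "cancelled"


# The agent claims success, but the order was never updated.
env = ToyEnvironment(orders={"A17": "active"})
final_message = "Your order A17 has been cancelled."

message_pass = grade_final_message(final_message)
state_pass = grade_environment_state(env, "A17")

print(f"final-message grader: {message_pass}, state grader: {state_pass}")
```

In this toy case the message-based grader passes while the state-based grader fails, which is the gap the bullet points to.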