arXiv — Open-World Evaluations / CRUX for Measuring Frontier AI Capabilities

academic paper / CRUX · source date 2026-05-19 · added 2026-05-22 18:08:41 · updated 2026-06-18 03:30:48

Standard benchmarks favor tasks that are short, fixed, cheap, and automatically graded. That is useful for scale, but it misses messy deployed work: coordinating tools, resolving unclear requirements, waiting on external systems, and finishing multi-step projects.
Benchmarks can both overstate and understate capability. One model can be optimized for static leaderboards yet fail in real workflows; another can complete practical work that no benchmark captures. Frontier evals need early signals before standardized suites catch up.
Long-horizon real-world tasks are hard to measure: they are expensive, small-sample, and partly qualitative. But those are often the conditions where new agent capabilities first matter.

Open-world evaluations complement benchmarks: they run messy real-world tasks and analyze the process carefully. Example: instead of asking isolated coding questions, ask an agent to build and publish a simple iOS app, including packaging, metadata, and App Store submission.
CRUX (Collaborative Research for Updating AI eXpectations) is a recurring evaluation program: it aims to repeat these capability probes over time, producing evidence rather than a single leaderboard score.
First CRUX case study: the agent completed the iOS app publishing task with only one avoidable manual intervention. That shows a kind of end-to-end capability that short automated tests may miss.
Reporting matters: the useful artifact is not just success/failure, but what happened, where humans intervened, what the agent handled alone, and where bottlenecks remained.

Open-world evaluation adds an early-warning layer: messy project completion can reveal frontier capabilities before they appear in common benchmarks.
It clarifies the tradeoff: benchmarks give scale, repeatability, and trend lines; open-world evals give realism and failure-mode discovery. Strong eval programs need both.
It changes evidence collection: transcripts, artifacts, human interventions, costs, final deployed state, and task narrative become central to judging agent performance.
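
The evidence items listed above could be gathered into one record per task run. The following is a minimal sketch, not the paper's schema: the class names, field names, and the `summary` format are all hypothetical, and the values in the usage note are placeholders rather than reported results.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Intervention:
    """One point where a human stepped in during a run (hypothetical schema)."""
    step: str        # what the agent was doing at the time
    reason: str      # why a human had to act
    avoidable: bool  # whether the agent could plausibly have handled it alone


@dataclass
class RunReport:
    """Evidence bundle for a single open-world task run (illustrative field names)."""
    task: str
    succeeded: bool
    final_state: str   # description of the final deployed state
    cost_usd: float    # total cost of the run
    transcript_paths: List[str] = field(default_factory=list)
    artifact_paths: List[str] = field(default_factory=list)
    interventions: List[Intervention] = field(default_factory=list)
    narrative: str = ""  # free-text account of what happened

    def avoidable_interventions(self) -> int:
        """Count the interventions flagged as avoidable."""
        return sum(1 for i in self.interventions if i.avoidable)

    def summary(self) -> str:
        """One-line summary that keeps interventions next to the outcome."""
        status = "completed" if self.succeeded else "did not complete"
        return (
            f"{self.task}: {status}, "
            f"{len(self.interventions)} intervention(s), "
            f"{self.avoidable_interventions()} avoidable"
        )
```

With placeholder values, `RunReport(task="example-task", succeeded=True, final_state="deployed", cost_usd=0.0)` starts with empty evidence lists, and `Intervention` entries can be appended as they occur. Keeping interventions as structured entries rather than a bare count preserves where and why a human stepped in, which is the part a pass/fail score discards.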
