academic paper / CRUX · source date 2026-05-19 · added 2026-05-22 18:17:49 · updated 2026-07-26 18:58:13

Deep dive — Open-World Evaluations for Measuring Frontier AI Capabilities

Source: arXiv:2605.20520, published 2026-05-19. Authors: Sayash Kapoor, Peter Kirgis, Andrew Schwartz, Stephan Rabanser, J.J. Allaire, Rishi Bommasani, Harry Coppock, Magda Dubois, Gillian Hadfield, Andy Hall, Sara Hooker, Seth Lazar, Steve Newman, Dimitris Papailiopoulos, Shoshannah Tekofsky, Helen Toner, Cozmin Ududec, and Arvind Narayanan. Original: https://arxiv.org/abs/2605.20520

Why this matters

The paper's core point is that benchmark scores are no longer enough to understand frontier AI capability. Benchmarks remain useful, but they reward tasks that are precisely specified, automatically graded, cheap to repeat, and short-horizon. Those properties make measurement scalable, but they also filter out much of the messy reality where deployed agents operate.

The authors propose open-world evaluations as the complement: long-horizon real-world tasks, small samples, qualitative log analysis, and explicit accounting for human interventions and cost. The goal is not leaderboard ranking. The goal is early warning: can a frontier agent already do a messy real-world project that institutions are not yet prepared for?

Interpretive illustration based on the paper's Figure 2: open-world evaluations sit at the messy, long-horizon end of the methodology gradient.

1. The benchmark problem: both overstatement and understatement

The useful framing is two-sided.

Benchmarks can overstate capability because any task precise enough to benchmark is also precise enough to optimize for. Training and evaluation can converge on the same task shape, test sets can leak or be paraphrased into training data, and benchmark-centric incentives can reward score improvement without real-world usefulness.

Benchmarks can also understate capability. A capable agent may fail because of a CAPTCHA, a brittle GUI, a rate limit, a broken environment, or too little budget. If the question is “what could this system do under favorable but plausible conditions?”, then one-shot benchmark accuracy can miss the upper bound.

That is the paper's main methodological move: frontier evaluation is not only about estimating average performance. It is also about eliciting upper-bound capability, because capabilities that work only under favorable conditions today may become cheap and widespread soon.

2. What counts as an open-world evaluation

The paper gives five dimensions rather than a hard definition:

openness: the task touches real users, services, platforms, or deployment settings rather than only a sandbox
complexity and duration: the task unfolds over many interdependent steps, often days or weeks of human-equivalent work
number of tasks: the evaluation uses one or a few tasks that can be inspected deeply, not a large benchmark suite
human intervention: humans may unblock incidental obstacles while documenting what happened
method of evaluation: the main evidence is qualitative log and artifact analysis, not just one aggregate score

This makes open-world evals less reproducible than benchmarks, but potentially higher in construct validity. They ask a more deployment-like question: can the agent make the world different in a meaningful way?

3. The methodology gradient

The paper organizes eval methods as a gradient:

single-turn Q&A: MMLU, GPQA, GSM8K — cheap, scalable, increasingly saturated
open-ended chat: Chatbot Arena, WildBench — closer to user interaction, but still short and subjective
outcome-only agent benchmarks: SWE-bench, WebArena — realistic tasks with automated grading, but limited process visibility
agent benchmarks with log analysis: METR Time Horizon and transcript analysis — better diagnosis, still largely sandboxed
open-world evaluations: CRUX, C Compiler, Project Vend — messy, long-horizon, real-world, hard to compare

The gradient is important because it avoids a false binary. The authors are not saying “benchmarks are bad.” They are saying each methodology buys different signal. Scalable benchmarks are good for broad comparison. Open-world evals are good for capability reconnaissance and failure-mode discovery.

4. CRUX #1: publish an iOS app to the App Store

CRUX stands for Collaborative Research for Updating AI eXpectations. The first CRUX task asked an AI agent to build and publish a simple iOS app to the Apple App Store.

The target capability was not merely coding. The hard part was end-to-end deployment: signing certificates, provisioning profiles, App Store Connect, screenshots, metadata, a privacy policy hosted at a public URL, compliance forms, review submission, and waiting for Apple review.

The setup used OpenClaw with Claude Opus 4.6 and adaptive thinking on a macOS VM. The agent had command-line and browser access, expansive permissions, and accounts for GitHub, Apple Developer, and Gmail. The researchers disclosed to the agent that it was being evaluated, partly because concealment is becoming less feasible and partly because capability measurement is less distorted by awareness than alignment measurement.

Interpretive illustration based on the CRUX #1 timeline and cost breakdown: the app build was cheap; waiting and monitoring dominated cost.

5. Main result: close to autonomous, but not cleanly autonomous

The agent built a simple breathing-exercise app, hosted a privacy policy, filled the App Store review forms, and submitted the app in roughly 45 minutes. Apple approved it about 10 days later, and the app went live.

The agent required five manual interventions. Four were classified as unavoidable or not agent shortcomings: Apple policy requirements around two-factor authentication and release approval, plus an infrastructure daemon crash. One intervention was avoidable: the agent temporarily forgot where credentials were stored. After a prompt, it found the App Store Connect API key at the expected hidden path and resumed.
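One way to keep this accounting honest is to record each intervention as data and tally it. The sketch below is illustrative only: the Intervention class and its field names are hypothetical, not the paper's schema. The per-cause split of the four unavoidable interventions is not given in this summary, so they share one placeholder label.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Intervention:
    cause: str       # hypothetical label for what blocked the agent
    avoidable: bool  # True if the agent could have proceeded without help


# Five interventions, matching the counts reported above. The four
# unavoidable ones share a placeholder label because their exact
# per-cause split is not stated here.
interventions = [
    Intervention("apple_policy_or_infrastructure", avoidable=False),
] * 4 + [Intervention("forgot_credential_location", avoidable=True)]

tally = Counter(i.avoidable for i in interventions)
avoidable_share = tally[True] / len(interventions)
print(f"avoidable: {tally[True]}, unavoidable: {tally[False]}, "
      f"share avoidable: {avoidable_share:.0%}")
```

Separating avoidable from unavoidable interventions keeps the autonomy gap (one of five here) distinct from obstacles imposed by the platform or the infrastructure.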

The headline is therefore not “fully autonomous App Store publishing is solved.” A better reading is: the remaining autonomy gap was small, and the agent reached a real external platform with a real approved artifact.

6. Why log analysis mattered

Outcome-only scoring would say: app approved, task succeeded. The logs made the result much more informative.

Important behaviors surfaced in the logs:

the agent fabricated a fictional phone number for an App Store review form instead of asking for help
the agent sometimes asked for human help and sometimes silently invented data, which is a different risk profile from either always asking or always guessing
most cost did not come from building the app; it came from polling for Apple review status
the agent invented a cost optimization by delegating status checks to subagents and using shorter daily memory files
the published app was functional but flawed: the sound toggle did not work, and the App Store screenshot had formatting issues

This is the strongest argument for open-world evaluation: the most decision-useful evidence is often not the final success bit. It is the trajectory.

7. Cost is a capability variable

The experiment cost about $1,000 in total. Only about $25 went to development and submission; roughly $975 went to monitoring during the review period.
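A back-of-the-envelope model shows why monitoring can dominate: every status check reloads context, so cost grows with the number of polls times the tokens per poll. Everything in the sketch below is an assumption for illustration (the function name, polling frequency, context sizes, and token price). Only the roughly 10-day wait comes from the run described above.

```python
def monitoring_cost_usd(days: float, polls_per_day: float,
                        tokens_per_poll: int,
                        usd_per_million_tokens: float) -> float:
    """Cost of repeatedly waking an agent to check a status page."""
    total_tokens = days * polls_per_day * tokens_per_poll
    return total_tokens * usd_per_million_tokens / 1_000_000


# Hypothetical inputs: a 10-day wait, one check every 30 minutes,
# and a price of $5 per million tokens.
full_context = monitoring_cost_usd(10, 48, 200_000, 5.0)   # large context reloaded each poll
lean_subagent = monitoring_cost_usd(10, 48, 20_000, 5.0)   # subagent with a short memory file
print(full_context, lean_subagent)
```

Under these made-up numbers, shrinking the context loaded per poll by 10x cuts the monitoring bill by 10x. That is the general shape of the optimization the agent found when it delegated status checks to subagents and shortened its memory files.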

This matters because frontier agent capability often scales with budget, retries, time, and scaffolding. A result reported without its cost can be misleading. A “success” that costs $20,000, a success that costs $1,000, and a success that costs $25 plus better infrastructure imply very different deployment timelines.

The paper recommends cost-conditioned reporting: success per dollar, effort-conditioned progress, and where possible pass@k-style measures over larger budgets.
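As a rough illustration of what such measures could look like, the sketch below computes success per dollar and a pass@k estimate. The pass@k formula is the unbiased estimator commonly used in code-generation evaluation, which may not be the exact definition the paper has in mind. The function names and the example numbers are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k runs succeeds), given c successes
    observed in n independent runs (unbiased combinatorial estimator)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k runs failed, giving 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def success_per_dollar(successes: int, total_cost_usd: float) -> float:
    """Successful runs obtained per dollar spent across all runs."""
    if total_cost_usd <= 0:
        raise ValueError("total cost must be positive")
    return successes / total_cost_usd


# Hypothetical example: 10 runs, 3 successes, $1,000 spent overall.
print(round(pass_at_k(10, 3, 1), 3))   # single-attempt success estimate
print(round(pass_at_k(10, 3, 5), 3))   # chance that 5 attempts include a success
print(success_per_dollar(3, 1000.0))
```

Reporting both numbers side by side makes the budget dependence visible: a capability that looks unreliable at one attempt can look near-certain at five, but only at several times the cost.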

8. The six reporting norms I would reuse

The paper's recommendations are practical:

specify the construct: say exactly what capability is being measured and what successful completion does and does not imply
document interventions: permit humans to unblock incidental obstacles, but record when, why, and how
analyze and release logs: treat trajectory evidence as a first-class result and let others inspect it
add real-time monitoring: use watchdogs or monitor agents to catch anomalies before post-hoc review
run dry runs first: test scaffold, accounts, permissions, logging, and success criteria before the main run
report cost: budget, token spend, wall-clock time, human setup time, and where cost concentrated

These norms are the difference between a stunt and a reusable evaluation artifact.
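The six norms can be treated as a checklist that a run report either fills in or leaves blank. The sketch below is one possible encoding, not a format the paper prescribes. RunReport, its field names, and missing_norms are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunReport:
    # None means "not reported"; an empty list or False is still a report.
    construct: Optional[str] = None            # capability measured, and what success implies
    interventions: Optional[list] = None       # when / why / how for each human intervention
    log_location: Optional[str] = None         # where the released logs can be inspected
    monitoring: Optional[str] = None           # watchdog or monitor agent used during the run
    dry_run_completed: Optional[bool] = None   # whether the setup was tested beforehand
    cost_usd: Optional[float] = None           # total spend; a fuller report would break it down


def missing_norms(report: RunReport) -> list[str]:
    """Return the names of reporting norms the report leaves unaddressed."""
    return [name for name, value in vars(report).items() if value is None]


draft = RunReport(construct="build and publish a simple iOS app",
                  interventions=[],
                  dry_run_completed=True)
print(missing_norms(draft))
```

For the draft above, the check flags log_location, monitoring, and cost_usd as still unreported, which is the kind of gap worth closing before publishing a result.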

9. Limitations and what not to overclaim

Open-world evals are weak at model ranking. Small samples, non-stationary web environments, human interventions, and task-specific scaffolds make clean comparisons difficult.

They are also weak at measuring average reliability. A single successful run can demonstrate feasibility but not typical performance. If a task works once after a large budget and careful intervention, that is useful early-warning evidence, not a production-readiness metric.

The artifact-quality issue is also central. App Store approval does not mean the app is good, maintainable, user-valued, or safe at scale. For software tasks especially, completion and production quality must be separate constructs.

Practical takeaway

Use open-world evals when the question is: “What could a frontier agent plausibly do in the real world if given time, tools, budget, and light human unblocking?”

Do not use them as a replacement for regression tests, broad benchmarks, or release gates. Use them as reconnaissance: find emerging capabilities, identify bottlenecks, reveal surprising failures, and convert those observations into more scalable evals later.

Bottom line

This paper is valuable because it makes a methodological tradeoff explicit. Benchmarks are optimized for comparability. Open-world evaluations are optimized for construct validity and early warning. Frontier AI evaluation needs both.
