OpenAI — A shared playbook for trustworthy third-party evaluations

evaluation playbook · source date 2026-06-05 · added 2026-06-18 03:30:48 · updated 2026-06-18 04:27:15

Independent third-party evaluations are increasingly important for trust in frontier AI, but older chatbot-style tests under-measure systems that now use tools, preserve state, and act through agent harnesses.
OpenAI argues that an evaluation report should do more than publish a score: it should also explain what claim the setup was designed to test and what evidence supports the validity of that claim.
The hard part is that harnesses, budgets, tools, scoring rules, safeguards, and review procedures can all materially change the observed result.

OpenAI separates evaluation claims into three buckets: capability elicitation, safeguard performance, and controlled comparison. Each claim type requires a different harness choice and different evidence.
Harness choice is part of the measurement. A strong-elicitation setup may be appropriate for capability ceilings; a shared harness is better for controlled model comparisons; safeguard tests should match the relevant adversary model.
Reports should disclose the tested system, model/tool/harness setup, budget, elicitation strategy, scoring method, and known limitations.
Validity checks should look for reward hacking, refusals, contamination, broken problems, and sandbagging. The article uses examples from METR, UK AISI, Apollo, and OpenAI cyber evaluations to show how these hazards can distort headline results.
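The validity hazards above can be made concrete with a small sketch. It is illustrative only and not taken from the article: the run-record layout (`passed`, `flags`) and the `summarize` function are hypothetical, and dropping flagged runs is just one possible way to show how far a headline score moves once hazards are accounted for.

```python
from collections import Counter

# Hazard names follow the list in the summary above; the record format is assumed.
HAZARDS = ("reward_hacking", "refusal", "contamination", "broken_problem", "sandbagging")


def summarize(runs):
    """Compare the headline pass rate with the pass rate over unflagged runs.

    Each run is a dict with 'passed' (bool) and 'flags' (a set of hazard names
    assigned during review). This layout is a hypothetical convention.
    """
    headline = sum(r["passed"] for r in runs) / len(runs)
    clean = [r for r in runs if not r["flags"]]
    # None signals that every run was flagged, so no clean rate exists.
    clean_rate = sum(r["passed"] for r in clean) / len(clean) if clean else None
    counts = Counter(flag for r in runs for flag in r["flags"])
    return {
        "headline": headline,
        "clean": clean_rate,
        "flag_counts": {h: counts[h] for h in HAZARDS},
    }


runs = [
    {"passed": True, "flags": set()},
    {"passed": True, "flags": {"reward_hacking"}},
    {"passed": True, "flags": {"contamination"}},
    {"passed": False, "flags": set()},
]
result = summarize(runs)
print(result["headline"], result["clean"])  # 0.75 0.5
```

In this toy data two of the three passes are flagged, so the headline rate of 0.75 falls to 0.5 on the unflagged runs. Hazards can push in either direction: reward hacking and contamination tend to inflate a score, while refusals, sandbagging, and broken problems tend to depress it, so a real report would likely treat them separately rather than with a single filter.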

The piece makes harness and validity evidence first-class parts of AI evaluation, not implementation details.
It is especially relevant for agent evals: when models act over long trajectories, performance depends on the surrounding scaffold as much as on the base model.
For standards and public reporting, the practical takeaway is that trustworthy third-party evals need claim-specific methodology, transparent budgets, and explicit validity checks before their scores can support governance or safety decisions.
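One way to read this takeaway is as a completeness gate that a report passes before its score is used. The sketch below is an assumption-laden illustration, not a format proposed by OpenAI: the `EvalReport` class, its field names, and the `gaps` function are hypothetical, with fields chosen to mirror the three claim types and the disclosure items listed in this summary.

```python
from dataclasses import dataclass, field, fields

# The three claim buckets named in the summary.
CLAIM_TYPES = {"capability elicitation", "safeguard performance", "controlled comparison"}


@dataclass
class EvalReport:
    """Hypothetical report record; field names are illustrative assumptions."""
    claim_type: str = ""
    tested_system: str = ""
    setup: str = ""                  # model, tools, and harness
    budget: str = ""
    elicitation_strategy: str = ""
    scoring_method: str = ""
    known_limitations: list = field(default_factory=list)
    validity_checks: list = field(default_factory=list)


def gaps(report):
    """Return a list of problems; an empty list means nothing is missing."""
    problems = [f"missing: {f.name}" for f in fields(report) if not getattr(report, f.name)]
    if report.claim_type and report.claim_type not in CLAIM_TYPES:
        problems.append(f"unknown claim type: {report.claim_type}")
    return problems


report = EvalReport(
    claim_type="controlled comparison",
    tested_system="example-system",
    budget="example budget",
)
for problem in gaps(report):
    print(problem)
```

Treating an empty `known_limitations` list as a gap is a deliberate choice in this sketch: a report that found no limitations would say so explicitly (for example with a "none identified" entry) rather than leave the field blank.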
