Google Research — Building better AI benchmarks: How many raters are enough?
research blog + paper · source date 2026-03-31 · added 2026-05-18 23:09:08 · updated 2026-05-30 17:20:13
1
Problems / challenges / motivations
- Human-backed AI benchmarks often collapse rater disagreement into a single label per item, even when the task is subjective.
- Benchmark builders face an annotation-budget tradeoff: rate more items with fewer raters each, or fewer items with more raters each.
- Too few raters can make model comparisons fragile, especially for toxicity, helpfulness, preference, alignment, and other judgment-heavy tasks.
[Image: preview of the Google Research article on benchmark reproducibility. Source: original article.]
2
Key ideas
- Google frames benchmark construction as an N-versus-K allocation problem under a fixed annotation budget: N is the number of items and K is the number of raters per item.
- The work uses “gold” ratings data and simulation to estimate which allocation produces more reproducible model comparisons.
- The key point is that benchmark reliability depends on measurement design, not just model choice or metric choice.
- Human disagreement should be treated as signal about task ambiguity and uncertainty, not automatically erased.
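The N-versus-K question can be explored with a toy simulation. The sketch below is not the method or data from the Google work. It assumes a hypothetical setup in which model A is truly better than model B by a small average margin (`delta`), the per-item margin varies (`item_sd`), and each rater gives a binary approve/reject label. All function names, parameters, and distributions here are illustrative assumptions. For each allocation with N × K equal to the budget, it estimates how often a freshly sampled benchmark ranks A above B.

```python
import random


def simulate_benchmark(n_items, k_raters, rng, delta=0.05, item_sd=0.15):
    """Simulate one benchmark; return True if model A scores higher than B.

    Assumed toy model (not from the source): each item has a latent
    approval rate, model A's rate exceeds model B's by a per-item gap
    drawn around `delta`, and every rating is an independent coin flip.
    """
    score_a = 0
    score_b = 0
    for _ in range(n_items):
        base = rng.betavariate(2.0, 2.0)   # item-level approval rate
        gap = rng.gauss(delta, item_sd)    # per-item advantage of A over B
        p_a = min(max(base + gap / 2, 0.0), 1.0)
        p_b = min(max(base - gap / 2, 0.0), 1.0)
        for _ in range(k_raters):
            score_a += rng.random() < p_a  # one rater's label for A
            score_b += rng.random() < p_b  # one rater's label for B
    return score_a > score_b


def reproducibility(n_items, k_raters, trials=200, seed=0):
    """Fraction of simulated benchmarks that rank the truly better model first."""
    rng = random.Random(seed)
    wins = sum(simulate_benchmark(n_items, k_raters, rng) for _ in range(trials))
    return wins / trials


def compare_allocations(budget, k_options, trials=200, seed=0):
    """Map each (N, K) allocation with N = budget // K to its reproducibility."""
    results = {}
    for k in k_options:
        n = budget // k
        results[(n, k)] = reproducibility(n, k, trials, seed)
    return results


if __name__ == "__main__":
    for (n, k), rate in compare_allocations(600, [1, 3, 5]).items():
        print(f"N={n:4d} items, K={k} raters/item -> A ranked first in {rate:.0%} of runs")
```

Which allocation comes out ahead depends entirely on the assumed noise model and on the metric being compared, so the numbers from this sketch illustrate the procedure rather than reproduce the findings in the original article.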
3
Why it matters for evals
- The N-versus-K framing is directly useful for building human-backed evals with limited annotation budgets.
- If labels hide natural disagreement, small differences between models may be noise rather than real progress.
- The practical takeaway is to report uncertainty, design for reproducibility, and choose rater allocation based on the eval's decision purpose.
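One way to report uncertainty on a model comparison is a bootstrap confidence interval over items and raters. The sketch below is a generic technique, not a procedure taken from the original article. It assumes a hypothetical data layout in which `ratings_a[i]` and `ratings_b[i]` are the lists of rater scores for item i under each model; the function name and parameters are illustrative.

```python
import random
from statistics import mean


def bootstrap_diff_ci(ratings_a, ratings_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean(A) - mean(B).

    ratings_a[i] and ratings_b[i] hold the rater scores for item i.
    Items are resampled with replacement, and raters are resampled
    within each drawn item, so rater disagreement widens the interval.
    """
    if len(ratings_a) != len(ratings_b):
        raise ValueError("both models must be rated on the same items")
    rng = random.Random(seed)
    n = len(ratings_a)
    diffs = []
    for _ in range(n_boot):
        total = 0.0
        for _ in range(n):
            i = rng.randrange(n)  # resample an item
            ra = rng.choices(ratings_a[i], k=len(ratings_a[i]))  # resample raters
            rb = rng.choices(ratings_b[i], k=len(ratings_b[i]))
            total += mean(ra) - mean(rb)
        diffs.append(total / n)
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_boot)]
    hi = diffs[min(int((1 - alpha / 2) * n_boot), n_boot - 1)]
    return lo, hi


if __name__ == "__main__":
    # Hypothetical ratings: 4 items, 3 raters each, binary labels.
    a = [[1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 1, 0]]
    b = [[1, 0, 0], [0, 0, 1], [1, 1, 0], [0, 0, 0]]
    low, high = bootstrap_diff_ci(a, b)
    print(f"95% interval for A - B: [{low:.3f}, {high:.3f}]")
```

If the interval includes zero, the observed gap between the two models may be noise at the current N and K, which is a cue to collect more ratings before treating the difference as real progress.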