Google Research — Building better AI benchmarks: How many raters are enough?

research blog + paper · source date 2026-03-31 · added 2026-05-18 23:09:08 · updated 2026-05-30 17:20:13

Human-backed AI benchmarks often collapse disagreement into a single label even when the task is subjective.
Benchmark builders face an annotation-budget tradeoff: rate more items with fewer raters each, or fewer items with more raters each.
Too few raters can make model comparisons fragile, especially for toxicity, helpfulness, preference, alignment, and other judgment-heavy tasks.

Google frames benchmark construction as an N-versus-K allocation problem: under a fixed annotation budget, how to trade off N, the number of items, against K, the number of raters per item.
The work uses “gold” ratings data and simulation to estimate which allocation produces more reproducible model comparisons.
The key point is that benchmark reliability depends on measurement design, not just model choice or metric choice.
Human disagreement should be treated as signal about task ambiguity and uncertainty, not automatically erased.
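
The N-versus-K tradeoff can be explored with a small simulation. The sketch below is illustrative only and is not the method from the Google work: the generative model (each item has a random approval rate, and model B is better by a fixed `effect`), the function names, and the budget of 600 ratings are all assumptions made for this example. It splits the same budget several ways and estimates how often a simulated benchmark ranks the truly better model first.

```python
import random

def allocations(budget, k_options):
    """List (n_items, k_raters) pairs that spend the budget exactly."""
    return [(budget // k, k) for k in k_options if budget % k == 0]

def simulate_agreement(n_items, k_raters, trials=200, effect=0.05, seed=0):
    """Fraction of simulated benchmarks in which model B (truly better) wins.

    Assumed toy model: each item has a base approval rate drawn uniformly
    from [0.2, 0.8]; each rater independently approves model A's output
    with that rate and model B's output with rate base + effect.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        diff_sum = 0.0
        for _ in range(n_items):
            base = rng.uniform(0.2, 0.8)
            p_a, p_b = base, min(1.0, base + effect)
            # Per-item score is the share of the K raters who approve.
            score_a = sum(rng.random() < p_a for _ in range(k_raters)) / k_raters
            score_b = sum(rng.random() < p_b for _ in range(k_raters)) / k_raters
            diff_sum += score_b - score_a
        if diff_sum > 0:
            wins += 1
    return wins / trials

# Compare several ways of spending a hypothetical budget of 600 ratings per model.
for n, k in allocations(600, (1, 3, 5, 10)):
    rate = simulate_agreement(n, k)
    print(f"N={n:4d} K={k:2d} -> better model wins in {rate:.0%} of trials")
```

Which allocation comes out ahead depends entirely on the assumed noise model, so the numbers printed here say nothing about the published results. The point is only that the allocation is a design parameter that can be tested before annotation money is spent.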

This is directly useful for building human-backed evals with limited annotation budgets.
If labels hide natural disagreement, small differences between models may be noise rather than real progress.
The practical takeaway is to report uncertainty, design for reproducibility, and choose rater allocation based on the eval's decision purpose.
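
One way to report uncertainty without collapsing rater disagreement is a two-level bootstrap that resamples items and then raters within each item. This is a minimal sketch of that generic technique, not something taken from the source; the function name `bootstrap_diff_ci` and the toy ratings are hypothetical and chosen only for illustration.

```python
import random
from statistics import mean

def bootstrap_diff_ci(ratings_a, ratings_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean(B) - mean(A).

    ratings_a[i] and ratings_b[i] are the individual rater scores for
    item i under each model. Items are resampled first, then raters
    within each sampled item, so both sources of variation are kept.
    """
    rng = random.Random(seed)
    n = len(ratings_a)
    point = mean(mean(b) - mean(a) for a, b in zip(ratings_a, ratings_b))
    diffs = []
    for _ in range(n_boot):
        item_diffs = []
        for _ in range(n):
            i = rng.randrange(n)
            ra = rng.choices(ratings_a[i], k=len(ratings_a[i]))
            rb = rng.choices(ratings_b[i], k=len(ratings_b[i]))
            item_diffs.append(mean(rb) - mean(ra))
        diffs.append(mean(item_diffs))
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return point, lo, hi

# Made-up toy data: 4 items, 3 binary ratings per item per model.
toy_a = [[1, 0, 1], [0, 0, 1], [1, 1, 1], [0, 1, 0]]
toy_b = [[1, 1, 1], [0, 1, 1], [1, 1, 0], [1, 1, 0]]
point, lo, hi = bootstrap_diff_ci(toy_a, toy_b)
print(f"B - A = {point:.3f}, 95% interval [{lo:.3f}, {hi:.3f}]")
```

If the interval comfortably includes zero, the observed gap between the two models is within what rater and item noise alone could produce, which is the situation the summary warns about.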
