AI & Agent Evaluation

Google Research — Building better AI benchmarks: How many raters are enough?

research blog + paper · source date 2026-03-31 · added 2026-05-18 23:09:08 · updated 2026-05-30 17:20:13

Problems / challenges / motivations

  • Human-backed AI benchmarks often collapse disagreement into a single label even when the task is subjective.
  • Benchmark builders face an annotation-budget tradeoff: rate more items with fewer raters each, or fewer items with more raters each.
  • Too few raters can make model comparisons fragile, especially for toxicity, helpfulness, preference, alignment, and other judgment-heavy tasks.

[Image: Google Research benchmark reproducibility article preview. Source: original article.]

Key ideas

  • Google frames benchmark construction as an N-versus-K allocation problem: number of items versus number of raters per item.
  • The work uses “gold” ratings data and simulation to estimate which allocation produces more reproducible model comparisons.
  • Benchmark reliability depends on measurement design (how many items, how many raters per item), not just on model choice or metric choice.
  • Human disagreement should be treated as signal about task ambiguity and uncertainty, not automatically erased.
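
The N-versus-K tradeoff can be explored with a small simulation. The sketch below is a toy model, not the procedure from the Google Research work. Everything in it is an assumption made for illustration: each item has a hidden probability p that a random rater labels it positive (drawn from a Beta(0.5, 0.5) distribution), the gold label is the majority vote of K raters, and two hypothetical models A and B agree with the true majority with fixed probabilities acc_a and acc_b. The function name win_rate and all parameter values are invented here.

```python
import numpy as np


def win_rate(n_items, k_raters, acc_a=0.80, acc_b=0.75, trials=2000, seed=0):
    """Fraction of simulated benchmarks that rank model A above model B.

    Toy assumptions (not from the source): binary labels, majority-vote
    gold labels, and models that match the true majority with fixed odds.
    """
    rng = np.random.default_rng(seed)
    # p[t, i]: chance that a random rater labels item i positive in trial t.
    p = rng.beta(0.5, 0.5, size=(trials, n_items))
    truth = p > 0.5  # label an unlimited pool of raters would converge to
    # Number of positive votes among k_raters simulated raters per item.
    votes = rng.binomial(k_raters, p)
    gold = 2 * votes > k_raters  # majority vote (use odd k to avoid ties)
    # Each model reproduces the true label with its own fixed probability.
    pred_a = np.where(rng.random(p.shape) < acc_a, truth, ~truth)
    pred_b = np.where(rng.random(p.shape) < acc_b, truth, ~truth)
    score_a = (pred_a == gold).mean(axis=1)
    score_b = (pred_b == gold).mean(axis=1)
    # Tied scores are counted as "A not ranked first".
    return float((score_a > score_b).mean())


if __name__ == "__main__":
    budget = 600  # total ratings available, so N * K == budget
    for k in (1, 3, 5, 15):
        n = budget // k
        print(f"N={n:4d} K={k:2d} A ranked first in {win_rate(n, k):.2f} of runs")
```

Which allocation comes out ahead in this sketch depends on the assumed disagreement distribution, the accuracy gap, and the metric, so the printed numbers describe the toy model only. Swapping in per-item rating distributions estimated from real multi-rater data would bring it closer to the simulation approach described above.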

Why it matters for evals

  • The N-versus-K framing is directly useful when building human-backed evals under a limited annotation budget.
  • If labels hide natural disagreement, small differences between models may be noise rather than real progress.
  • The practical takeaway is to report uncertainty, design for reproducibility, and choose rater allocation based on the eval's decision purpose.
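
One common way to report uncertainty on a model comparison is a paired bootstrap over items. The sketch below is a generic illustration, not a method taken from the source. It assumes each model already has one score per item (for example, the mean rating across that item's raters); the function name paired_bootstrap_ci and the synthetic scores in the usage section are hypothetical.

```python
import numpy as np


def paired_bootstrap_ci(score_a, score_b, n_boot=5000, level=0.95, seed=0):
    """Mean per-item score difference (A minus B) with a bootstrap interval.

    Resamples items with replacement and keeps each item's two scores
    paired. Rater-level noise is not resampled in this simple version.
    """
    rng = np.random.default_rng(seed)
    diff = np.asarray(score_a, dtype=float) - np.asarray(score_b, dtype=float)
    n = diff.size
    # Each row of idx is one resampled benchmark of n items.
    idx = rng.integers(0, n, size=(n_boot, n))
    boot_means = diff[idx].mean(axis=1)
    lo, hi = np.quantile(boot_means, [(1 - level) / 2, (1 + level) / 2])
    return float(diff.mean()), float(lo), float(hi)


if __name__ == "__main__":
    # Synthetic per-item scores, generated only to exercise the function.
    rng = np.random.default_rng(1)
    a = rng.normal(0.70, 0.20, size=200)
    b = rng.normal(0.68, 0.20, size=200)
    mean_diff, lo, hi = paired_bootstrap_ci(a, b)
    print(f"A - B = {mean_diff:.3f}, 95% interval [{lo:.3f}, {hi:.3f}]")
```

If the interval includes zero, the observed gap between the two models may be noise. Because only items are resampled here, the interval understates uncertainty when each item has few raters; resampling raters within items as well would be one way to account for that.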
