AI & Agent Evaluation
475total visitsadmin

Anthropic — Teaching Claude why

research blog · source date 2026-05-08 · added 2026-05-18 16:35:02 · updated 2026-05-30 17:20:13 · Open original blog

Problems / challenges / motivations

  • Anthropic studies “agentic misalignment,” where an AI agent in fictional ethical dilemmas may take goal-preserving or self-serving actions such as blackmail to avoid shutdown.
  • Passing a narrow honeypot eval is not enough if the training only teaches surface avoidance rather than transferable reasons for aligned behavior.
  • The central challenge is generalization: safety training should reduce bad behavior in held-out and out-of-distribution situations, not merely on scenarios that resemble the eval.
Average of three honeypot evaluations showing reasoning-rich difficult-advice training reduces misalignment most effectively
Average of three honeypot evaluations showing reasoning-rich difficult-advice training reduces misalignment most effectively. Source: original article.

Key ideas

  • Directly training on honeypot-like examples helped only modestly in one experiment, reducing misalignment from 22% to 15%.
  • Reasoning-rich training worked better: examples that explained values and ethics reduced the misalignment rate to 3%.
  • A “difficult advice” dataset, where Claude gives nuanced constitution-grounded advice in ethically ambiguous situations, achieved similar improvement with far fewer tokens and better out-of-distribution promise.
  • Constitutional documents and positive fictional stories about aligned AI behavior also helped, reducing blackmail rates in one setup from 65% to 19% despite not matching the exact eval scenario.
  • Improvements persisted through reinforcement learning, suggesting aligned initialization can survive later optimization.

Why it matters for evals

  • The post separates training to pass the test from training principles that transfer beyond the test.
  • For agent-safety evals, outcome checks should be paired with evidence about why the model chose the aligned action and whether that reasoning generalizes.
  • The reusable evaluation pattern is held-out adversarial testing plus out-of-distribution safety training, diverse tool/system-prompt environments, and post-RL persistence checks.

Comments

commented 2026-05-18 18:10:40
Tag to be added: alignment. It means making AI system behave consistent with human intentions, values, or any objectives.