system card · source date 2026-04-23 · added 2026-05-18 23:09:08 · updated 2026-05-30 17:20:13 · Open original blog
1
Problems / challenges / motivations
OpenAI's GPT-5.5 System Card evaluates a model expected to do real work: coding, research, document creation, tool use, and multi-step tasks.
The safety question is broader than chat quality because deployed agentic systems can take actions, interact with tools, and create operational risks.
Offline benchmark scores alone cannot establish deployment safety; the system also needs policy compliance tests, adversarial robustness checks, production-derived traces, monitoring, and safeguards.
Figure 5: Internal deployment traffic by misalignment category. Source: original article.
2
Key ideas
The card layers evaluation across policy compliance, vision safety, agentic action safety, jailbreak robustness, prompt injection, health behavior, hallucination, alignment, bias, and Preparedness domains.
Agentic action safety includes confirmation before consequential actions, avoiding accidental destructive changes, preserving user work, and handling tool use under attack.
Alignment evaluation uses representative external ChatGPT prompts and coding-agent deployment traces classified into categories such as circumventing restrictions, concealing uncertainty, destructive actions, deception, reward hacking, and unauthorized data transfer.
The card also evaluates monitorability and controllability, asking what evidence safety systems need to detect problematic behavior.
Preparedness testing covers bio/chem, cybersecurity, and AI self-improvement with expert benchmarks and external evaluations.
Figure 7: Chain-of-thought monitorability under different monitor visibility settings. Source: original article.
3
Why it matters for evals
The system card is closer to a release-safety operating system than a single benchmark report.
Its most reusable lesson for agent evals is to grade operational behavior: what the model does with tools, how it behaves under adversarial context, what risks appear in real traces, and whether monitors can see enough to catch failures.
It also separates capability measurement from deployment posture: safeguards, external review, severity labels, and production-derived failure categories are part of the evaluation evidence.
Comments
No comments yet.