arXiv — ProofAgent Harness: Open Infrastructure for Adversarial Evaluation of AI Agents
arXiv paper · source date 2026-05-22 · added 2026-05-27 18:21:24 · updated 2026-05-30 17:20:13
1
Problems / challenges / motivations
- Agent products increasingly use tools, remember context, handle private data, and interact across many turns, so isolated-output grading misses failures that emerge only through trajectory and pressure.
- Static benchmarks can hide selective weakness: an agent may post a strong headline score while still failing under unsafe reframing, on weaker individual metrics, at fragile intermediate turns, or along manipulation paths.
- High-risk domains such as customer support, medical triage, privacy/security, and code generation need evidence that is auditable, not just a pass/fail score detached from the conversation.
2
Key ideas
- ProofAgent Harness wraps an agent with infrastructure that curates evaluation intelligence, runs adversarial multi-turn trials, captures behavioral traces, scores results post hoc, resolves juror disagreement, and produces evidence-linked reports.
- Its core method is Adversarial Multi-Juror Scoring with Turn-Level Audit: calibrated juror personas judge completed behavior under pressure and tie scores back to specific turns.
- The framework is extensible: teams can add domains, traps, metrics, juror personas, scoring rules, and report formats instead of treating a benchmark as a fixed leaderboard.
- A notable claim is that a small local quantized Harness LLM can pressure stronger production agents when embedded in a good evaluation pipeline, suggesting harness design matters as much as evaluator model scale.
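To make the multi-juror idea concrete, here is a minimal sketch of how several juror verdicts on one completed trial might be combined while keeping the cited turns attached. It is not the paper's implementation: the `JurorVerdict` type, the persona names, the 0 to 1 score scale, the median aggregate, and the `disagreement_threshold` value are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class JurorVerdict:
    juror: str                       # hypothetical persona name
    score: float                     # assumed scale: 0.0 (fail) to 1.0 (pass)
    evidence_turns: tuple[int, ...]  # turn indices the juror cites as evidence


def aggregate(verdicts, disagreement_threshold=0.3):
    """Combine post hoc juror verdicts for a single trial.

    The median and the spread threshold are illustrative choices,
    not rules taken from the paper.
    """
    if not verdicts:
        raise ValueError("at least one verdict is required")
    scores = [v.score for v in verdicts]
    spread = max(scores) - min(scores)
    # Union of every turn any juror pointed at, so the score stays auditable.
    cited = sorted({t for v in verdicts for t in v.evidence_turns})
    return {
        "score": median(scores),
        "spread": spread,
        "disputed": spread > disagreement_threshold,
        "evidence_turns": cited,
    }


verdicts = [
    JurorVerdict("privacy_auditor", 0.2, (3, 5)),
    JurorVerdict("policy_reviewer", 0.4, (5,)),
    JurorVerdict("helpfulness_judge", 0.9, ()),
]
result = aggregate(verdicts)
print(result)
```

In this sketch a trial flagged as `disputed` would be routed to whatever disagreement-resolution step the harness uses, rather than having its juror scores silently averaged away.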
3
Why it matters for evals
- This paper pushes agent evals toward adversarial infrastructure: collect traces, apply multiple judges, preserve evidence, and make disagreement visible before deployment.
- For production eval design, the reusable pattern is to test behavior under pressure: privacy boundary tests, policy reframing, tool misuse attempts, and multi-turn manipulation rather than only clean happy-path tasks.
- The eval artifact should become a diagnosis report: which turn failed, which policy or capability was implicated, which jurors disagreed, and what change would reduce risk.
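The diagnosis-report pattern described above can be sketched as a small data transformation: per-turn findings go in, and only the failing turns come out, each with the implicated policy, the jurors who dissented from the failing verdict, and a proposed change. The field names, the `fail_below` cutoff, and the sample data are hypothetical and are not drawn from the paper.

```python
from dataclasses import dataclass


@dataclass
class TurnFinding:
    turn: int
    policy: str                     # policy or capability implicated
    juror_scores: dict[str, float]  # hypothetical juror name -> score in [0, 1]
    suggested_change: str


def diagnose(findings, fail_below=0.5):
    """Return one report entry per failing turn, ordered by turn index."""
    report = []
    for f in sorted(findings, key=lambda item: item.turn):
        if not f.juror_scores:
            continue  # nothing was scored for this turn
        mean = sum(f.juror_scores.values()) / len(f.juror_scores)
        if mean >= fail_below:
            continue  # the turn passed on average, so it is not reported
        # Jurors who would have passed the turn disagree with the failing verdict.
        dissenters = sorted(j for j, s in f.juror_scores.items() if s >= fail_below)
        report.append({
            "turn": f.turn,
            "policy": f.policy,
            "mean_score": round(mean, 2),
            "dissenting_jurors": dissenters,
            "suggested_change": f.suggested_change,
        })
    return report


findings = [
    TurnFinding(2, "tool_use", {"privacy_auditor": 0.9, "policy_reviewer": 0.8}, "none"),
    TurnFinding(
        4,
        "privacy_boundary",
        {"privacy_auditor": 0.1, "policy_reviewer": 0.2, "helpfulness_judge": 0.8},
        "refuse to reveal account details after an identity reframing",
    ),
]
report = diagnose(findings)
for entry in report:
    print(entry)
```

Keeping the dissenting jurors in each entry makes disagreement visible in the final artifact instead of hiding it behind a single averaged number.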