AI & Agent Evaluation

Reading room

Short summaries of AI and agent evaluation research, organized by broad tags.

$ evals.index --public
posts: 6
mode: short summaries
storage: sqlite
status: listening
Filtering by harnesses.

arXiv — ProofAgent Harness: Open Infrastructure for Adversarial Evaluation of AI Agents

arXiv paper · source date 2026-05-22

1. Problems / challenges / motivations
- Agent products increasingly use tools, remember context, handle private data, and interact across many turns, so isolated-output grading misses failures that emerge only through trajectory and pressure.
- Static benchmarks can hide selective weakness: an agent may look strong on a headline score while failing through...

arXiv — Code as Agent Harness: Toward Executable, Verifiable, and Stateful Agent Systems

arXiv survey · source date 2026-05-18

1. Problems / challenges / motivations
- Modern LLM agents increasingly succeed or fail because of the runtime around the model: tools, code execution, memory, sandboxes, repositories, validators, permissions, traces, and feedback loops.
- Final task success is too flat for this world. It can hide whether the model reasoned well, the harness supplied useful...

OpenReview — Agent Harness Engineering: A Survey

OpenReview survey · source date 2026-05-14

1. Problems / challenges / motivations
- The paper argues that real-world LLM-agent reliability is often constrained less by the base model than by the execution harness around it: environment, tools, context, orchestration, observability, evaluation, and governance.
- Prompt engineering and context engineering are no longer enough for production agents....

arXiv — Meta-Harness: End-to-End Optimization of Model Harnesses

arXiv paper · source date 2026-03-30

1. Problems / challenges / motivations
- Meta-Harness starts from a harness-engineering problem: the same frozen model can perform very differently depending on surrounding code for retrieval, memory, prompt construction, tool loops, and completion logic.
- Existing text optimizers often compress experience into scalar scores, short summaries, fixed...

Anthropic — Harness design for long-running application development

engineering blog · source date 2026-03-24

1. Problems / challenges / motivations
- Long-running coding and frontend-generation agents degrade as context fills, coherence drops, and models develop “context anxiety.”
- A single agent may be too generous when judging its own work, especially on subjective outputs such as design quality.
- For long tasks, the surrounding harness can matter as much as...

Vercel — AGENTS.md outperforms skills in our agent evals

engineering blog · source date 2026-01-27

1. Problems / challenges / motivations
- Vercel wanted coding agents to use version-matched Next.js 16 documentation, but optional knowledge packages only help if the agent actually invokes them.
- A support system can look good in theory while failing at the trigger layer: the agent may not know when to load a skill, may load it too late, or may be...