What to measure
Define the properties the eval should estimate across the model, runtime system, and product experience. A metric matters only if it maps to a concrete product or deployment decision.
A practical map of how teams test AI and agent systems: what to measure, tasks, execution traces, grading, metrics, and feedback loops.
AI eval, or AI evaluation, is the practice of testing LLMs, AI agents, and GenAI applications for correctness, reliability, safety, tool use, and production behavior. This site collects practical notes on agent benchmarks, graders, traces, production monitoring, and reliability testing for AI systems.
A compact map of how AI systems are evaluated from design-time goals to production feedback.
Define the properties the eval should estimate across the model, runtime system, and product experience. A metric matters only if it maps to a concrete product or deployment decision.
Design tasks that reflect real usage, workflow complexity, and likely failure modes. Good scenarios cover both normal demand and the edge cases that break systems.
Run the model inside the production-like stack and capture enough execution evidence to debug outcomes, attribute failures, and compare behavior across versions.
Judge attempts offline and turn runs into interpretable signals. Combine objective checks with calibrated evaluators and metrics that expose quality, uncertainty, regressions, and cost.
Learn from real usage: whether the system understood intent, created quality outcomes, produced durable value, and improved through experiments and telemetry. Online evidence and offline grading calibrate each other.
Traditional ML eval usually measures a single prediction against known ground truth. Modern GenAI and agent evals still use that discipline, but expand the unit of evaluation from one output to a whole system trajectory.
Often looks like: input → single prediction → compare with ground truth. The core object is a function, f(x) → ŷ.
More often looks like: goal → reasoning → tools → environment → artifact → evaluation. The object is the behavior of a system over a trajectory.