AI Evaluation Glossary — LLM and Agent Eval Terms

Agent benchmark

A benchmark designed for systems that plan, call tools, and act across multiple steps. It usually scores task completion, tool use, state changes, recovery behavior, and trace quality rather than one final answer.

Agent harness

The runtime scaffold around an agent: prompts, tools, memory, policies, retry logic, and execution loop. A harness can strongly affect eval results, so model comparisons should control or report it.

Alignment

The process of making a model or agent behave according to intended goals, human values, and safety constraints. In evals, alignment is tested through edge cases, adversarial prompts, policy scenarios, and real user workflows.

Calibration

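How well a system's stated confidence matches its actual rate of correctness. A well-calibrated model that reports 80% confidence is right about 80% of the time, so it signals uncertainty when it is likely to be wrong. This matters for escalation, human review, and high-stakes decisions.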
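
One common way to quantify calibration is expected calibration error (ECE): group predictions into confidence bins, then average the gap between mean confidence and accuracy in each bin, weighted by bin size. The sketch below is a minimal illustration; the function name, the ten equal-width bins, and the input format are assumptions rather than the API of any particular eval library.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Bin-size-weighted average of |mean confidence - accuracy| per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")

    # Assign each prediction to an equal-width bin over [0, 1].
    bins: list[list[tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))

    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += (len(items) / n) * abs(avg_conf - accuracy)
    return ece
```

A score near 0 means confidence tracks accuracy closely. A system that always reports 100% confidence but is right only a quarter of the time scores 0.75.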

Canary task

A small, stable eval task used as an early warning signal. Canary tasks should be cheap, frequently run, and sensitive to regressions in important behaviors.

Deterministic grader

A script, rule, unit test, database query, or exact verifier that scores outputs or final state without model judgment. It is preferred when success can be objectively checked.
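
As an illustration, a deterministic grader for an agent task might compare the final state against an expected state, key by key. This is a minimal sketch: the `grade_final_state` name and the dictionary representation of state are hypothetical, and a real grader would typically query the actual database or file system.

```python
from typing import Any


def grade_final_state(actual: dict[str, Any], expected: dict[str, Any]) -> dict[str, Any]:
    """Pass only if every expected key is present in `actual` with the expected value.

    Keys in `actual` that are not listed in `expected` are ignored.
    """
    mismatches: dict[str, dict[str, Any]] = {}
    for key, want in expected.items():
        if key not in actual:
            mismatches[key] = {"expected": want, "actual": "<missing>"}
        elif actual[key] != want:
            mismatches[key] = {"expected": want, "actual": actual[key]}
    return {"passed": not mismatches, "mismatches": mismatches}


# Hypothetical refund task: the grader checks state, not the agent's claims.
result = grade_final_state(
    actual={"order_status": "pending", "refund_amount": 25.0},
    expected={"order_status": "refunded", "refund_amount": 25.0},
)
```

Returning the mismatches alongside the pass/fail flag keeps the verdict auditable without any model judgment.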

Eval harness

The infrastructure that runs tasks, resets environments, captures traces, calls graders, and aggregates results. A good harness makes evals repeatable, auditable, and comparable across model or prompt changes.
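
The loop described above (reset, run, capture a trace, grade, aggregate) can be sketched in a few lines. Everything here is illustrative: `Task`, `RunResult`, and `run_eval` are hypothetical names, and `system` stands in for any callable that maps a prompt to an output.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Task:
    task_id: str
    prompt: str
    grade: Callable[[str], bool]  # grader: output -> pass/fail


@dataclass
class RunResult:
    task_id: str
    passed: bool
    trace: list[dict[str, Any]] = field(default_factory=list)


def run_eval(
    tasks: list[Task],
    system: Callable[[str], str],
    reset: Callable[[], None] = lambda: None,
) -> dict[str, Any]:
    """Run each task from a clean state, record a trace, grade, and aggregate."""
    results: list[RunResult] = []
    for task in tasks:
        reset()  # restore the environment so runs do not leak state into each other
        trace: list[dict[str, Any]] = [{"event": "prompt", "content": task.prompt}]
        try:
            output = system(task.prompt)
            trace.append({"event": "output", "content": output})
            passed = task.grade(output)
        except Exception as exc:  # a crash counts as a failure but is kept in the trace
            trace.append({"event": "error", "content": repr(exc)})
            passed = False
        results.append(RunResult(task.task_id, passed, trace))
    pass_rate = sum(r.passed for r in results) / len(results) if results else 0.0
    return {"pass_rate": pass_rate, "results": results}
```

Keeping the per-task traces next to the aggregate score is what makes a changed pass rate explainable later.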

Failure taxonomy

A structured list of ways a system fails, such as wrong reasoning, bad retrieval, unsafe tool use, formatting errors, or missed constraints. Taxonomies turn vague failures into fixable categories.
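
In practice a taxonomy is often encoded as a fixed set of labels so that failures can be counted and ranked. The sketch below uses the categories named above; the `FailureType` enum and `tally_failures` helper are hypothetical names chosen for illustration.

```python
from collections import Counter
from enum import Enum


class FailureType(Enum):
    WRONG_REASONING = "wrong_reasoning"
    BAD_RETRIEVAL = "bad_retrieval"
    UNSAFE_TOOL_USE = "unsafe_tool_use"
    FORMATTING_ERROR = "formatting_error"
    MISSED_CONSTRAINT = "missed_constraint"


def tally_failures(labels: list[FailureType]) -> list[tuple[FailureType, int]]:
    """Count labeled failures, most frequent category first."""
    return Counter(labels).most_common()


# Labels a reviewer might assign to three failed runs (made-up example data).
ranked = tally_failures(
    [FailureType.BAD_RETRIEVAL, FailureType.FORMATTING_ERROR, FailureType.BAD_RETRIEVAL]
)
```

A ranked tally like this points at the category worth fixing first.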

Golden set

A curated set of examples with trusted labels, expected answers, or target states. Golden sets are often used for regression testing, judge calibration, and manual review of high-value behaviors.

Ground truth

The reference answer, label, target state, or verified external outcome used to decide whether a system succeeded. For agents, ground truth may be a database state, file diff, or transaction result.

Hallucination

A fluent but unsupported, fabricated, or false output. In agent systems, hallucination can also mean claiming that a tool action succeeded when the external state did not actually change.

Inter-rater agreement

The degree to which multiple human graders give the same labels or scores. Low agreement often means the rubric is ambiguous, the task is subjective, or more grader training is needed.
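
For two graders, a standard agreement statistic is Cohen's kappa, which corrects raw agreement for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The following is a minimal sketch; the function name and the list-of-labels input are assumptions, and settings with more than two graders usually call for a different statistic.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)

    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of each rater's label frequencies, summed over labels.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if p_e == 1.0:  # both raters used one identical label for every item
        return 1.0
    return (p_o - p_e) / (1 - p_e)
```

A kappa of 1 means perfect agreement and 0 means agreement no better than chance. For example, two graders who agree on three of four pass/fail labels may still have a kappa of only 0.5.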

Judge model

An LLM used to grade another model or agent. Judge models scale open-ended evaluation, but they need calibration against humans or deterministic checks because they can be biased or inconsistent.
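
One simple way to calibrate a judge is to compare its pass/fail verdicts with human labels on the same items and look at the two error directions separately. This is a sketch under assumptions: `judge_agreement` is a hypothetical helper, and verdicts are taken to be booleans where `True` means pass.

```python
def judge_agreement(judge: list[bool], human: list[bool]) -> dict[str, float]:
    """Compare judge verdicts with human labels (True = pass) on the same items."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need the same non-empty set of items from judge and human")
    n = len(judge)
    agree = sum(j == h for j, h in zip(judge, human))
    false_pass = sum(j and not h for j, h in zip(judge, human))  # judge too lenient
    false_fail = sum(h and not j for j, h in zip(judge, human))  # judge too strict
    human_fail = sum(not h for h in human)
    human_pass = n - human_fail
    return {
        "agreement": agree / n,
        # Share of human-failed items that the judge passed.
        "false_pass_rate": false_pass / human_fail if human_fail else 0.0,
        # Share of human-passed items that the judge failed.
        "false_fail_rate": false_fail / human_pass if human_pass else 0.0,
    }
```

Splitting the errors matters because a lenient judge inflates scores while a strict one hides real improvements.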

Pass@k

The probability that at least one of k attempts succeeds. It is useful when retries, sampling, or best-of-k selection are allowed, and it measures potential capability more than consistency.
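
If attempts are independent with per-attempt success probability p, pass@k equals 1 - (1 - p)^k. When a task has been sampled n times with c successes, a widely used unbiased estimate is 1 - C(n - c, k) / C(n, k): the chance that a random size-k subset of the n samples contains at least one success. The sketch below implements that estimate; the function name and argument order are assumptions.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled attempts, c of which succeeded."""
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # comb(n - c, k) is 0 when fewer than k attempts failed, giving pass@k = 1.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With k = 1 the estimate reduces to the plain success rate c / n, and it rises toward 1 as k grows.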

Pass^k

The probability that all k attempts succeed. It is stricter than pass@k and better reflects reliability for customer-facing workflows where repeated attempts should all work.
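
If attempts are independent with per-attempt success probability p, pass^k equals p^k. From n recorded trials with c successes, one estimate is C(c, k) / C(n, k): the chance that a random size-k subset of the trials contains only successes. The sketch below implements that estimate; `pass_hat_k` is a hypothetical name.

```python
from math import comb


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate pass^k (all k attempts succeed) from n trials with c successes."""
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # comb(c, k) is 0 when there are fewer than k successes, giving pass^k = 0.
    return comb(c, k) / comb(n, k)
```

The metric decays quickly with k: under independence, a task that succeeds 90% of the time per attempt has a pass^8 of about 0.43.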

Prompt injection

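A security and reliability risk in which untrusted text, such as a web page, retrieved document, or tool output, attempts to override instructions, leak secrets, manipulate tools, or change the agent's goal. It matters most in browser and retrieval systems, which routinely ingest content the operator does not control.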

Red team eval

An evaluation that actively searches for harmful, unsafe, or policy-violating behavior. Red teaming often uses adversarial prompts, malicious documents, unusual workflows, or expert testers.

Reliability

How consistently a system succeeds across repeated trials, varied inputs, and realistic operating conditions. Reliability differs from peak capability: a model may solve a hard task on some attempts yet still fail on many others.

Retrieval eval

An evaluation of whether a system finds, ranks, and uses the right source documents. It can measure recall, precision, citation faithfulness, and whether retrieved context actually improves answers.
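
Two of the measures named above, precision and recall, are usually computed over the top k retrieved documents. The following is a minimal sketch; the function name is hypothetical, and it assumes document IDs are unique within the ranked list and that the set of relevant documents is known.

```python
def precision_recall_at_k(
    retrieved: list[str], relevant: set[str], k: int
) -> tuple[float, float]:
    """Precision@k and recall@k for one query.

    `retrieved` is the ranked list of document IDs, best first (assumed unique).
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    top_k = retrieved[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    precision = hits / len(top_k) if top_k else 0.0  # share of returned docs that are relevant
    recall = hits / len(relevant) if relevant else 0.0  # share of relevant docs that were returned
    return precision, recall
```

These scores only cover finding and ranking. Citation faithfulness and whether the context improves answers need separate checks on the generated output.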

Rubric

A written grading standard that defines success, partial credit, failure, and edge cases. Good rubrics reduce grader ambiguity and make model-based or human judgments easier to audit.

Safety eval

An evaluation focused on harmful behavior, policy violations, misuse risk, privacy leakage, prompt injection, unsafe tool use, or failure to refuse dangerous requests.

Tool

An external capability an AI system can call, such as search, code execution, browser control, file access, APIs, calendars, databases, or messaging systems.

Trace

The recorded sequence of messages, tool calls, observations, errors, decisions, and state changes during a run. Traces help explain why an aggregate score changed or why an agent failed.

Trajectory

The full path an agent takes through a task, including planning, intermediate steps, tool use, recoveries, and final outcome. Two agents may reach the same answer through very different trajectories.

Worst-case eval

An evaluation designed to expose rare but severe failures rather than average behavior. It is useful for safety, security, and reliability work where tail risk matters.