---
title: "Evals Are the Contract: How to Build an Eval Set Before You Trust an AI Deployment"
description: "A practical guide to building an eval set for an AI deployment: what to measure, how to collect cases, how to grade, and why evals are the contract between a pilot and production."
date: "2026-05-17"
status: "published"
---

# Evals Are the Contract: How to Build an Eval Set Before You Trust an AI Deployment

Most teams discover they need evals the week after they ship without them.

Something changed. A model version. A prompt. A retrieval index. A tool. The system "feels" different, and nobody can say whether it got better or worse. Someone runs the same three questions they always run, the answers look fine, and the change goes out. Two weeks later a user finds the output that quietly broke, and the team learns the hard way that "looks fine on my examples" is not a measurement.

An eval set is the fix. It is the contract between a pilot and production: a written, runnable definition of what "good" means for this system, on this workflow, with these constraints. Without it, every change is a guess and every rollback is an argument.

This is a field guide to building one before you trust a deployment, not after you regret one.

## What an eval set actually is

An eval set is a collection of representative inputs, each paired with a way to judge the output. That is the whole idea. The discipline is in the details.

A useful eval set has three parts:

- **Cases.** Real inputs the system will face, including the ugly ones.
- **Expectations.** What a correct, acceptable, or unacceptable response looks like for each case.
- **A grader.** A repeatable way to score the output against the expectation, whether that is an exact match, a rule, a human rubric, or another model acting as judge.
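As a minimal sketch, the three parts can be held in one small record. The names here (`EvalCase`, `run_case`, `fake_system`) and the refund example are hypothetical, chosen for illustration rather than taken from any particular eval framework.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    input_text: str                 # the case: an input the system will face
    expectation: str                # what "good" means, written for humans
    grader: Callable[[str], bool]   # repeatable check: True when the output passes


def run_case(case: EvalCase, system: Callable[[str], str]) -> bool:
    """Send the case input through the system and grade the output."""
    output = system(case.input_text)
    return case.grader(output)


# Hypothetical case: the answer must state the 30-day return window.
refund_case = EvalCase(
    case_id="refund-window",
    input_text="How long do I have to return an item?",
    expectation="States the 30-day return window.",
    grader=lambda output: "30 days" in output,
)


def fake_system(prompt: str) -> str:
    """Stand-in for the deployed system, so the sketch runs on its own."""
    return "You can return items within 30 days of delivery."


if __name__ == "__main__":
    print(run_case(refund_case, fake_system))
```

Keeping the expectation as prose next to the grader is a deliberate choice in this sketch: the prose is what a reviewer reads, the grader is what the suite runs.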

OpenAI frames evaluations as a way to test whether outputs meet criteria you specify, especially when you change prompts or models. Anthropic's evaluation tooling is built around running prompts across scenarios and rerunning the suite after changes. NIST's AI Risk Management Framework treats measurement as a core function, not an afterthought. They are all describing the same loop: define good, measure against it, and rerun the measurement every time something changes.

## Why a demo is not an eval

A demo proves the system can produce a good answer once. An eval proves the system produces good answers across the distribution of inputs it will actually see.

The gap between those two is where pilots die.

A demo is curated. Someone picked the question, knew the answer existed in the data, and avoided the edge cases on purpose. Production is not curated. Users ask the question three different ways, reference a document that was deleted, paste in a table, or ask something the system was never meant to handle. If your only evidence is a demo, you have measured the best case and called it the expected case.

## How to build the set

### 1. Collect cases from reality, not imagination

The fastest way to build a weak eval set is to invent the questions yourself. You will write the questions the system already answers well.

Pull cases from where the work actually happens: support tickets, Slack threads, search logs, CRM notes, past emails, the spreadsheet the workflow owner keeps. If the system is live in a limited pilot, log every real input. Aim for coverage over volume. Forty cases that span the workflow beat four hundred variations of the same easy question.

### 2. Deliberately include the failure cases

A good eval set is mostly the cases you are afraid of.

- The question the data cannot answer, where the right behavior is "I don't know," not a confident guess.
- The input with a permission boundary, where the right behavior is to refuse or scope down.
- The ambiguous request that needs a clarifying question.
- The adversarial input designed to make the model ignore its instructions.
- The stale or contradictory source, where the right behavior is to flag the conflict.

If your eval set has no failure cases, it is not measuring whether the system is safe to deploy. It is measuring whether it is pleasant in a demo.
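One way to keep yourself honest is to tag each case and check that every failure category above is represented. The sketch below assumes cases are plain dictionaries with a `tags` list; the tag names are made up for this example and should be replaced with whatever taxonomy fits your workflow.

```python
from collections import Counter

# Hypothetical tag names, one per failure category listed above.
REQUIRED_TAGS = {
    "unanswerable",
    "permission",
    "ambiguous",
    "adversarial",
    "stale_source",
}


def missing_failure_tags(cases: list[dict]) -> set[str]:
    """Return the required failure tags that no case in the set carries."""
    seen = Counter(tag for case in cases for tag in case.get("tags", []))
    return {tag for tag in REQUIRED_TAGS if seen[tag] == 0}


example_cases = [
    {"id": "c1", "tags": ["happy_path"]},
    {"id": "c2", "tags": ["unanswerable"]},
    {"id": "c3", "tags": ["adversarial", "permission"]},
]

if __name__ == "__main__":
    print(sorted(missing_failure_tags(example_cases)))
```

A check like this only proves a category is present, not that it is covered well. It is a floor, not a target.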

### 3. Write the expectation before you see the output

Decide what good looks like for each case before you run the system. This sounds obvious and is constantly violated. When you write the expectation after seeing the output, you rationalize whatever the model produced. The grade drifts to match the system instead of the standard.

Expectations do not have to be exact strings. They can be:

- **Must contain** a specific fact or citation.
- **Must not contain** a category of content, a hallucinated entity, or an action outside policy.
- **Must cite** its source for any factual claim.
- **Must refuse** or escalate.
- **Must match** a rubric a human or a model grader can apply consistently.
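The first four expectation types can be checked with plain rules. The following is a rough sketch under loud assumptions: the `[source: ...]` citation format and the refusal phrases are invented for this example, and a real deployment would need patterns that match its own output conventions. Rubric matching is left out because it needs a human or a model judge.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Expectation:
    must_contain: list[str] = field(default_factory=list)
    must_not_contain: list[str] = field(default_factory=list)
    must_cite: bool = False
    must_refuse: bool = False


# Assumed conventions, not a standard: citations look like "[source: name]"
# and refusals use one of these phrases.
CITATION_PATTERN = re.compile(r"\[source:\s*[^\]]+\]", re.IGNORECASE)
REFUSAL_MARKERS = ("i don't know", "i can't help", "i cannot help")


def check(output: str, exp: Expectation) -> list[str]:
    """Return a list of failure reasons. An empty list means the output passes."""
    lowered = output.lower()
    failures = []
    for phrase in exp.must_contain:
        if phrase.lower() not in lowered:
            failures.append(f"missing required phrase: {phrase!r}")
    for phrase in exp.must_not_contain:
        if phrase.lower() in lowered:
            failures.append(f"contains forbidden phrase: {phrase!r}")
    if exp.must_cite and not CITATION_PATTERN.search(output):
        failures.append("no citation found")
    if exp.must_refuse and not any(m in lowered for m in REFUSAL_MARKERS):
        failures.append("expected a refusal")
    return failures


if __name__ == "__main__":
    exp = Expectation(must_contain=["30 days"], must_cite=True)
    print(check("The window is 30 days [source: policy.md]", exp))
```

Returning reasons rather than a bare pass or fail makes a failed run readable without rerunning the system.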

### 4. Choose a grader you can run a hundred times

Manual review is usually the most accurate grader and the least scalable. A model-as-judge is scalable and introduces its own noise. Rules and exact matches are cheap and brittle. A common pattern in production is a mix: rules for the things that are checkable, a model judge for the things that need reading, and human spot-checks to keep the judge honest.

The non-negotiable property is repeatability. If you cannot run the grader unattended after every change, you do not have an eval. You have a habit of looking.
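Keeping a model judge honest can be as simple as comparing its verdicts to human verdicts on a spot-check sample. This is a sketch only: `judge_agreement` is a hypothetical helper, verdicts are assumed to be pass/fail booleans keyed by case ID, and raw agreement is a crude measure that ignores chance agreement.

```python
def judge_agreement(human: dict[str, bool], judge: dict[str, bool]) -> float:
    """Fraction of spot-checked cases where the judge matches the human verdict.

    Only cases graded by both are compared.
    """
    shared = human.keys() & judge.keys()
    if not shared:
        raise ValueError("no overlapping cases to compare")
    matches = sum(human[case_id] == judge[case_id] for case_id in shared)
    return matches / len(shared)


# Hypothetical verdicts: the human graded four cases, the judge graded five.
human_verdicts = {"a": True, "b": False, "c": True, "d": True}
judge_verdicts = {"a": True, "b": True, "c": True, "d": True, "e": False}

if __name__ == "__main__":
    print(judge_agreement(human_verdicts, judge_verdicts))
```

What counts as acceptable agreement is a judgment call for the workflow owner. The point is to track the number over time rather than assume the judge is right.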

## What to measure beyond "is the answer good"

Answer quality is necessary but not sufficient. Production systems fail on dimensions a single quality score hides.

| Dimension | The question it answers |
| --- | --- |
| Accuracy | Is the answer correct against the expectation? |
| Grounding | Is every factual claim supported by a real source, with a citation? |
| Refusal behavior | Does it say "I don't know" instead of guessing when the data is missing? |
| Safety | Does it resist prompt injection and stay inside permission boundaries? |
| Consistency | Does the same input produce a stable answer across runs? |
| Cost and latency | Does the answer arrive within budget and time limits? |

A system that is accurate but ungrounded, or accurate but too expensive, or accurate but happy to be jailbroken, is not ready. The eval set should reflect everything you would be embarrassed to discover in production.
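Some of these dimensions are easy to turn into a number. As one illustration, consistency can be approximated by running the same input several times and measuring how often the most common answer appears. The sketch below assumes that exact match after lowercasing and whitespace cleanup is a fair notion of "same answer", which it often is not for free-form text; a rubric or model judge would be needed to compare answers by meaning.

```python
from collections import Counter


def consistency(outputs: list[str]) -> float:
    """Share of runs that produced the most common output.

    Outputs are compared after lowercasing and collapsing whitespace.
    1.0 means every run agreed.
    """
    if not outputs:
        raise ValueError("need at least one output")
    normalized = [" ".join(text.lower().split()) for text in outputs]
    most_common_count = Counter(normalized).most_common(1)[0][1]
    return most_common_count / len(normalized)


if __name__ == "__main__":
    runs = ["Yes.", "yes.", "No."]
    print(round(consistency(runs), 2))
```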

## Wiring evals into the deployment loop

An eval set is only worth the cost of building it if it runs at the moments that matter.

- **Before any change ships.** New prompt, new model, new retrieval config, new tool. Run the suite. Compare to the last known-good baseline. A regression on a failure case should block the change.
- **On a schedule in production.** Data drifts. Sources go stale. A suite that passed in March can quietly fail in June because the underlying documents changed. Rerun against live data.
- **When monitoring catches a miss.** Every real-world failure becomes a new case. The eval set should grow over the life of the deployment, not freeze at launch.

This is the part that separates an eval set from a one-time test. The set is a living artifact. Field pain becomes a case. The case prevents the same pain from shipping twice.
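The "compare to the last known-good baseline" step can be sketched as a small gate. Everything here is an assumption for illustration: results are pass/fail booleans keyed by case ID, `blocking_ids` is the set of cases (typically the failure cases) that must never regress, and a case missing from the current run is treated as a failure.

```python
def find_regressions(baseline: dict[str, bool], current: dict[str, bool]) -> list[str]:
    """Case IDs that passed in the baseline but fail, or are missing, in the current run."""
    return sorted(
        case_id
        for case_id, passed in baseline.items()
        if passed and not current.get(case_id, False)
    )


def may_ship(
    baseline: dict[str, bool],
    current: dict[str, bool],
    blocking_ids: set[str],
) -> bool:
    """True when no blocking case has regressed against the baseline."""
    return not any(cid in blocking_ids for cid in find_regressions(baseline, current))


# Hypothetical results from two runs of the same suite.
baseline_run = {"refund-window": True, "refuse-salary": True, "stale-doc": False}
current_run = {"refund-window": True, "refuse-salary": False, "stale-doc": True}

if __name__ == "__main__":
    print(find_regressions(baseline_run, current_run))
    print(may_ship(baseline_run, current_run, blocking_ids={"refuse-salary"}))
```

In a CI job, the result of `may_ship` would decide the exit code. When a production miss becomes a new case, it enters the baseline as a failure first and becomes blocking once it passes.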

## The handoff value

Here is the quiet reason evals matter for forward-deployed work specifically: an eval set is the most transferable thing you can leave behind.

A deployment that only the original builder can judge is a hostage situation. The next engineer inherits the code, but not the judgment. An eval set encodes the judgment. It tells the inheritor what this system is supposed to do, which cases are dangerous, and how to know if their change made things worse. It is the difference between handing over a black box and handing over a contract.

When a deployment leaves behind a runnable eval set, the next person can change the system with confidence. When it does not, the next person is afraid to touch anything, and the deployment slowly rots because nobody can prove a change is safe.

## The minimum bar

If you are deciding whether a pilot is ready to be trusted, ask five questions about its evals:

1. Do real cases exist, drawn from the actual workflow, not invented?
2. Do the cases include the failures you are afraid of, not just the happy path?
3. Were the expectations written before the outputs were seen?
4. Can the grader run unattended after every change?
5. Does the set grow when production surfaces a new failure?

If the answer to any of these is no, you do not yet have a measurement. You have an opinion about a system you are about to trust with real work.

Evals are not paperwork. They are the contract. Sign it before you deploy, not after.

## Sources

- OpenAI evaluations guide: <https://platform.openai.com/docs/guides/evals>
- Anthropic on building evaluations: <https://docs.anthropic.com/en/docs/test-and-evaluate/develop-tests>
- NIST AI Risk Management Framework: <https://www.nist.gov/itl/ai-risk-management-framework>
- OWASP Top 10 for LLM Applications: <https://owasp.org/www-project-top-10-for-large-language-model-applications/>