---
title: "AI Deployment Checklist: 15 Things to Prove Before a Pilot Becomes Production"
description: "A practical AI deployment checklist for moving pilots, agents, copilots, and RAG systems into production with evals, security, monitoring, adoption, and handoff."
date: "2026-05-14"
status: "published"
---

# AI Deployment Checklist: 15 Things to Prove Before a Pilot Becomes Production

An AI deployment checklist is a set of proofs a team should complete before moving an AI pilot into production. At minimum, it should cover workflow ownership, data boundaries, permissions, evals, human review, logging, monitoring, cost controls, prompt-injection defenses, security, rollback, adoption, and handoff.

The point is not paperwork. The point is to stop calling a demo a deployment.

AI pilots fail for boring reasons. The model works, but the data is stale. The prompt works, but nobody owns the workflow. The agent works in a sandbox, but production permissions break it. The RAG assistant answers ten sample questions, but nobody has an eval set. Security asks where the data goes. Users do not trust the output. The builder leaves, and nobody can explain the system two weeks later.

That is not an AI capability problem. It is a deployment problem.

This checklist is written for teams trying to move an AI pilot, internal copilot, RAG assistant, workflow automation, or AI agent into real use. It is also a diagnostic for knowing when to bring in a Forward Deployed Engineer.

## Why do AI pilots die before production?

Most AI pilots are built to prove possibility. Production systems have to prove reliability.

That is a different standard.

A pilot asks:

- Can the model do the task?
- Does the demo look useful?
- Can we connect the obvious data source?
- Will the stakeholder say yes?

Production asks:

- Who owns the workflow?
- Which data is authoritative?
- Who is allowed to see what?
- How do we know the answer is good?
- What happens when the model is wrong?
- What gets logged?
- What happens if cost, latency, or tool calls spike?
- Who is paged or notified when it breaks?
- Can another engineer inherit the system?
- Are users actually changing behavior?

The gap between those two lists is where most enterprise AI work stalls.

OpenAI's eval documentation frames evaluations as a way to test whether model outputs meet the criteria you specify, especially when changing prompts or models. Anthropic's evaluation tooling is built around testing prompts across scenarios and rerunning suites after changes. NIST's AI Risk Management Framework focuses on identifying and managing AI risks across design, development, use, and evaluation. OWASP's 2025 LLM Top 10 exists because LLM applications introduce security risks that traditional app checklists often miss, including prompt injection, system prompt leakage, vector and embedding weaknesses, excessive agency, and unbounded consumption.

The industry is converging on the same answer: production AI needs evidence.

## What should an AI deployment checklist include?

Use this checklist before you call an AI pilot production.

If you cannot answer most of these with evidence, you do not have a deployment yet. You have a promising prototype.

### 1. Is there a named workflow owner?

Every AI system needs a business owner who owns the workflow, not just a technical owner who owns the code.

The workflow owner should be able to explain the job the system changes, which users depend on it, what decision or action it affects, what failure would be unacceptable, and who can approve rollout or rollback.

If nobody owns the workflow, nobody owns whether the AI system is useful.

### 2. Is the production use case narrow enough to evaluate?

"Use AI for customer support" is not a production use case.

"Classify inbound support tickets into six routing categories, draft a suggested reply, and send low-confidence cases to a human queue" is closer.

A production use case should define:

- Inputs.
- Outputs.
- Users.
- Success criteria.
- Failure modes.
- Human review points.
- Systems touched.
- Boundaries of what the AI must not do.

If the use case cannot be evaluated, it is too vague to deploy.
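
One lightweight way to force that specificity is to write the use case down as a structured record the team reviews together. A minimal sketch, assuming a Python codebase; the field names and the ticket-triage example are illustrative, not a standard:

```python
from dataclasses import dataclass

@dataclass
class UseCaseSpec:
    """A production use case narrow enough to evaluate."""
    name: str
    inputs: list[str]             # e.g. "inbound ticket subject and body"
    outputs: list[str]            # e.g. "routing category", "draft reply"
    users: list[str]              # who depends on the output
    success_criteria: list[str]   # measurable, not "feels useful"
    failure_modes: list[str]      # what unacceptable failure looks like
    human_review_points: list[str]
    systems_touched: list[str]
    must_not_do: list[str]        # explicit boundaries

# Illustrative example matching the ticket-triage use case above.
ticket_triage = UseCaseSpec(
    name="support-ticket-triage",
    inputs=["inbound ticket subject and body"],
    outputs=["one of six routing categories", "suggested reply draft"],
    users=["tier-1 support agents"],
    success_criteria=["routing accuracy on the eval set", "draft acceptance rate"],
    failure_modes=["confident misroute of a security incident"],
    human_review_points=["all low-confidence classifications"],
    systems_touched=["helpdesk API (read)", "reply drafts (write, unsent)"],
    must_not_do=["send a reply to a customer without human approval"],
)
```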

### 3. Is there a source of truth for the data?

Many AI pilots quietly assume the data is cleaner than it is.

Before production, name the source of truth for every important input:

- CRM fields.
- Support tickets.
- Contract terms.
- Product docs.
- Customer records.
- Internal policy.
- Inventory or billing data.
- Knowledge base articles.

Then ask the ugly questions:

- Which fields are stale?
- Which teams use fields differently?
- Which records are duplicated?
- Which data should never enter the model context?
- How often does the source update?
- Who can correct bad data?

The AI system cannot be more trustworthy than the operational data underneath it.
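
One cheap guardrail that follows from those questions: check freshness before a record is allowed into the model context at all. A minimal sketch, assuming records carry a `last_updated` ISO timestamp; the 90-day cutoff is illustrative:

```python
from datetime import datetime, timedelta, timezone

def fresh_enough(record: dict, max_age: timedelta = timedelta(days=90)) -> bool:
    """Return True if the record was updated recently enough to trust in context.

    Stale records are excluded or flagged for review rather than silently
    summarized as if they were current.
    """
    last_updated = datetime.fromisoformat(record["last_updated"])
    if last_updated.tzinfo is None:
        last_updated = last_updated.replace(tzinfo=timezone.utc)
    return datetime.now(timezone.utc) - last_updated <= max_age
```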

### 4. Are permissions and data boundaries explicit?

Production AI inherits the permission mess of the business.

A user should not get access to information through an AI assistant that they could not access directly. An agent should not take actions outside the user's authority. A retrieval system should not mix confidential documents into answers for the wrong audience.

Before production, document:

- Authentication.
- Role-based access.
- Tenant or customer boundaries.
- Document-level permissions.
- Tool permissions.
- System prompt exposure assumptions.
- Data retention.
- Sensitive fields excluded from context.
- Whether data is sent to third-party APIs.

This is where many pilots die. The sandbox had one service account. Production has real boundaries.
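
A concrete way to hold the boundary: filter by the caller's permissions before anything reaches the model context, not after. A minimal sketch, assuming documents carry access-control labels resolved at ingestion time; the `Document` fields and `retrieve_for_user` are placeholders for whatever your retrieval stack provides:

```python
from dataclasses import dataclass

@dataclass
class Document:
    doc_id: str
    text: str
    allowed_roles: set[str]   # ACL resolved at ingestion time
    tenant_id: str

def retrieve_for_user(query: str, user_roles: set[str], tenant_id: str,
                      candidates: list[Document], k: int = 5) -> list[Document]:
    """Return only documents this user could open directly.

    Permission filtering happens BEFORE ranking and BEFORE anything is
    placed into the model context, so the assistant can never summarize
    a document the user is not allowed to read.
    """
    visible = [
        doc for doc in candidates
        if doc.tenant_id == tenant_id and (doc.allowed_roles & user_roles)
    ]
    # Placeholder ranking: a real system would rank `visible` against `query`
    # (BM25, embeddings, a reranker) before truncating to k.
    return visible[:k]
```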

### 5. Is there an eval set?

An eval set is the difference between "it seems good" and "we can measure whether it got better or worse."

OpenAI describes evals as a way to test model outputs against specified criteria and iterate on prompts or models. That matters because AI behavior changes when you change models, prompts, tools, retrieval logic, chunking, permissions, or input data.

Your eval set should include:

- Normal cases.
- Edge cases.
- Ambiguous cases.
- Known failure cases.
- High-risk cases.
- Examples from real user behavior.
- Negative examples where the system should refuse, escalate, or ask for clarification.

For a RAG assistant, include questions with known answers, source expectations, and cases where the answer is not in the corpus.

For an agent, include task traces, tool choices, action limits, and expected stopping behavior.

For workflow automation, include regression cases from the existing process.
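
A minimal sketch of what that can look like in code, assuming a simple Python harness; the grading is a plain substring check, `run_system` stands in for whatever calls your pilot end to end, and the two sample cases are invented:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    case_id: str
    input_text: str
    expected: str | None          # None when the right behavior is to refuse or escalate
    must_contain: list[str]       # e.g. a required citation or source id
    must_not_contain: list[str]   # e.g. content that would indicate a leak
    kind: str                     # "normal", "edge", "ambiguous", "high_risk", "negative"

def run_eval(cases: list[EvalCase], run_system: Callable[[str], str]) -> float:
    """Run every case through the system and return the pass rate."""
    passed = 0
    for case in cases:
        output = run_system(case.input_text)
        ok = all(s in output for s in case.must_contain)
        ok = ok and not any(s in output for s in case.must_not_contain)
        if case.expected is not None:
            ok = ok and case.expected in output
        passed += int(ok)
    return passed / len(cases) if cases else 0.0

# Invented examples: one normal case, one answer-not-in-corpus case.
cases = [
    EvalCase("refund-policy-normal", "What is the refund window for annual plans?",
             expected="30 days", must_contain=["policy-2024-refunds"],
             must_not_contain=[], kind="normal"),
    EvalCase("not-in-corpus", "What is our policy on crypto payments?",
             expected=None, must_contain=["could not find"],
             must_not_contain=["policy-"], kind="negative"),
]
```

The grading logic matters less than having the cases written down, versioned, and rerun on every prompt, model, or retrieval change.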

### 6. Are success and failure thresholds defined?

An eval without a threshold is a dashboard, not a gate.

Before production, define what score is good enough to launch, which failures block launch entirely, which failures are acceptable with human review, what regression triggers rollback, and which metrics matter by user group or workflow type.

Accuracy alone is rarely enough. You may need to track groundedness, refusal quality, escalation rate, latency, cost, task completion, user override rate, and downstream business outcomes.

The threshold does not have to be perfect. It has to be explicit.
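
Turning the eval into a gate can be as small as a script that fails the release when a slice drops below its minimum. A minimal sketch; the per-slice thresholds are illustrative, not recommendations:

```python
def check_launch_gate(scores: dict[str, float]) -> list[str]:
    """Return the list of blocking failures; an empty list means the gate passes.

    Thresholds are per-slice because an overall average can hide a
    regression in exactly the cases that matter most.
    """
    # Illustrative thresholds -- each team has to pick and defend its own.
    gates = {
        "normal": 0.95,      # routine cases
        "high_risk": 1.00,   # any failure here blocks launch
        "negative": 0.90,    # refusals and escalations the system must get right
    }
    failures = []
    for slice_name, minimum in gates.items():
        score = scores.get(slice_name, 0.0)
        if score < minimum:
            failures.append(f"{slice_name}: {score:.2%} < required {minimum:.2%}")
    return failures

blockers = check_launch_gate({"normal": 0.97, "high_risk": 0.98, "negative": 0.92})
if blockers:
    raise SystemExit("Launch gate failed:\n" + "\n".join(blockers))
```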

### 7. Are human review gates designed into the workflow?

Human-in-the-loop is not a magic phrase. It is a workflow design choice.

Define where humans review:

- Before an external message is sent.
- Before a record is updated.
- Before money moves.
- Before a customer is escalated.
- Before a policy decision is made.
- When confidence is low.
- When the system detects ambiguity.
- When the action is irreversible.

Also define what the human sees. A reviewer needs enough context to make a decision: model output, source documents, confidence, reasoning trace when appropriate, suggested action, and the alternative path.

If review is too slow or vague, users will bypass it. If review is missing, the system may create unacceptable risk.
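
As a sketch of that design choice, the routing rules can be explicit and boring, so reviewers can predict why an item landed in front of them. The field names and the confidence threshold below are illustrative:

```python
from dataclasses import dataclass

@dataclass
class DraftAction:
    kind: str             # "send_reply", "update_record", "escalate", ...
    confidence: float     # model- or heuristic-derived, 0..1
    irreversible: bool    # e.g. external email, payment, deletion
    output: str
    sources: list[str]    # shown to the reviewer alongside the draft

def needs_human_review(action: DraftAction) -> bool:
    """Decide whether a draft action goes to the review queue.

    The rules are deliberately simple so that reviewers and users can
    predict when the system will stop and ask.
    """
    if action.irreversible:
        return True                      # money, external messages, deletions
    if action.kind in {"update_record", "escalate"}:
        return True                      # writes and policy decisions
    if action.confidence < 0.8:          # illustrative threshold
        return True                      # low confidence
    return False
```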

### 8. Are failure modes documented?

Every AI system has predictable ways to fail.

Write them down before launch:

- Hallucinated answer.
- Wrong source.
- Stale source.
- Tool call to the wrong system.
- Permission leak.
- Prompt injection.
- System prompt leakage.
- Vector or embedding weakness.
- Overconfident classification.
- Bad escalation.
- Duplicate action.
- Silent non-action.
- Cost spike or unbounded consumption.
- Latency spike.
- User overreliance.

OWASP's 2025 LLM Top 10 is useful here because LLM apps have failure modes that look different from ordinary web apps: prompt injection, sensitive information disclosure, system prompt leakage, vector and embedding weaknesses, excessive agency, unbounded consumption, and supply chain risks, among others. OWASP also maintains guidance for agentic applications, which is useful when your AI system can plan, call tools, or take actions.

The point is not to imagine every possible disaster. The point is to name the likely ones and design checks around them.
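
Some of these failure modes can be bounded mechanically rather than just documented. A minimal sketch of a per-task budget that caps tool calls, wall-clock time, and estimated spend, so a looping agent fails loudly instead of burning quota; the class name and limits are illustrative:

```python
import time

class ToolBudgetExceeded(RuntimeError):
    """Raised when an agent run exceeds its per-task limits."""

class ToolBudget:
    """Hard caps on tool calls, elapsed time, and estimated cost per task."""

    def __init__(self, max_calls: int = 20, max_seconds: float = 120.0,
                 max_cost_usd: float = 0.50):
        self.max_calls = max_calls
        self.max_seconds = max_seconds
        self.max_cost_usd = max_cost_usd
        self.calls = 0
        self.cost_usd = 0.0
        self.started = time.monotonic()

    def charge(self, estimated_cost_usd: float = 0.0) -> None:
        """Call before every tool invocation; raises when any cap is hit."""
        self.calls += 1
        self.cost_usd += estimated_cost_usd
        if self.calls > self.max_calls:
            raise ToolBudgetExceeded(f"tool call limit {self.max_calls} exceeded")
        if time.monotonic() - self.started > self.max_seconds:
            raise ToolBudgetExceeded(f"time limit {self.max_seconds}s exceeded")
        if self.cost_usd > self.max_cost_usd:
            raise ToolBudgetExceeded(f"cost limit ${self.max_cost_usd} exceeded")
```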

### 9. Are logs and traces useful?

If the system fails and you cannot reconstruct what happened, it is not ready.

Log enough to answer:

- Who used the system?
- What input did they provide?
- What context was retrieved?
- What model or prompt version ran?
- What tools were called?
- What output was produced?
- What action was taken?
- What did the user accept, edit, reject, or override?
- What errors occurred?

Be careful with sensitive data. Logging everything can create its own security problem. The goal is useful observability, not a surveillance landfill.
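
A minimal sketch of a per-request trace record, assuming JSON-lines logging; the field names are illustrative, and raw inputs and outputs are hashed here rather than stored verbatim, which is one way to keep traces useful without logging sensitive content:

```python
import hashlib
import json
import time

def log_trace(log_file, *, user_id: str, input_text: str, retrieved_ids: list[str],
              prompt_version: str, model: str, tools_called: list[str],
              output_text: str, action_taken: str, user_decision: str,
              error: str | None = None) -> None:
    """Append one structured trace record per request as a JSON line.

    Input and output are hashed here; whether to store them verbatim depends
    on your retention rules and how sensitive the content is.
    """
    record = {
        "ts": time.time(),
        "user_id": user_id,
        "input_sha256": hashlib.sha256(input_text.encode()).hexdigest(),
        "retrieved_ids": retrieved_ids,        # what context was used
        "prompt_version": prompt_version,      # which prompt ran
        "model": model,                        # which model ran
        "tools_called": tools_called,
        "output_sha256": hashlib.sha256(output_text.encode()).hexdigest(),
        "action_taken": action_taken,          # what the system did
        "user_decision": user_decision,        # accepted / edited / rejected / overridden
        "error": error,
    }
    log_file.write(json.dumps(record) + "\n")
```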

### 10. Is monitoring in place after launch?

Production starts the day after launch.

Monitor:

- Quality.
- Latency.
- Cost.
- Token use.
- Tool-call loops.
- Rate-limit pressure.
- Error rate.
- Escalation rate.
- User adoption.
- Override rate.
- Retrieval misses.
- Tool failures.
- Policy violations.
- Regression against evals.

For AI systems, monitoring should connect production behavior back to the eval set. When users find new failure cases, those cases should become future tests.

If the eval set does not grow after launch, the deployment is not learning.
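
One way to make that loop concrete is to treat every confirmed production failure as a candidate eval case. A minimal sketch, assuming failures are reported with the original input and the behavior that was expected; the record shape mirrors the eval-case sketch in item 5:

```python
import json
from pathlib import Path

def file_failure_as_eval_case(eval_path: Path, *, case_id: str, input_text: str,
                              bad_output: str, expected_behavior: str,
                              kind: str = "regression") -> None:
    """Append a confirmed production failure to the eval set as a new case.

    The next eval run then automatically covers the failure a user just found,
    instead of the lesson living in a Slack thread.
    """
    case = {
        "case_id": case_id,
        "input_text": input_text,
        "observed_bad_output": bad_output,    # kept for the postmortem
        "expected_behavior": expected_behavior,
        "kind": kind,
    }
    with eval_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(case) + "\n")
```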

### 11. Has security reviewed the actual architecture?

Security should not review a vibe.

Give them:

- Data-flow diagram.
- Model providers.
- Tool permissions.
- Auth model.
- Secrets handling.
- Logging plan.
- Data retention policy.
- Third-party data exposure.
- Prompt injection risks.
- Human review gates.
- Rollback plan.

NIST's AI RMF and Generative AI Profile are useful because they map risk work to Govern, Map, Measure, and Manage functions. For deployment teams, that is practical: it turns an AI risk register into review evidence across lifecycle stage, system context, measurement, and mitigation. Security should see the AI system as a system, not a prompt.

### 12. Is there a rollback plan?

If you cannot turn it off safely, you are not ready to turn it on.

A rollback plan should name the switch that disables the system, who can trigger it, what happens to in-flight tasks, what manual workflow replaces the AI system, what data must be cleaned up, who must be notified, and which logs are reviewed after rollback.

Rollback is not pessimism. It is production hygiene.
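
A minimal sketch of the switch itself, assuming the application checks a flag before doing any AI work; here the flag is an environment variable, but in production it is usually a config service or feature-flag system so it can be flipped without a deploy. The function names and the manual-queue fallback are placeholders:

```python
import os

def ai_feature_enabled(feature: str) -> bool:
    """Check the kill switch before any AI call or AI-driven action."""
    return os.environ.get(f"AI_FEATURE_{feature.upper()}", "on") == "on"

def handle_ticket(ticket: dict) -> str:
    # The two functions below are placeholders for the real pipelines.
    if not ai_feature_enabled("ticket_triage"):
        return route_to_manual_queue(ticket)   # the documented manual workflow
    return run_ai_triage(ticket)               # the normal AI path

def route_to_manual_queue(ticket: dict) -> str:
    raise NotImplementedError("the pre-AI process the rollback plan names")

def run_ai_triage(ticket: dict) -> str:
    raise NotImplementedError("the AI pipeline being deployed")
```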

### 13. Are users trained on the new workflow?

Adoption is not automatic.

Users need to know:

- What the system is for.
- What it is not for.
- When to trust it.
- When to review it.
- How to correct it.
- How to report a bad output.
- What changed in their workflow.
- Who owns support.

Training should be practical. Show the real interface, real examples, and real exceptions. Nobody needs a speech about transformation. They need to know what to do on Monday.

### 14. Is there an adoption metric?

Shipping is not adoption.

Pick at least one adoption metric that proves behavior changed. Depending on the workflow, that might be weekly active users in the target group, percent of eligible tickets processed, draft acceptance rate, human override rate, time saved per workflow, reduction in backlog, increase in first-contact resolution, fewer manual handoffs, or quality improvement against baseline.

Do not count "the pilot launched" as success.

The question is whether users changed the way work gets done.
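
A minimal sketch of computing two of those signals from the trace log described in item 9, assuming each record carries a `user_decision` field; the decision labels are illustrative:

```python
import json
from pathlib import Path

def adoption_metrics(trace_path: Path) -> dict[str, float]:
    """Compute draft acceptance rate and override rate from trace records."""
    accepted = edited = rejected = overridden = 0
    for line in trace_path.read_text(encoding="utf-8").splitlines():
        decision = json.loads(line).get("user_decision")
        accepted += decision == "accepted"
        edited += decision == "edited"
        rejected += decision == "rejected"
        overridden += decision == "overridden"
    total = accepted + edited + rejected + overridden
    if total == 0:
        return {"acceptance_rate": 0.0, "override_rate": 0.0}
    return {
        "acceptance_rate": (accepted + edited) / total,   # drafts that were used
        "override_rate": overridden / total,              # humans undoing the system
    }
```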

### 15. Is there a handoff record?

A deployment only the original builder can explain is a liability.

Before production, leave:

- Architecture notes.
- Runbook.
- Eval set.
- Prompt and model versions.
- Data sources.
- Access decisions.
- Known limitations.
- Monitoring dashboard.
- Rollback instructions.
- Support owner.
- Product feedback.
- Next improvements.

The handoff record is where a prototype becomes a system someone else can operate.

## What changes by AI system type?

The same checklist applies across most AI deployments, but different systems stress different risks.

| System type | Extra proof before production |
| --- | --- |
| RAG assistant | Source grounding, retrieval evals, document permissions, stale-content handling, vector and embedding security, answer-not-found behavior |
| AI agent | Tool permissions, action limits, trace evals, approval gates, stopping behavior, system prompt leakage checks, rollback for actions |
| Internal copilot | User training, workflow fit, adoption metric, draft acceptance rate, escalation path |
| Workflow automation | Deterministic fallbacks, duplicate-action prevention, queue handling, audit logs, manual override |
| Customer-facing assistant | Safety policy, refusal behavior, brand tone, escalation to human support, monitoring for harmful outputs |

The dangerous mistake is treating all AI deployments as "chat with data." A retrieval assistant, a tool-using agent, and a workflow automation have different blast radii.

## What do most AI deployment checklists miss?

Most checklists over-index on model behavior and under-index on ownership.

They ask whether the model gives good answers. They ask less often whether anyone owns the workflow, whether users will trust the output, whether bad data can be corrected, whether the system can be rolled back, or whether another engineer can inherit the deployment.

The missing pieces are usually:

- Ownership.
- Adoption.
- Handoff.
- Product feedback.

That is why AI deployment is not just an engineering problem. It is a socio-technical system: code, data, people, incentives, risk, and operations.

## When should you bring in a Forward Deployed Engineer?

Bring in an FDE when the problem is not "can the model do this?" but "can this system survive our reality?"

That usually means:

- The workflow crosses multiple systems.
- The data is messy or permissioned.
- The use case is valuable but underspecified.
- The customer or business team cannot translate the workflow into technical requirements.
- Evals need to be built from scratch.
- Security and compliance matter.
- Adoption is uncertain.
- The deployment should teach the product team something reusable.

An FDE is useful when the hard part is the ugly middle: discovery, scope, build, integration, evals, rollout, and handoff.

That is also why this checklist should not stay theoretical. The useful question is not whether a team agrees with the list. It is how many items they can prove today.

## Free tool: AI Deployment Readiness Scorer

If you want to turn this checklist into a scored report, use the [AI Deployment Readiness Scorer](https://deployguild.dev/tools/ai-deployment-readiness). It checks one specific AI system, applies hard gates for critical blockers, and gives you the report immediately. Sharing contact details is optional if you want the PDF emailed and saved.

## FAQ

### What is an AI deployment checklist?

An AI deployment checklist is a list of production-readiness proofs for an AI system. It should cover workflow ownership, data quality, permissions, evals, human review, logging, monitoring, cost controls, security, rollback, user adoption, and handoff.

### What is the difference between an AI pilot and production?

An AI pilot proves that something might work. Production proves that it can run inside a real workflow with real users, real data, real permissions, quality checks, monitoring, rollback, and ownership.

### Do you need evals before deploying an AI system?

Yes. Evals are how you know whether the system improved or regressed when prompts, models, retrieval, tools, or data change. Without evals, teams rely on vibes and cherry-picked demos.

### What should be in a RAG production checklist?

A RAG checklist should include document permissions, source freshness, retrieval quality, answer grounding, citation behavior, vector and embedding security, answer-not-found behavior, evals with known answers, and monitoring for retrieval misses.

### What should be in an AI agent deployment checklist?

An AI agent checklist should include tool permissions, action limits, human approval gates, trace logging, task-level evals, stopping behavior, system prompt leakage checks, rollback for actions, and monitoring for excessive agency, unbounded consumption, or repeated tool failures.

### Who owns an AI deployment after launch?

Production ownership should be shared clearly between the workflow owner and the technical owner. The workflow owner owns business usefulness and adoption. The technical owner owns reliability, monitoring, security, and maintenance.

### When is an AI pilot not ready for production?

An AI pilot is not ready when there is no workflow owner, no eval set, unclear permissions, no failure-mode plan, no monitoring, no rollback path, no user adoption plan, or no handoff record.

## Why DeployGuild cares

This is where the checklist turns into a professional standard.

DeployGuild exists for the work between demo and production: the standard, artifacts, and professional judgment that make AI deployment reviewable.

The market does not need more AI prototypes that impress a conference room and die in the operating environment. It needs people who can take a real workflow, define the deployment standard, build the system, validate it, ship it, and leave proof behind.

That is Forward Deployed Engineering.

The demo asks whether AI can look useful. Deployment proves whether the system can be trusted with work.

## Sources

- OpenAI evals guide: <https://platform.openai.com/docs/guides/evals>
- OpenAI evaluation best practices: <https://platform.openai.com/docs/guides/evaluation-best-practices>
- OpenAI agent evals: <https://platform.openai.com/docs/guides/agent-evals>
- Anthropic evaluation tool: <https://docs.claude.com/en/docs/test-and-evaluate/eval-tool>
- OWASP Top 10 for LLM Applications: <https://genai.owasp.org/llm-top-10/>
- OWASP Agentic AI threats and mitigations: <https://genai.owasp.org/resource/agentic-ai-threats-and-mitigations/>
- OWASP Top 10 for Agentic Applications: <https://genai.owasp.org/2025/12/09/owasp-top-10-for-agentic-applications-the-benchmark-for-agentic-security-in-the-age-of-autonomous-ai/>
- NIST AI Risk Management Framework: <https://www.nist.gov/itl/ai-risk-management-framework>
- NIST Generative AI Profile: <https://www.nist.gov/itl/ai-risk-management-framework/generative-artificial-intelligence>
- Related DeployGuild guide: <https://deployguild.dev/blog/what-is-a-forward-deployed-engineer-ai-deployment-role>
