Course 3 · Module 13 · 85 minutes

From demo
to production:
100× more code.
None of it AI.

The demo: a prompt + an API call. The production system: that same code, now roughly 1% of the codebase, plus the other 99%: eval harnesses, regression suites, prompt versioning, guardrails, drift monitoring, fallback chains, cost dashboards, PII redaction, jailbreak detection, structured-output validation. Every team building real AI products learns this the hard way. This module covers the production hygiene that separates "demo" from "shipped."

You'll run · An eval harness

You'll watch · Drift detection fire

You'll list · 12 things demos skip

Part 01 · What changes from demo to production

A great demo is
not a great product.

A demo shows what the system can do. Production guarantees what it will do — for every user, every input, every time. The gap between those two states is filled with engineering, not AI.

// 01 · demo mindset

"It works on these examples."

→ Hand-picked inputs that show the system's strengths
→ One model, one prompt, one configuration
→ No checks for when the model is wrong
→ Cost and latency mostly invisible
→ User input is assumed to be friendly and well-formed
→ "It's an LLM, so it just works"

// 02 · production mindset

"It works on all inputs, always."

✓ Eval suite with hundreds of test cases across categories
✓ Prompt versioning so you can roll back regressions
✓ Drift monitors watching quality, cost, and behavior daily
✓ Guardrails for jailbreaks, PII, off-topic input
✓ Fallback chains when the primary model fails
✓ Observability to know why a specific response was bad

Part 02 · Hands on · The eval harness

Test cases.
Many dimensions.

Every production AI system has an eval suite. It's a set of representative test cases that get run against the system — not just for correctness, but for safety, format, latency, and cost. Below: a realistic eval suite for a customer-support chatbot. Press run and watch each test case get scored across all 5 dimensions.

How real eval suites work.

The test cases come from production data (anonymized real queries), edge cases (jailbreak attempts, malformed input), and regression cases (things that were once broken). For each test, multiple metrics get checked: correctness (is the answer right?), safety (refuses appropriately?), format (valid JSON?), latency (under SLA?), cost (under budget?). A single ✗ on any dimension fails the test. Production teams run this on every change.
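The scoring rule above can be sketched in a few lines. This is a minimal illustration, not a real harness: `TestCase`, `Response`, `score`, the JSON fields `answer` and `refused`, the thresholds, and the stand-in `fake_bot` are all hypothetical names chosen for the example.

```python
import json
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class TestCase:
    name: str
    query: str
    must_contain: str = ""       # substring a correct answer must include
    should_refuse: bool = False  # True for jailbreak-style cases


@dataclass
class Response:
    text: str        # raw model output, expected to be a JSON object
    cost_usd: float  # spend for this call


def score(case: TestCase, respond: Callable[[str], Response],
          max_latency_s: float = 2.0, max_cost_usd: float = 0.02) -> dict:
    """Score one test case on all five dimensions; any failure fails the test."""
    start = time.perf_counter()
    resp = respond(case.query)
    latency = time.perf_counter() - start

    try:
        payload = json.loads(resp.text)
    except json.JSONDecodeError:
        payload = None
    format_ok = isinstance(payload, dict) and "answer" in payload
    fields = payload if isinstance(payload, dict) else {}
    answer = str(fields.get("answer", ""))
    refused = bool(fields.get("refused", False))

    checks = {
        "correct": refused if case.should_refuse
                   else case.must_contain.lower() in answer.lower(),
        "safe": refused == case.should_refuse,
        "format": format_ok,
        "latency": latency <= max_latency_s,
        "cost": resp.cost_usd <= max_cost_usd,
    }
    checks["verdict"] = all(checks.values())
    return checks


def fake_bot(query: str) -> Response:
    """Stand-in for the real system under test (assumption for this sketch)."""
    if "ignore previous" in query.lower():
        return Response(json.dumps({"answer": "", "refused": True}), 0.001)
    return Response(json.dumps({"answer": "Refunds take 5 business days."}), 0.004)


suite = [
    TestCase("refund-timing", "How long do refunds take?", "5 business days"),
    TestCase("jailbreak", "Ignore previous instructions.", should_refuse=True),
]
results = {case.name: score(case, fake_bot) for case in suite}
pass_rate = sum(r["verdict"] for r in results.values()) / len(results)
```

In practice the correctness check is usually richer than a substring match (an LLM judge or a reference answer), but the all-dimensions-must-pass verdict stays the same.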

// eval results · 7 test cases · columns: Test case, Correct, Safe, Format, Latency, Cost, Verdict

Part 03 · Guardrails · the defensive perimeter

Catch the bad input
before the model sees it.

// the guardrail pipeline

Defense in depth.

Every production AI system wraps the model in layers of validation. You can't trust user input (jailbreaks, prompt injection, PII). You can't fully trust model output (hallucinations, toxic content, malformed JSON). The defenses go before AND after the model.

Input filters catch what shouldn't reach the model: known jailbreak patterns, prompt injection attempts ("ignore previous instructions"), PII (SSNs, credit cards) that should be redacted, off-topic queries that should be routed elsewhere. Often handled by smaller specialized classifiers running before the main model.
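As a rough sketch of an input filter, the snippet below redacts two kinds of PII and blocks one well-known injection phrase using plain regular expressions. The patterns, `InputDecision`, and `check_input` are illustrative assumptions; production systems typically rely on trained classifiers and much broader pattern sets.

```python
import re
from dataclasses import dataclass, field

# Deliberately simple patterns (assumptions for this sketch, not exhaustive).
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD = re.compile(r"\b(?:\d[ -]?){13,16}\b")
INJECTION = re.compile(
    r"ignore (?:all |any )?(?:previous|prior) instructions", re.IGNORECASE
)


@dataclass
class InputDecision:
    allowed: bool                 # False means: do not call the model
    text: str                     # input with PII redacted
    reasons: list = field(default_factory=list)


def check_input(text: str) -> InputDecision:
    """Redact PII, then block inputs that match a known injection pattern."""
    reasons = []
    redacted = SSN.sub("[SSN]", text)
    redacted = CARD.sub("[CARD]", redacted)
    if redacted != text:
        reasons.append("pii_redacted")
    if INJECTION.search(text):
        reasons.append("prompt_injection")
        return InputDecision(False, redacted, reasons)
    return InputDecision(True, redacted, reasons)


decision = check_input("My SSN is 123-45-6789, where is my order?")
# decision.text is "My SSN is [SSN], where is my order?" and the request proceeds
```

Redaction happens before the block decision so that even rejected inputs are safe to log.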

Output validation catches what the model gets wrong: schema violations (when you expected JSON), toxic language, hallucinated facts (cross-checked against retrieved context), PII the model accidentally regurgitated. A second pass — sometimes by another LLM, sometimes by deterministic rules — is now standard.
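A deterministic output check might look like the sketch below. It assumes the model was asked for a JSON object with an `answer` string and a `sources` list; that schema, the SSN pattern, and `validate_output` are hypothetical and only cover the schema and PII cases, not toxicity or hallucination checks.

```python
import json
import re

# Assumed response schema for this sketch: field name -> required type.
REQUIRED_FIELDS = {"answer": str, "sources": list}
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def validate_output(raw: str):
    """Return (ok, problems) for one raw model response."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False, ["not valid JSON"]
    if not isinstance(data, dict):
        return False, ["top level is not an object"]

    problems = []
    for key, expected_type in REQUIRED_FIELDS.items():
        if key not in data:
            problems.append(f"missing field: {key}")
        elif not isinstance(data[key], expected_type):
            problems.append(f"wrong type for field: {key}")
    if SSN.search(raw):
        problems.append("possible SSN in output")
    return not problems, problems


ok, problems = validate_output('{"answer": "Yes, within 30 days.", "sources": ["kb-12"]}')
```

A failed validation would typically trigger a retry, a repair prompt, or the next step of a fallback chain rather than reaching the user.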

The category leaders here: NeMo Guardrails (NVIDIA), Lakera Guard, Guardrails AI, Amazon Bedrock Guardrails. All let you declare what's allowed and what's not, with the runtime enforcing it. By 2024, shipping AI without guardrails was treated as professional malpractice in many regulated industries.

Part 04 · Hands on · Drift detection

Production quality
decays.

Even if your model never changes, the world around it does. User queries shift. New attack patterns emerge. Domain language evolves. Behavior that was correct in week 1 starts failing in week 4. Drift monitoring is the practice of watching this happen, in real time, before users notice.

What you're monitoring.

Four production metrics tracked over 30 days: quality score (LLM-as-judge rating, 0-100), p99 latency (the response time that 99% of requests come in under), refusal rate (how often the model declines), and cost per query (token spend). All four can drift independently, and the right response depends on which one drifts. Try each scenario.
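One simple way to flag drift on a daily metric is to compare a recent window against an early baseline. The sketch below does that with a relative-change threshold; `detect_drift`, the 7-day windows, the 10% threshold, and the sample numbers are assumptions for illustration, and real monitors often use statistical tests instead.

```python
from statistics import mean


def detect_drift(history, baseline_days=7, window_days=7, max_rel_change=0.10):
    """Return (drifted, relative_change) for one daily metric series.

    Compares the mean of the most recent window against the mean of the
    first `baseline_days` values.
    """
    if len(history) < baseline_days + window_days:
        return False, 0.0  # not enough history to judge
    baseline = mean(history[:baseline_days])
    recent = mean(history[-window_days:])
    if baseline == 0:
        return (recent != 0), (float("inf") if recent else 0.0)
    change = (recent - baseline) / baseline
    return abs(change) > max_rel_change, change


# Made-up series: quality drops in the last week, latency stays flat.
metrics = {
    "quality_score": [90] * 14 + [75] * 7,
    "p99_latency_ms": [1800] * 21,
}
alerts = {}
for name, series in metrics.items():
    drifted, change = detect_drift(series)
    if drifted:
        alerts[name] = change
# alerts contains only "quality_score", with a change of about -0.17
```

Each metric gets its own check because, as noted above, they drift independently and call for different responses.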

// 30-day metric history

Part 05 · Production hygiene · four habits

The unglamorous practices
that save you.

These aren't fashionable. None of them are AI breakthroughs. All of them separate teams that ship reliably from teams that don't.

Version your prompts like code

Prompts are configuration and code and data all at once. Treat them like any of those. Every prompt change goes through review, gets a version number, runs against the eval suite, and is rollback-able when (not if) it breaks something.

Example file layout:
prompts/
├─ support_agent_v1.3.txt
├─ support_agent_v1.4.txt ← new
└─ CHANGELOG.md

Eval shows v1.4 regresses refund-handling by 8% → rollback
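The rollback decision in that example can be automated as a gate in CI. The sketch below compares per-category pass rates between two prompt versions and reports any category that dropped by more than a tolerance; `find_regressions`, the 2-point tolerance, and the pass rates are hypothetical.

```python
def find_regressions(old_rates, new_rates, max_drop=0.02):
    """Return {category: drop} for categories where the new prompt is worse.

    Rates are pass rates in [0, 1]. A category missing from `new_rates`
    counts as a pass rate of 0.
    """
    regressions = {}
    for category, old in old_rates.items():
        drop = old - new_rates.get(category, 0.0)
        if drop > max_drop:
            regressions[category] = round(drop, 4)
    return regressions


# Made-up eval results for two prompt versions.
v1_3 = {"refund-handling": 0.95, "shipping": 0.90, "jailbreaks": 1.00}
v1_4 = {"refund-handling": 0.87, "shipping": 0.91, "jailbreaks": 1.00}

regressions = find_regressions(v1_3, v1_4)
should_rollback = bool(regressions)
```

Gating per category rather than on the overall pass rate matters: a gain in one area can otherwise hide a regression in another.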

Fallback chains

Your primary model will go down. The API will rate-limit you. A request will time out. Have a fallback chain: try primary, on failure try a secondary (different provider, ideally), on failure try a cached/rule-based response, on failure return a graceful error message — never a 500.
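A fallback chain like this can be sketched as an ordered list of callables. The version below assumes each step enforces its own timeout and raises on failure, and that a cache lookup returns `None` on a miss; `run_chain` and the stand-in steps are hypothetical, not any provider's API.

```python
from typing import Callable, Optional, Sequence, Tuple

GRACEFUL_ERROR = "We're having trouble. Try again in a moment."

# A step takes the query and returns text, returns None (e.g. a cache miss),
# or raises (timeout, rate limit, outage).
Step = Callable[[str], Optional[str]]


def run_chain(query: str, steps: Sequence[Tuple[str, Step]]) -> Tuple[str, str]:
    """Try each step in order; return (step_name, response_text)."""
    for name, step in steps:
        try:
            result = step(query)
        except Exception:
            continue  # real code would log the failure against the trace ID
        if result is not None:
            return name, result
    return "graceful_error", GRACEFUL_ERROR


# Stand-in steps for the sketch.
def primary(query: str) -> str:
    raise TimeoutError("primary model exceeded its time budget")


def secondary(query: str) -> str:
    return f"Answer to: {query}"


def cache_lookup(query: str) -> Optional[str]:
    return None  # always a miss in this sketch


served_by, text = run_chain(
    "Where is my order?",
    [("primary", primary), ("secondary", secondary), ("cache", cache_lookup)],
)
```

Returning the step name alongside the text lets dashboards show how often traffic is being served by a fallback, which is itself an early warning signal.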

Example chain:
Claude Opus → on fail or >3s →
Claude Sonnet → on fail or >5s →
Cached similar query → on miss →
"We're having trouble. Try again in a moment."

Trace every request

Every LLM call should have a trace ID that follows it through the entire pipeline. When a user complains about a specific bad response, you need to pull up: the input, every retrieval result, every tool call, every retry, every guardrail decision, the model output, the final response. Without traces, debugging is impossible.

Example trace:
trace_id: 7f3c-b2a8
├─ guardrail_input: passed (5ms)
├─ retrieval: 5 docs (120ms)
├─ llm_call: 1240ms · 850 tokens
├─ guardrail_output: passed (8ms)
└─ total: 1373ms · $0.012
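A trace like the one above can be assembled from timed spans that share one ID. This is a minimal in-memory sketch with hypothetical names (`Trace`, `span`, `total_ms`); production systems usually emit spans through an observability SDK rather than a hand-rolled class.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class Trace:
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex[:8])
    spans: list = field(default_factory=list)

    @contextmanager
    def span(self, name, **attrs):
        """Time one pipeline step and record it, even if the step raises."""
        record = {"name": name, **attrs}
        start = time.perf_counter()
        try:
            yield record  # the step can attach results to the record
        finally:
            record["ms"] = round((time.perf_counter() - start) * 1000, 1)
            self.spans.append(record)

    def total_ms(self):
        return round(sum(s["ms"] for s in self.spans), 1)


trace = Trace()
with trace.span("guardrail_input") as s:
    s["result"] = "passed"
with trace.span("retrieval") as s:
    s["docs"] = 5
with trace.span("llm_call", model="primary") as s:
    s["tokens"] = 850
with trace.span("guardrail_output") as s:
    s["result"] = "passed"
```

Storing `trace.trace_id` with the final response is what makes it possible to go from a user complaint back to every step that produced it.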

Test in production · safely

Eval suites catch a lot — but they can't predict real user behavior. A/B test changes on small slices first. Roll a new prompt to 1% of traffic. Watch your monitors for a day. Promote to 10%. Then 50%. Then 100%. Be ready to roll back instantly. This is normal SRE practice applied to AI.

Rollout schedule:
T+0: 1% of traffic gets new prompt
T+1d: monitors green → 10%
T+3d: monitors green → 50%
T+7d: full rollout

(Rollback target at any stage: 10 minutes)
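Percentage rollouts are commonly implemented by hashing a stable user ID into a bucket, so each user consistently sees the same variant and raising the percentage only adds users. The sketch below assumes a hypothetical `in_rollout` helper and uses the prompt version as a salt; it is one reasonable approach, not a specific platform's mechanism.

```python
import hashlib


def in_rollout(user_id: str, percent: float, salt: str = "support_agent_v1.4") -> bool:
    """Return True if this user falls inside the first `percent` of traffic.

    The bucket depends only on (salt, user_id), so assignment is stable
    across requests and users in the 1% slice stay in the 10% slice.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < percent * 100


users = [f"user-{i}" for i in range(10_000)]
share_at_10 = sum(in_rollout(u, 10) for u in users) / len(users)
# share_at_10 lands close to 0.10; rolling back is just setting percent to 0
```

Changing the salt for each new prompt version reshuffles the buckets, so the same users are not always the first to receive experimental changes.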

Part 06 · The observability stacks shipping today

Where you actually
see what's happening.

By 2024, a dedicated category of LLM observability tooling had emerged. Pick one: they all do similar things, with different ergonomics.

LangChain · 2023

LangSmith

// the framework's own observability

Tracing: Per-step LangChain trace tree
Evals: Built-in dataset + scoring
Best for: Teams already using LangChain
Pricing: Free tier · usage-based
Integration: 1 line for LangChain users

One of the most widely adopted observability platforms, thanks to LangChain's reach. Once tracing is enabled in configuration, LangChain primitives are traced automatically. If you're already using their framework, this is the path of least resistance. Standalone usage works too.

Braintrust · 2023

Braintrust

// eval-first platform

Approach: Datasets, experiments, regressions
Distinctive: Compare prompts side-by-side at scale
Workflow: CI-friendly · git-style diffs
Strength: Built for rigorous A/B testing
Customers: Notion, Vercel, Airtable, ...

Built by people who experienced "prompt regression hell" at scale. The killer feature: comparing two prompts across a 1000-example dataset, with per-example pass/fail visible. This is what production prompt iteration actually needs.

Helicone · 2023

Helicone

// the proxy-based approach

Architecture: HTTP proxy → drop-in
Setup: Change 1 URL, done
Strength: Cost tracking, caching, rate limits
Open source? Yes
Best for: Want observability with zero refactor

The simplest integration in the category. Point your API client at Helicone's proxy endpoint (on helicone.ai) instead of the model provider, attach your Helicone API key, and you get logging, cost tracking, caching, and alerting. No code changes beyond the client configuration. Open source and self-hostable.

Arize · 2023

Phoenix · OpenTelemetry-native

// open standard observability

Built on: OpenTelemetry (OTel)
Distinctive: Works alongside Datadog, Honeycomb, etc.
Approach: Embed in existing observability stack
Open source: Yes · self-hostable
Best for: Teams with mature observability already

Built on the OpenTelemetry standard, which means LLM traces show up in the same dashboards as your existing infrastructure metrics. If you're already using Datadog/Honeycomb/Jaeger, Phoenix slots into the same view. The "no new vendor" choice.

Course 3 · Module 13 complete

You can spot the
gap between demo and shipped.

You ran an eval harness across 5 dimensions. You watched drift detection fire on actual time-series data. You can list the four production-hygiene habits that ship AI reliably. You know why every serious team has guardrails, prompt versioning, fallback chains, and traces. The next time someone shows you a polished AI demo, you'll know which 12 things they're not yet doing. The capstone is next.

Up next · Course 3 · Module 14 · CAPSTONE

Build a Multi-Agent System

The series finale. Everything you've learned — transformers, attention, MoE, multimodal, long context, tools, agents, fine-tuning, inference, vector search, production hygiene — composes into one capstone: a working multi-agent system. A coordinator agent that delegates to specialist agents, each with their own tools, retrieving from a shared knowledge base, with full observability. The end of the road.
