Course 2 · Module 11 · 70 minutes

From notebook to a real production API.

A model in a Jupyter notebook helps nobody. To create value, the model has to run when users call it — reliably, at scale, with proper error handling. This module is the bridge: from model.predict() in a cell to a live HTTPS endpoint at api.yourcompany.com/predict. You'll watch a real API evolve from naive to production-grade in three stages.

You'll build

A FastAPI service

You'll test

Real HTTP requests

You'll score

Your deploy readiness

Ship it

Part 01 · The gap

What works in a notebook breaks in production.

A Jupyter notebook is a fantastic place to develop. It is a terrible place to serve users. Four things change the moment your model has to handle real traffic.

Concurrency

One user vs. one thousand.

A notebook serves one user (you). A production API needs to handle hundreds of simultaneous requests without falling over.

Notebook: You run a cell. Wait. See output.

Production: 10 requests/s · auto-scaling · queues
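To make the concurrency point concrete, here is a minimal sketch that fires many simulated requests at once while capping how many reach the model at the same time. The cap `MAX_IN_FLIGHT`, the function names, and the 10 ms sleep standing in for inference are all assumptions for illustration, not part of the course's service.

```python
import asyncio

MAX_IN_FLIGHT = 4  # hypothetical cap on simultaneous model calls


async def handle_request(request_id: int, gate: asyncio.Semaphore, stats: dict) -> str:
    """Simulate one prediction request, waiting at the gate if the service is busy."""
    async with gate:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        await asyncio.sleep(0.01)  # stands in for model inference time
        stats["active"] -= 1
    return f"request {request_id}: ok"


async def serve(n_requests: int) -> tuple[list[str], int]:
    """Fire n_requests at once; return the responses and the peak concurrency seen."""
    gate = asyncio.Semaphore(MAX_IN_FLIGHT)
    stats = {"active": 0, "peak": 0}
    responses = await asyncio.gather(
        *(handle_request(i, gate, stats) for i in range(n_requests))
    )
    return list(responses), stats["peak"]


responses, peak = asyncio.run(serve(20))
print(len(responses), "responses, peak concurrency", peak)
```

Requests beyond the cap wait in line instead of overwhelming the model, which is the job a queue does in a real deployment.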

Reliability

Crashes are catastrophic.

In a notebook, a crash just means re-run the cell. In production, a crash means downtime — and possibly waking someone up at 3am.

Notebook: Bad input → exception → restart kernel

Production: Validate everything · return proper errors · stay up

Observability

You can't watch every prediction.

You can eyeball notebook outputs. You can't eyeball a million daily predictions. You need logs, metrics, and alerts that tell you when something's wrong.

Notebook: Print statements · visual inspection

Production: Structured logs · dashboards · alerting
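One way to move from print statements to structured logs is a JSON formatter on top of Python's standard `logging` module. This is a sketch under assumptions: the logger name `predict_api`, the `fields` key passed through `extra`, and the logged values are all illustrative.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "time": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Extra key/value pairs arrive via logger.info(..., extra={"fields": {...}}).
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def build_logger(name: str = "predict_api") -> logging.Logger:
    """Create a logger that writes JSON lines to stdout (hypothetical logger name)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    return logger


logger = build_logger()
logger.info(
    "prediction",
    extra={"fields": {"input": [5.1, 3.5, 1.4, 0.2], "output": "setosa", "latency_ms": 12.5}},
)
```

Because every line is valid JSON, a log platform can filter and aggregate on fields such as `output` or `latency_ms` instead of parsing free text.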

Reproducibility

"It worked on my machine" doesn't fly.

The model that worked in your notebook had specific Python, library, and OS versions. Production needs to recreate that environment exactly — every time.

Notebook: conda env · works for you · today

Production: Dockerfile · requirements pinned · works always
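As a small illustration of what "requirements pinned" means, the sketch below scans the text of a requirements file and reports lines without an exact `==` pin. The helper name and the example file contents are hypothetical, and the check is deliberately simple: it does not understand environment markers or URL requirements.

```python
import re

# A name, an optional [extras] group, then an exact '==' version.
PINNED = re.compile(r"^[A-Za-z0-9._-]+(\[[^\]]+\])?==[^=\s]+$")


def unpinned_requirements(requirements_text: str) -> list[str]:
    """Return requirement lines that lack an exact '==' version pin."""
    loose = []
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options like -r
            continue
        if not PINNED.match(line):
            loose.append(line)
    return loose


# Illustrative file contents, not the course's actual requirements.
example = """
fastapi==0.110.0
scikit-learn
uvicorn>=0.20
joblib==1.3.2  # exact pin
"""
print(unpinned_requirements(example))  # ['scikit-learn', 'uvicorn>=0.20']
```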

Part 02 · Hands on · Build an API in 3 stages

Watch an API evolve from naive to production-grade.

Below: a FastAPI service that wraps an Iris classifier, shown in three stages of growing sophistication. Switch between them, see the code change, and send simulated HTTP requests to feel what happens at each stage. Bonus: in Stage 1, try sending broken input. The server crashes. That's the whole point: you'll fix it in Stage 2.

How to use it.

Pick a stage at the top, and the code panel updates to show that stage's implementation. Then on the right, pick an endpoint, choose a body preset (or write your own JSON), and click Send Request. The simulated server runs the stage's logic and returns an HTTP-style response with a status code, latency, and explanatory notes.


Part 03 · The 5-stage deployment pipeline

Five steps from commit to live URL.

Every ML deployment, regardless of platform, goes through these five stages. Modern tools collapse several stages into one click, but knowing the underlying flow lets you debug effectively when something breaks.

Serialize

Save the trained model to disk so the API can load it. Pickle or joblib for scikit-learn models, ONNX for cross-framework portability, GGUF for LLMs.

// joblib · pickle · ONNX
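A minimal sketch of the serialize step using the standard-library `pickle` module. The dictionary of thresholds is a hypothetical stand-in for a trained estimator, and the file name `model.pkl` is an assumption; a fitted scikit-learn model would be saved and loaded the same way.

```python
import pickle
import tempfile
from pathlib import Path


def save_model(model: object, path: Path) -> None:
    """Write the model object to disk in pickle format."""
    with path.open("wb") as f:
        pickle.dump(model, f)


def load_model(path: Path) -> object:
    """Load a model from disk.

    Only unpickle files you created yourself: pickle can run arbitrary code on load.
    """
    with path.open("rb") as f:
        return pickle.load(f)


# Hypothetical stand-in for a trained estimator.
model = {"petal_length_cut": 2.5, "petal_width_cut": 1.8}

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "model.pkl"
    save_model(model, model_path)
    restored = load_model(model_path)

print(restored == model)  # True
```

The API process then calls the load function once at startup rather than retraining or reloading on every request.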

Wrap

Put the model behind an HTTP API. The whole world calls your model the same way: an HTTP request to your endpoint.

// FastAPI · Flask · LitServe

Containerize

Package the code, model, and dependencies into a Docker image. The same image runs everywhere: your laptop, AWS, a friend's cluster.

// Dockerfile · docker build

Deploy

Push the image to a hosting platform that runs it on demand. The platform handles HTTPS, scaling, restarts.

// HF Spaces · Railway · AWS

Monitor

Watch latency, errors, and prediction quality post-deployment. Set up alerts so you know before users complain.

// Datadog · Sentry · Grafana
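The two numbers most dashboards start with, p95 latency and error rate, are simple to compute. The sketch below uses the nearest-rank definition of a percentile and counts 5xx responses as errors; both choices are assumptions, and monitoring tools may define them slightly differently.

```python
def p95(latencies_ms: list[float]) -> float:
    """Nearest-rank 95th percentile: the smallest value with at least 95% of samples at or below it."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    rank = (95 * len(ordered) + 99) // 100  # integer ceiling of 0.95 * n
    return ordered[rank - 1]


def error_rate(status_codes: list[int]) -> float:
    """Fraction of responses with a 5xx (server error) status code."""
    if not status_codes:
        return 0.0
    return sum(1 for code in status_codes if code >= 500) / len(status_codes)


latencies = [float(ms) for ms in range(1, 101)]  # 1 ms .. 100 ms
print(p95(latencies))                      # 95.0
print(error_rate([200, 200, 500, 422]))    # 0.25
```

Note that a 422 is not counted as an error here: it means the service correctly rejected bad input.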

Part 04 · Where to deploy

Six platforms. Pick the one that fits.

Deployment platforms have multiplied in recent years. These six cover most use cases, from "free demo this weekend" to "100M predictions a day."

Beginner · Free

Hugging Face Spaces

huggingface.co/spaces

Free hosting for ML demos. Push your Gradio or Streamlit app, get a public URL. Perfect for portfolios and quick prototypes.

Best for: Demos, prototypes

Cost: Free

Setup time: 10 minutes

Scales to: Small traffic

Beginner · Free

Streamlit Cloud

share.streamlit.io

Free hosting for Streamlit apps. Connect to GitHub, push code, deploy. Excellent for data dashboards and internal tools.

Best for: Dashboards, internal tools

Cost: Free tier generous

Setup time: 5 minutes

Scales to: Medium traffic

Intermediate · Paid

Railway / Render

railway.app · render.com

Push any Dockerfile or Python service, get a URL. No Kubernetes hell. Autoscaling, HTTPS, custom domains included.

Best for: Real APIs

Cost: $5+ per month

Setup time: 30 minutes

Scales to: Real production

Intermediate · Paid

Modal / Replicate

modal.com · replicate.com

Serverless GPU-backed ML hosting. Containers start on demand (with a cold-start delay on the first request) and scale to zero when idle. Pay per second of GPU time.

Best for: GPU inference, batch jobs

Cost: Per-second GPU usage

Setup time: 1 hour

Scales to: Massive bursts

Advanced · Cloud

AWS / GCP / Azure

SageMaker · Vertex · Azure ML

Full hyperscaler ML platforms. Everything you'd need at any scale — but with significant complexity. Most enterprises end up here.

Best for: Enterprise scale

Cost: Variable, often expensive

Setup time: Days to weeks

Scales to: Anything

Advanced · DIY

Self-hosted / k8s

Your own server · Kubernetes

Maximum control, maximum responsibility. Run on your own hardware or VPS. Use when costs, compliance, or latency demand it.

Best for: Privacy, custom needs

Cost: Hardware + your time

Setup time: Days+

Scales to: Whatever you build

Part 05 · Hands on · Are you ready?

12 questions. Score your deploy-readiness.

Check off what you've actually done for a model you want to ship. The verdict at the top updates live. If you score below 7, your weekend will be longer than you planned.

Deployment readiness

0 / 12

Not ready

// Model

Model is serialized to disk

joblib, pickle, ONNX, or similar — can be loaded outside the training notebook

Model version is tracked

Git tag, MLflow run ID, or Weights & Biases run, so you can roll back

Training data is versioned

DVC, snapshots, or at minimum a checksum — so retraining is reproducible

// API

Input validation (Pydantic/schemas)

Bad inputs return 422 errors, not 500 crashes

Proper error handling

try/except for predictable failures, appropriate HTTP status codes

/health endpoint exists

Lets your platform's load balancer know the service is alive

// Infrastructure

Containerized (Dockerfile)

Same image runs on your laptop and in production — zero "works on my machine"

Config via environment variables

No hardcoded API keys, DB URLs, or model paths in your source

Pinned dependencies

requirements.txt or pyproject.toml with exact versions — not "scikit-learn" but "scikit-learn==1.3.2"
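For the "config via environment variables" item, a minimal sketch might look like the following. The variable names `API_KEY`, `MODEL_PATH`, and `LOG_LEVEL` and their defaults are hypothetical; the point is that secrets have no default and a missing one fails loudly at startup.

```python
import os
from collections.abc import Mapping
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    model_path: str
    log_level: str
    api_key: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from environment variables (variable names are hypothetical)."""
    api_key = env.get("API_KEY")
    if not api_key:
        # Fail at startup rather than on the first request that needs the key.
        raise RuntimeError("API_KEY is not set")
    return Settings(
        model_path=env.get("MODEL_PATH", "model.pkl"),
        log_level=env.get("LOG_LEVEL", "INFO"),
        api_key=api_key,
    )


# Passing a plain dict keeps the example independent of the real environment.
settings = load_settings({"API_KEY": "dev-only-placeholder"})
print(settings.model_path, settings.log_level)  # model.pkl INFO
```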

// Monitoring

Structured logs (inputs + outputs)

Can replay any prediction, debug any complaint — JSON logs to stdout

Latency & error rate tracked

A dashboard answers: p95 latency? error rate? requests/second?

Alerts on anomalies

Get paged before users notice — error rate spike, latency spike, model drift
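An error-rate alert can be as simple as a sliding window over recent responses. The sketch below is an illustration under assumptions: the class name, the window size, and the threshold are hypothetical, and a real setup would usually configure this rule in a monitoring tool rather than in application code.

```python
from collections import deque


class ErrorRateAlert:
    """Track the last `window` responses and flag when the 5xx share exceeds `threshold`."""

    def __init__(self, window: int = 100, threshold: float = 0.05) -> None:
        self.outcomes: deque[bool] = deque(maxlen=window)
        self.threshold = threshold

    def record(self, status_code: int) -> bool:
        """Record one response; return True if the alert should fire."""
        self.outcomes.append(status_code >= 500)
        window_full = len(self.outcomes) == self.outcomes.maxlen
        rate = sum(self.outcomes) / len(self.outcomes)
        # Wait for a full window so a single early failure does not page anyone.
        return window_full and rate > self.threshold


alert = ErrorRateAlert(window=10, threshold=0.2)
statuses = [200] * 8 + [500] * 4
fired = [alert.record(code) for code in statuses]
print(fired.index(True))  # 10: fires on the 11th response, when 3 of the last 10 failed
```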

Course 2 · Module 11 complete

You can now ship what you build.

You watched a real API mature from naive to production-grade. You know what makes deployments reliable — and what makes them brittle. You can score your own projects against a 12-point readiness rubric. The notebook-to-production gap is no longer a mystery. It's just a checklist.

Up next · Course 2 · Module 12 · CAPSTONE

Your Own End-to-End ML Project

The grand finale. You'll synthesize every skill from this course into one complete project: pick a problem, prepare data, train a model, evaluate it, deploy it. The capstone is where the course modules become a portfolio piece.

Continue to Capstone
