A model in a Jupyter notebook helps nobody. To create value, the model has to run when users call it — reliably, at scale, with proper error handling. This module is the bridge: from model.predict() in a cell to a live HTTPS endpoint at api.yourcompany.com/predict. You'll watch a real API evolve from naive to production-grade in three stages.
A Jupyter notebook is a fantastic place to develop. It is a terrible place to serve users. Four things change the moment your model has to handle real traffic.
A notebook serves one user (you). A production API needs to handle hundreds of simultaneous requests without falling over.
In a notebook, a crash just means re-run the cell. In production, a crash means downtime — and possibly waking someone up at 3am.
You can eyeball notebook outputs. You can't eyeball a million daily predictions. You need logs, metrics, and alerts that tell you when something's wrong.
The model that worked in your notebook had specific Python, library, and OS versions. Production needs to recreate that environment exactly — every time.
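One way to make the environment point concrete is to record versions at training time and compare them at serving time. The sketch below uses only the standard library; the function names `snapshot_environment` and `environment_mismatches` are hypothetical, and in practice a lock file or container image usually does this job.

```python
import platform
from importlib import metadata


def snapshot_environment(packages):
    """Record the Python, OS, and package versions of the current process."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed here
    return {
        "python": platform.python_version(),
        "os": platform.platform(),
        "packages": versions,
    }


def environment_mismatches(recorded, current):
    """Return human-readable differences between two snapshots."""
    problems = []
    if recorded["python"] != current["python"]:
        problems.append(
            f"python: trained on {recorded['python']}, running {current['python']}"
        )
    for name, version in recorded["packages"].items():
        running = current["packages"].get(name)
        if running != version:
            problems.append(f"{name}: trained on {version}, running {running}")
    return problems


# At training time: save this snapshot next to the model file.
# The package list is an assumption; list whatever your model imports.
training_env = snapshot_environment(["numpy", "scikit-learn"])

# At serving time: take a fresh snapshot and compare before loading the model.
serving_env = snapshot_environment(["numpy", "scikit-learn"])
mismatches = environment_mismatches(training_env, serving_env)
```

An empty `mismatches` list means the serving process matches the recorded versions; anything else is worth failing fast on at startup.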
Below: a FastAPI service that wraps an Iris classifier, shown in three stages of growing sophistication. Switch between them, see the code change, and send simulated HTTP requests to feel what happens at each stage. Bonus: in Stage 1, try sending broken input. The server crashes. That's the whole point: you'll fix it in Stage 2.
Pick a stage at the top — the code panel updates to show that stage's implementation. Then on the right, pick an endpoint, choose a body preset (or write your own JSON), and click Send Request. The simulated server runs the actual logic and returns a real HTTP-style response with status codes, latency, and explanatory notes.
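The difference between a handler that crashes on broken input and one that rejects it cleanly can be sketched without any web framework. This is not the demo's actual code: `classify` is a hypothetical rule standing in for a trained Iris model, the function names are made up for illustration, and the status codes (400 for unparseable JSON, 422 for a well-formed body with the wrong shape) are one reasonable choice among several.

```python
import json
from typing import Tuple

CLASS_NAMES = ["setosa", "versicolor", "virginica"]
N_FEATURES = 4  # sepal length, sepal width, petal length, petal width


def classify(features):
    """Hypothetical stand-in for model.predict: a rule on petal length (cm)."""
    petal_length = features[2]
    if petal_length < 2.5:
        return 0
    return 1 if petal_length < 5.0 else 2


def predict_stage1(raw_body: str) -> dict:
    """Naive handler: trusts the input, so malformed input raises."""
    payload = json.loads(raw_body)
    return {"species": CLASS_NAMES[classify(payload["features"])]}


def predict_stage2(raw_body: str) -> Tuple[int, dict]:
    """Validating handler: always returns (status_code, body)."""
    try:
        payload = json.loads(raw_body)
    except json.JSONDecodeError:
        return 400, {"error": "body is not valid JSON"}

    features = payload.get("features") if isinstance(payload, dict) else None
    if not isinstance(features, list) or len(features) != N_FEATURES:
        return 422, {"error": f"'features' must be a list of {N_FEATURES} numbers"}

    # bool is a subclass of int in Python, so exclude it explicitly.
    for value in features:
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return 422, {"error": "every feature must be a number"}

    return 200, {"species": CLASS_NAMES[classify(features)]}


good = '{"features": [5.1, 3.5, 1.4, 0.2]}'
broken = '{"features": [5.1]}'

status_ok, body_ok = predict_stage2(good)
status_bad, body_bad = predict_stage2(broken)
```

Calling `predict_stage1(broken)` raises an `IndexError`, which in a live server surfaces as an unhandled error; `predict_stage2(broken)` returns a 422 with a message the caller can act on.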
Every ML deployment, regardless of platform, goes through these five stages. Modern tools collapse multiple stages into one click, but knowing the underlying flow lets you debug effectively when something breaks.
Save the trained model to disk so the API can load it. Common formats: pickle (or joblib) for scikit-learn, ONNX for cross-framework portability, GGUF for LLMs. Only unpickle files you trust, because loading a pickle can execute arbitrary code.
Put the model behind an HTTP API. The whole world calls your model the same way: an HTTP request to your endpoint.
Package the code, model, and dependencies into a Docker image. The same image runs everywhere: your laptop, AWS, a friend's cluster.
Push the image to a hosting platform that runs it on demand. The platform handles HTTPS, scaling, restarts.
Watch latency, errors, and prediction quality post-deployment. Set up alerts so you know before users complain.
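The serialize, serve, and monitor stages above can be sketched end to end with the standard library. Several things here are assumptions rather than the module's method: the "model" is a hypothetical dict of made-up weights, Python's built-in `http.server` is swapped in for FastAPI to keep the example dependency-free, and monitoring is reduced to a list of latencies.

```python
import json
import pickle
import tempfile
import threading
import time
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer
from pathlib import Path

# Hypothetical "trained model": learned parameters only (made-up numbers).
trained = {"weights": [0.5, -0.2, 1.0, 0.8], "bias": -2.0}


def predict(model, features):
    """Linear score thresholded at zero; raises on malformed features."""
    if len(features) != len(model["weights"]):
        raise ValueError("wrong number of features")
    score = sum(w * x for w, x in zip(model["weights"], features)) + model["bias"]
    return int(score > 0)


# Serialize, then load the way an API process would at startup.
with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "model.pkl"
    with open(model_path, "wb") as f:
        pickle.dump(trained, f)
    with open(model_path, "rb") as f:
        MODEL = pickle.load(f)  # only unpickle files you trust

LATENCIES_MS = []  # monitoring at its simplest: one entry per request


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        start = time.perf_counter()
        length = int(self.headers.get("Content-Length", 0))
        raw = self.rfile.read(length)
        if self.path != "/predict":
            status, body = 404, {"error": "not found"}
        else:
            try:
                features = json.loads(raw)["features"]
                status, body = 200, {"prediction": predict(MODEL, features)}
            except (ValueError, KeyError, TypeError):
                status, body = 422, {"error": "expected {'features': [numbers]}"}
        data = json.dumps(body).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)
        LATENCIES_MS.append((time.perf_counter() - start) * 1000)

    def log_message(self, *args):
        pass  # silence the default per-request log line


# Serve on a free local port in a background thread, then call it once.
server = HTTPServer(("127.0.0.1", 0), PredictHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

request = urllib.request.Request(
    f"http://127.0.0.1:{server.server_port}/predict",
    data=json.dumps({"features": [5.1, 3.5, 1.4, 0.2]}).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as response:
    result = json.loads(response.read())

server.shutdown()
server.server_close()
```

This server is single-threaded and speaks plain HTTP on localhost. A production setup would add a real framework, concurrency, HTTPS from the hosting platform, and metrics shipped somewhere that can alert on them.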
Deployment platforms have multiplied in the last five years. These six cover the large majority of cases, from "free demo this weekend" to "100M predictions a day."
Free hosting for ML demos. Push your Gradio or Streamlit app, get a public URL. Perfect for portfolios and quick prototypes.
Free hosting for Streamlit apps. Connect to GitHub, push code, deploy. Excellent for data dashboards and internal tools.
Push any Dockerfile or Python service, get a URL. No Kubernetes hell. Autoscaling, HTTPS, custom domains included.
Serverless GPU-backed ML hosting. Starts containers on demand (expect cold-start latency on the first request) and scales to zero when idle. Pay per second of GPU time.
Full hyperscaler ML platforms. Everything you'd need at any scale — but with significant complexity. Most enterprises end up here.
Maximum control, maximum responsibility. Run on your own hardware or VPS. Use when costs, compliance, or latency demand it.
Check off what you've actually done for a model you want to ship. The verdict at the top updates live. If you score below 7 out of 12, your weekend will be longer than you planned.
Aim for 4 out of 5. Each wrong answer comes with an explanation.
You watched a real API mature from naive to production-grade. You know what makes deployments reliable — and what makes them brittle. You can score your own projects against a 12-point readiness rubric. The notebook-to-production gap is no longer a mystery. It's just a checklist.