AI Skill Course Course 2 · CAPSTONE
Module 12 of 12

One project. Everything you've learned.

A telecom company wants to predict which customers will churn — so they can intervene before losing revenue. You'll handle real-world tabular data, clean it, train three classifiers, compare them, tune the decision threshold against business value, and wrap the result as a callable production function. All of this runs in your browser using real scikit-learn. No notebooks, no setup, just the work.

Pipeline stages: 5
Real sklearn: 100% live
Skills synthesized: 11 modules
[Figure: The full ML lifecycle. 01 load (data) → 02 clean (prep) → 03 fit (train) → 04 test (eval) → 05 ship. "From raw csv → live prediction function": end-to-end, in your browser, real sklearn.]
Part 01 · The brief

Predict who's about to churn.

The setup

You're a data scientist at NorthWind Telecom. The retention team comes to you with a clear problem: roughly 27% of new customers cancel within their first year. Acquiring a replacement customer costs about $500 in marketing. Retaining an existing one is almost free by comparison, if you can identify who is at risk in time.

They give you a snapshot of 1,000 historical customers and ask you to build a system that flags who's likely to churn next month — so retention specialists can call those customers first.

That's the whole brief. The rest is up to you. Below: five stages, each with real Python code running scikit-learn in your browser. By the end you'll have a callable predict_churn() function — the final deliverable.

// The dataset

NorthWind customers

Customers: 1,000
Features: 7 raw
Target: churn (0/1)
Approx. churn rate: ~27%
Revenue per save: $500
Cost per false alarm: $30
Train/test split: 80 / 20
Imbalanced? Mildly
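
Before any modeling, the numbers in the card above already give rough bounds on what a model could be worth. The snippet below is a back-of-envelope sketch, not part of the course code: it treats the "~27%" churn rate as exact in the test split and assumes every contacted churner is actually saved, both of which are optimistic simplifications.

```python
# Back-of-envelope bounds from the dataset card (assumptions noted inline).
n_customers = 1000
test_size = n_customers * 20 // 100        # 80/20 split -> 200 test customers
churn_rate = 0.27                          # "~27%", treated as exact here
save_value, false_alarm_cost = 500, 30     # dollars, from the card

churners = round(test_size * churn_rate)   # expected churners in the test split
stayers = test_size - churners

# Assumption: every contacted churner is saved (optimistic).
perfect_model = churners * save_value                               # flag only churners
call_everyone = churners * save_value - stayers * false_alarm_cost  # flag all test customers
call_no_one = 0

print(f"perfect model : ${perfect_model:,}")
print(f"call everyone : ${call_everyone:,}")
print(f"call no one   : ${call_no_one:,}")
```

Under these assumptions the ceiling is $27,000 on the test split and "call everyone" already reaches $22,620, so a model earns its keep in the gap between the two. It is also a first hint that a low decision threshold may come out ahead in Stage 04.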
Part 02 · The pipeline

Five stages.
All real scikit-learn.

Run them one at a time, or click "Run all stages" to watch the entire pipeline execute top to bottom. Each stage builds on the previous one: variables persist across stages, so you can't skip ahead. The code is editable, so tweak things if you're curious.

Stage 01

Load & explore the data

Generate the customer dataset and inspect it. Real-world projects start here: load the data, look at its shape, check the target distribution, eyeball a few features. You can't model what you haven't looked at.

stage_01_load.py pandas · numpy · matplotlib
What the EDA tells us

Month-to-month contracts churn at a much higher rate than longer contracts, which is a strong signal the model can use. The monthly charges distributions for churners and stayers overlap heavily, which means it won't be a great standalone feature. Both of these observations will show up in the model results later.
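
The Stage 01 code pane is not reproduced here, so the following is only a minimal sketch of what "load and explore" could look like. The generator `make_customers`, its three columns (`tenure_months`, `monthly_charges`, `contract`), and all coefficients are hypothetical stand-ins, not the course's actual seven-feature dataset. Plots are left out to keep it short.

```python
import numpy as np
import pandas as pd

def make_customers(n=1000, seed=42):
    # Hypothetical stand-in for the NorthWind snapshot; columns and
    # coefficients are assumptions chosen to give roughly 27% churn.
    rng = np.random.default_rng(seed)
    contract = rng.choice(["month-to-month", "one-year", "two-year"],
                          size=n, p=[0.55, 0.25, 0.20])
    tenure = rng.integers(1, 73, size=n).astype(float)
    charges = rng.normal(70, 20, size=n).clip(20, 120)
    logit = (-0.7 + 1.6 * (contract == "month-to-month")
             - 0.04 * tenure + 0.01 * (charges - 70))
    churn = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)
    return pd.DataFrame({"tenure_months": tenure, "monthly_charges": charges,
                         "contract": contract, "churn": churn})

df = make_customers()

# Shape, target distribution, and a first look at two features.
print("shape:", df.shape)
print("churn rate:", round(df["churn"].mean(), 3))

churn_by_contract = (df.groupby("contract")["churn"].mean()
                       .sort_values(ascending=False))
print(churn_by_contract.round(3))

# Compare monthly charges for churners (1) and stayers (0).
print(df.groupby("churn")["monthly_charges"].agg(["mean", "std"]).round(1))
```

Grouping the target by a categorical column is usually the quickest way to spot a signal like the contract effect described above.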
Stage 02

Clean & prepare

Turn the raw dataframe into something a model can ingest. One-hot encode the categorical columns, stratified split into train/test to preserve class balance, then standardize numerical features so the linear model converges quickly.

stage_02_prepare.py sklearn.preprocessing · sklearn.model_selection
Why these steps matter

One-hot encoding lets the model treat each contract type as a separate signal rather than as ordinal numbers. A stratified split keeps the churn rate consistent between train and test; otherwise an unlucky split could leave the test set with a noticeably different churn rate and make the metrics misleading. Standardization helps logistic regression converge fast (tree-based models don't need it, but it doesn't hurt).
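
The three preparation steps can be sketched as follows. This is an illustrative version under assumptions, not the course's Stage 02 code: `make_customers` and its column names are hypothetical, and the scaler is fitted on the training rows only so that no test-set statistics leak into training.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def make_customers(n=1000, seed=42):
    # Hypothetical stand-in for the NorthWind snapshot (assumed columns).
    rng = np.random.default_rng(seed)
    contract = rng.choice(["month-to-month", "one-year", "two-year"],
                          size=n, p=[0.55, 0.25, 0.20])
    tenure = rng.integers(1, 73, size=n).astype(float)
    charges = rng.normal(70, 20, size=n).clip(20, 120)
    logit = (-0.7 + 1.6 * (contract == "month-to-month")
             - 0.04 * tenure + 0.01 * (charges - 70))
    churn = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)
    return pd.DataFrame({"tenure_months": tenure, "monthly_charges": charges,
                         "contract": contract, "churn": churn})

df = make_customers()

# 1. One-hot encode the categorical column.
X = pd.get_dummies(df.drop(columns="churn"), columns=["contract"]).astype(float)
y = df["churn"]

# 2. Stratified 80/20 split keeps the churn rate equal in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
X_train, X_test = X_train.copy(), X_test.copy()

# 3. Standardize numeric columns using training statistics only.
num_cols = ["tenure_months", "monthly_charges"]
scaler = StandardScaler().fit(X_train[num_cols])
X_train[num_cols] = scaler.transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])

print("train churn rate:", round(y_train.mean(), 3))
print("test churn rate :", round(y_test.mean(), 3))
```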
Stage 03

Train & compare 3 models

Don't fall in love with the first model that works. Try multiple algorithms on the same data and let the metrics decide. Here: a linear baseline (Logistic Regression), a bagged tree ensemble (Random Forest), and a boosted tree ensemble (Gradient Boosting). Bonus: extract feature importance so we know what the model learned.

stage_03_train.py sklearn.linear_model · sklearn.ensemble · sklearn.metrics
The winner

Gradient Boosting typically edges ahead on this kind of structured tabular data because it captures non-linear interactions between features (tenure × contract type, for example) that a linear model can't. Notice how feature importance lines up with the EDA: contract type is a dominant signal, alongside tenure.
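
A three-way comparison along these lines might look like the sketch below. It is not the course's Stage 03 code: the data generator, column names, hyperparameters, and the choice of ROC AUC as the comparison metric are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def make_customers(n=1000, seed=42):
    # Hypothetical stand-in for the NorthWind snapshot (assumed columns).
    rng = np.random.default_rng(seed)
    contract = rng.choice(["month-to-month", "one-year", "two-year"],
                          size=n, p=[0.55, 0.25, 0.20])
    tenure = rng.integers(1, 73, size=n).astype(float)
    charges = rng.normal(70, 20, size=n).clip(20, 120)
    logit = (-0.7 + 1.6 * (contract == "month-to-month")
             - 0.04 * tenure + 0.01 * (charges - 70))
    churn = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)
    return pd.DataFrame({"tenure_months": tenure, "monthly_charges": charges,
                         "contract": contract, "churn": churn})

df = make_customers()
X = pd.get_dummies(df.drop(columns="churn"), columns=["contract"]).astype(float)
y = df["churn"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Same data, three model families; only the linear model gets scaling.
models = {
    "logistic_regression": make_pipeline(StandardScaler(),
                                         LogisticRegression(max_iter=1000)),
    "random_forest": RandomForestClassifier(n_estimators=200, min_samples_leaf=5,
                                            random_state=42),
    "gradient_boosting": GradientBoostingClassifier(random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name:>20}: ROC AUC = {results[name]:.3f}")

# Impurity-based importances from the boosted model.
importances = pd.Series(models["gradient_boosting"].feature_importances_,
                        index=X.columns).sort_values(ascending=False)
print(importances.round(3))
```

Which model comes out on top depends on the data. This toy generator is itself a logistic model, so the linear baseline can match or beat the ensembles here; the point is the comparison loop, not the ranking.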
Stage 04

Evaluate & tune for business value

A model isn't done when training finishes. Pick the right metric for the business problem. Here, missing a churn (FN) costs us $500 in lost revenue. A false alarm (FP) costs us $30 in unnecessary retention effort. We tune the classification threshold to maximize expected dollars saved, not accuracy.

stage_04_evaluate.py sklearn.metrics · confusion matrix · ROC
This is where most ML projects go wrong

Optimizing accuracy on imbalanced data with asymmetric costs is a recipe for shipping useless models. By plugging in the actual business value per outcome (+$500 per save, -$30 per false alarm), we find the threshold that maximizes expected value, which is often well below the default 0.5. Always do this calculation. Always.
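
The threshold sweep described above can be sketched as a small function. This is an illustrative version, not the course's Stage 04 code: `expected_value` and `best_threshold` are hypothetical names, the labels and probabilities are a hand-made toy example rather than model output, and the value formula assumes every flagged churner is saved.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def expected_value(y_true, proba, threshold, save_value=500.0, false_alarm_cost=30.0):
    """Dollar value of flagging everyone with proba >= threshold."""
    pred = (np.asarray(proba) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, pred, labels=[0, 1]).ravel()
    # Assumption: each true positive is a save, each false positive a wasted call.
    return save_value * tp - false_alarm_cost * fp

def best_threshold(y_true, proba, **costs):
    """Sweep thresholds 0.05, 0.10, ..., 0.95 and return (threshold, value)."""
    thresholds = np.arange(5, 100, 5) / 100
    values = [expected_value(y_true, proba, t, **costs) for t in thresholds]
    best = int(np.argmax(values))  # first maximum, i.e. the lowest such threshold
    return float(thresholds[best]), float(values[best])

# Toy example: 8 customers, 3 of whom churned (hypothetical numbers).
y_true = np.array([1, 0, 1, 0, 0, 1, 0, 0])
proba = np.array([0.92, 0.61, 0.38, 0.33, 0.22, 0.17, 0.08, 0.03])

best_t, best_v = best_threshold(y_true, proba)
default_v = expected_value(y_true, proba, 0.5)
print(f"best threshold {best_t:.2f} -> ${best_v:,.0f}")
print(f"default 0.50   -> ${default_v:,.0f}")
```

On this toy data the sweep settles on 0.10 ($1,410) against $470 at the default 0.5. That matches the break-even intuition: with well-calibrated probabilities, a call pays off whenever p × 500 > (1 − p) × 30, that is, when p exceeds 30 / 530 ≈ 0.057.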
Stage 05

Wrap as a deployable function

A trained model in a notebook helps nobody. Wrap it as a callable function that takes a customer record and returns a decision. This is the exact shape a FastAPI endpoint would take (Module 11). Test it on three customer profiles spanning low/medium/high risk.

stage_05_deploy.py callable function · production-ready
You just built a production-ready ML system

That predict_churn() function is the actual deliverable. Drop it into a FastAPI app (Module 11), deploy to Railway or Hugging Face Spaces, expose /predict as an HTTP endpoint, and you're live. Most production ML systems, from Netflix recommendations to fraud detection at your bank, follow this same broad five-stage shape.
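
As a rough idea of the shape such a function could take, here is a minimal sketch. It is not the course's Stage 05 code: the data generator, the feature names, the return keys, and the default threshold of 0.3 are all placeholders (in the capstone, the model would come from Stage 03 and the threshold from Stage 04). Preprocessing lives inside a scikit-learn `Pipeline` so the function can accept a raw customer record, and the model is fitted on the full toy snapshot for brevity.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def make_customers(n=1000, seed=42):
    # Hypothetical stand-in for the NorthWind snapshot (assumed columns).
    rng = np.random.default_rng(seed)
    contract = rng.choice(["month-to-month", "one-year", "two-year"],
                          size=n, p=[0.55, 0.25, 0.20])
    tenure = rng.integers(1, 73, size=n).astype(float)
    charges = rng.normal(70, 20, size=n).clip(20, 120)
    logit = (-0.7 + 1.6 * (contract == "month-to-month")
             - 0.04 * tenure + 0.01 * (charges - 70))
    churn = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)
    return pd.DataFrame({"tenure_months": tenure, "monthly_charges": charges,
                         "contract": contract, "churn": churn})

FEATURES = ["tenure_months", "monthly_charges", "contract"]

# Preprocessing and model bundled together, so raw records go straight in.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["tenure_months", "monthly_charges"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["contract"]),
])
model = Pipeline([("prep", preprocess),
                  ("clf", GradientBoostingClassifier(random_state=42))])
df = make_customers()
model.fit(df[FEATURES], df["churn"])

def predict_churn(customer: dict, threshold: float = 0.3) -> dict:
    """Score one customer record; threshold 0.3 is a placeholder value."""
    row = pd.DataFrame([customer])[FEATURES]
    proba = float(model.predict_proba(row)[0, 1])
    return {"churn_probability": round(proba, 3),
            "flag_for_retention": proba >= threshold}

# Three hypothetical profiles spanning low, medium, and high risk.
profiles = {
    "low_risk": {"tenure_months": 60, "monthly_charges": 50.0, "contract": "two-year"},
    "medium_risk": {"tenure_months": 24, "monthly_charges": 70.0, "contract": "one-year"},
    "high_risk": {"tenure_months": 2, "monthly_charges": 95.0, "contract": "month-to-month"},
}
for name, customer in profiles.items():
    print(name, predict_churn(customer))
```

A plain dict in, plain dict out keeps the function easy to wrap in an HTTP handler later, since both map directly to JSON.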
Part 03 · Skills you just synthesized

11 modules of knowledge.
One project.

Every skill you learned across Course 2 appears somewhere in the capstone. Not as concepts — as code that just ran in your browser.

// Module 01
Python & Pyodide
Stages 1-5 · all of them
// Module 02
Data mindset
Stage 1: load, EDA, plot
// Module 03
Regression
Stage 3: LogisticRegression
// Module 04
Trees & forests
Stage 3: Random Forest + GBM
// Module 05
Unsupervised
Could segment customers next
// Module 06
Neural networks
Could substitute for boosting
// Module 10
Evaluation
Stage 4: confusion + ROC
// Module 11
Deploy
Stage 5: callable function
Part 04 · Your certificate

Make it official.

Type your name, download the certificate. It's yours: generated client-side, no signup required.

// Certificate of Completion
Build Your First Real AI
Course 2 · Intermediate · 12 modules · capstone delivered
Date
Project: Churn Prediction
Stack: Python · sklearn
Course 2 · Complete

You built a real ML system.

From model.predict() in a notebook to a live, business-tuned, deployable function. You can do this. You just did.

You are now intermediate-level at applied ML.
The world is full of tabular data, broken metrics, and brittle prototypes.
Now you know what to do with all of them.