Course 2 · Module 12 · Capstone Project

Part 01 · The brief

Predict who's about to churn.

The setup

You're a data scientist at NorthWind Telecom. The retention team comes to you with a clear problem: roughly 27% of new customers cancel within their first year. Acquiring a new customer costs ~$500 in marketing. Retaining one is almost free by comparison, if you can identify them in time.

They give you a snapshot of 1,000 historical customers and ask you to build a system that flags who's likely to churn next month — so retention specialists can call those customers first.

That's the whole brief. The rest is up to you. Below: five stages, each with real Python code running scikit-learn in your browser. By the end you'll have a callable predict_churn() function — the final deliverable.

The dataset

NorthWind customers

Customers: 1,000
Features: 7 raw
Target: churn (0/1)
Approx. churn rate: ~27%
Revenue per save: $500
Cost per false alarm: $30
Train/test split: 80 / 20
Imbalanced? Mildly

Part 02 · The pipeline

Five stages.
All real scikit-learn.

Run them one at a time, or click "Run all stages" to watch the entire pipeline execute top to bottom. Each stage builds on the previous one: variables persist across stages, so you can't skip ahead. The code is editable, so tweak things if you're curious.

Stage 01

Load & explore the data

Generate the customer dataset and inspect it. Real-world projects start here: load the data, look at its shape, check the target distribution, eyeball a few features. You can't model what you haven't looked at.

stage_01_load.py (pandas · numpy · matplotlib)

What the EDA tells us

Month-to-month contracts churn at a much higher rate than longer contracts. That's a strong signal the model can use. The monthly charges distributions for churners and stayers overlap heavily, which means it won't be a great standalone feature. Both of these observations will show up in the model results later.
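To make the first-look checks concrete, here is a minimal sketch. It is not the course's stage_01_load.py: the column names (tenure_months, monthly_charges, contract) and the churn mechanism are assumptions invented for illustration, chosen so that the checks described above have something to show.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the NorthWind snapshot. Column names and the
# churn mechanism are assumptions, not the real dataset.
rng = np.random.default_rng(42)
n = 1_000
df = pd.DataFrame({
    "tenure_months": rng.integers(1, 73, size=n),
    "monthly_charges": rng.uniform(20, 120, size=n).round(2),
    "contract": rng.choice(
        ["month-to-month", "one-year", "two-year"], size=n, p=[0.55, 0.25, 0.20]
    ),
})

# Assumed mechanism: contract type sets a base risk, short tenure raises it.
base = df["contract"].map(
    {"month-to-month": 0.43, "one-year": 0.12, "two-year": 0.05}
)
p_churn = base * np.where(df["tenure_months"] < 12, 1.4, 0.9)
df["churn"] = (rng.random(n) < p_churn).astype(int)

# 1. Shape: how many rows and columns are there?
print(df.shape)

# 2. Target distribution: how imbalanced is the problem?
print(df["churn"].value_counts(normalize=True).round(3))

# 3. Churn rate per contract type: is there an obvious categorical signal?
churn_by_contract = (
    df.groupby("contract")["churn"].mean().sort_values(ascending=False)
)
print(churn_by_contract.round(3))

# 4. Numeric feature split by target: do the distributions overlap?
print(df.groupby("churn")["monthly_charges"].describe().round(1))
```

In this synthetic data monthly_charges is generated independently of churn, so its per-class summaries overlap almost completely, which mirrors the kind of weak standalone feature described above.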

Stage 02

Clean & prepare

Turn the raw dataframe into something a model can ingest. One-hot encode the categorical columns, make a stratified train/test split to preserve class balance, then standardize the numerical features so the linear model converges quickly.

stage_02_prepare.py (sklearn.preprocessing · sklearn.model_selection)

Why these steps matter

One-hot encoding lets the model treat each contract type as a separate signal rather than as ordinal numbers. A stratified split keeps the churn rate consistent between train and test. Without it, a random split on only 1,000 rows can leave the test set with noticeably more or fewer churners than the training set, which skews the metrics. Standardization helps logistic regression converge fast (tree-based models don't need it, but it doesn't hurt).
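The three preparation steps can be sketched as follows. This is an illustration under assumptions, not the course's stage_02_prepare.py: the columns are hypothetical and the target here is random noise, since only the mechanics of encoding, splitting, and scaling matter. One design choice worth noting is that the scaler is fitted on the training rows only, so no test-set statistics leak into training.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical raw frame; the random target only serves the mechanics.
rng = np.random.default_rng(0)
n = 1_000
df = pd.DataFrame({
    "tenure_months": rng.integers(1, 73, size=n).astype(float),
    "monthly_charges": rng.uniform(20, 120, size=n),
    "contract": rng.choice(["month-to-month", "one-year", "two-year"], size=n),
})
df["churn"] = (rng.random(n) < 0.27).astype(int)

# 1. One-hot encode: one 0/1 column per contract type.
X = pd.get_dummies(df.drop(columns="churn"), columns=["contract"], dtype=float)
y = df["churn"]

# 2. Stratified 80/20 split: both parts keep roughly the same churn rate.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)

# 3. Standardize numeric columns, fitting the scaler on training rows only.
num_cols = ["tenure_months", "monthly_charges"]
scaler = StandardScaler().fit(X_train[num_cols])
X_train, X_test = X_train.copy(), X_test.copy()
X_train[num_cols] = scaler.transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])

print(X_train.shape, X_test.shape)
print(round(y_train.mean(), 3), round(y_test.mean(), 3))
```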

Stage 03

Train & compare 3 models

Don't fall in love with the first model that works. Try multiple algorithms on the same data and let the metrics decide. Here: a linear baseline (Logistic Regression), a bagged tree ensemble (Random Forest), and a boosted tree ensemble (Gradient Boosting). Bonus: extract feature importance so we know what the model learned.

stage_03_train.py (sklearn.linear_model · sklearn.ensemble · sklearn.metrics)

The winner

Gradient Boosting typically edges ahead on this kind of structured tabular data. It captures non-linear interactions between features (tenure × contract type, for example) that a linear model can't. Notice how feature importance confirms what our EDA showed: contract type and tenure are the dominant signals.
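A minimal sketch of the three-way comparison, assuming synthetic data from a hypothetical make_customers() helper (invented column names and churn mechanism, not the NorthWind snapshot). ROC AUC is used as the comparison metric here as an assumption; which model comes out ahead depends on the data, so no winner is hard-coded.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def make_customers(n=1_000, seed=42):
    """Hypothetical synthetic customers; not the NorthWind data."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "tenure_months": rng.integers(1, 73, size=n).astype(float),
        "monthly_charges": rng.uniform(20, 120, size=n),
        "contract": rng.choice(
            ["month-to-month", "one-year", "two-year"], size=n, p=[0.55, 0.25, 0.20]
        ),
    })
    base = df["contract"].map(
        {"month-to-month": 0.43, "one-year": 0.12, "two-year": 0.05}
    )
    p_churn = base * np.where(df["tenure_months"] < 12, 1.4, 0.9)
    df["churn"] = (rng.random(n) < p_churn).astype(int)
    return df


df = make_customers()
X = pd.get_dummies(df.drop(columns="churn"), columns=["contract"], dtype=float)
y = df["churn"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)

# Same data, three algorithms. Only the linear model gets a scaler.
models = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "random_forest": RandomForestClassifier(
        n_estimators=200, min_samples_leaf=5, random_state=0
    ),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

auc = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    auc[name] = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name:20s} test ROC AUC = {auc[name]:.3f}")

# Impurity-based importances from the boosted model (they sum to 1).
importances = pd.Series(
    models["gradient_boosting"].feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances.round(3))
```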

Stage 04

Evaluate & tune for business value

A model isn't done when training finishes. Pick the right metric for the business problem. Here, missing a churn (FN) costs us $500 in lost revenue. A false alarm (FP) costs us $30 in unnecessary retention effort. We tune the classification threshold to maximize expected dollars saved, not accuracy.

stage_04_evaluate.py (sklearn.metrics · confusion matrix · ROC)

This is where most ML projects go wrong

Optimizing accuracy on imbalanced data with asymmetric costs is a recipe for shipping useless models. By plugging in the actual business value per outcome ($500 per save, -$30 per false alarm), we find the threshold that maximizes revenue, often well below the default 0.5. Always do this calculation. Always.
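The threshold calculation can be sketched in a few lines of NumPy. This is an illustration, not the course's stage_04_evaluate.py: the function names, the threshold grid, and the eight toy customers are hypothetical, and it assumes every correctly flagged churner is actually saved. Under that assumption, and if the probabilities are well calibrated, calling a customer with churn probability p pays off when 500p > 30(1 - p), which gives p > 30/530 ≈ 0.057. That is why the best threshold tends to sit far below 0.5.

```python
import numpy as np

VALUE_PER_SAVE = 500       # from the brief: revenue per save
COST_PER_FALSE_ALARM = 30  # from the brief: cost per false alarm


def expected_value(y_true, proba, threshold):
    """Dollars gained if everyone at or above the threshold is called.

    Assumes every flagged churner is retained (a simplification).
    """
    flagged = proba >= threshold
    tp = int(np.sum(flagged & (y_true == 1)))  # churners we called
    fp = int(np.sum(flagged & (y_true == 0)))  # stayers we called for nothing
    return VALUE_PER_SAVE * tp - COST_PER_FALSE_ALARM * fp


def best_threshold(y_true, proba, grid=None):
    """Sweep a grid of thresholds and return (threshold, value) at the maximum."""
    if grid is None:
        grid = np.linspace(0.05, 0.95, 19)  # 0.05, 0.10, ..., 0.95
    values = [expected_value(y_true, proba, t) for t in grid]
    i = int(np.argmax(values))
    return float(grid[i]), values[i]


# Toy example: 3 churners, 5 stayers, with made-up model probabilities.
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
proba = np.array([0.92, 0.61, 0.22, 0.71, 0.33, 0.17, 0.12, 0.07])

value_at_default = expected_value(y_true, proba, 0.5)
t_best, v_best = best_threshold(y_true, proba)
print(f"value at 0.50: ${value_at_default}")
print(f"best threshold: {t_best:.2f} -> ${v_best}")
```

On the toy data, the default threshold of 0.5 misses the churner scored at 0.22. Lowering the threshold catches that customer at the price of one extra $30 call.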

Stage 05

Wrap as a deployable function

A trained model in a notebook helps nobody. Wrap it as a callable function that takes a customer record and returns a decision. This is the exact shape a FastAPI endpoint would take (Module 11). Test it on three customer profiles spanning low/medium/high risk.

stage_05_deploy.py (callable function · production-ready)

You just built a production-ready ML system

That predict_churn() function is the actual deliverable. Drop it into a FastAPI app (Module 11), deploy to Railway or Hugging Face Spaces, expose /predict as an HTTP endpoint, and you're live. Many production ML systems, from Netflix recommendations to fraud detection at your bank, follow this same broad 5-stage shape.
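One possible shape for predict_churn(), as a sketch under assumptions rather than the course's stage_05_deploy.py. The training data, column names, the 0.20 threshold, and the returned keys are all hypothetical. Preprocessing is bundled with the model in a single scikit-learn Pipeline so that a raw customer record goes through exactly the same encoding and scaling as the training data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

NUM_COLS = ["tenure_months", "monthly_charges"]  # hypothetical feature names
CAT_COLS = ["contract"]
THRESHOLD = 0.20  # hypothetical value; in practice taken from the Stage 04 sweep

# Hypothetical synthetic training data; not the NorthWind snapshot.
rng = np.random.default_rng(42)
n = 1_000
train = pd.DataFrame({
    "tenure_months": rng.integers(1, 73, size=n).astype(float),
    "monthly_charges": rng.uniform(20, 120, size=n),
    "contract": rng.choice(
        ["month-to-month", "one-year", "two-year"], size=n, p=[0.55, 0.25, 0.20]
    ),
})
base = train["contract"].map(
    {"month-to-month": 0.43, "one-year": 0.12, "two-year": 0.05}
)
p_churn = base * np.where(train["tenure_months"] < 12, 1.4, 0.9)
train["churn"] = (rng.random(n) < p_churn).astype(int)

# Preprocessing and model travel together as one fitted object.
model = Pipeline([
    ("prep", ColumnTransformer([
        ("num", StandardScaler(), NUM_COLS),
        ("cat", OneHotEncoder(handle_unknown="ignore"), CAT_COLS),
    ])),
    ("clf", GradientBoostingClassifier(random_state=0)),
])
model.fit(train[NUM_COLS + CAT_COLS], train["churn"])


def predict_churn(customer: dict, threshold: float = THRESHOLD) -> dict:
    """Score one raw customer record and decide whether to call them."""
    row = pd.DataFrame([customer])[NUM_COLS + CAT_COLS]  # KeyError if a field is missing
    proba = float(model.predict_proba(row)[0, 1])
    return {"churn_probability": round(proba, 3), "contact": proba >= threshold}


# Three profiles intended to span low, medium, and high risk.
low = {"tenure_months": 60, "monthly_charges": 40.0, "contract": "two-year"}
medium = {"tenure_months": 24, "monthly_charges": 70.0, "contract": "one-year"}
high = {"tenure_months": 2, "monthly_charges": 95.0, "contract": "month-to-month"}
for name, customer in [("low", low), ("medium", medium), ("high", high)]:
    print(name, predict_churn(customer))
```

Because the function takes a plain dict and returns a plain dict, it maps directly onto a JSON request and response body if it is later placed behind an HTTP endpoint.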

One project.
Everything
you've learned.

11 modules of knowledge.
One project.

Python & Pyodide

Data mindset

Regression

Trees & forests

Unsupervised

Neural networks

Evaluation

Deploy

Make it official.

You built
a real ML system.
