Course 2 · Module 02 · 80 minutes

Real data is
a lie.
Until you clean it.

Textbook datasets are spotless. Real ones are missing values, typos, duplicates, and outliers that lie to your face. In the next 80 minutes, you'll clean a real messy dataset using the 5 moves every ML practitioner uses on day one.

You'll clean

200 messy rows

You'll learn

5 cleaning moves

You'll see

Real before/after

Open the data

employees.csv · raw

Alice · Bangalore · 85,000
Bob · Mumbai · 92,000
Carlos · — null — · 110,000
Devi · bangalore · 10,000,000
Esme · BLR · 70,000
Frank · Delhi · — null —
Greta · Bombay · 98,000

3 cities are all "Bangalore" · 7 issues

Part 01 · The map

Every messy dataset needs
the same five fixes.

There's a reason ML practitioners get bored cleaning data — it's always the same five moves. Memorize this list. Then apply it in the explorer below.

Inspect

Shape, types, sample rows. What's actually in front of you? Most beginners skip this and pay later.

df.head(), df.shape, df.dtypes
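A minimal sketch of the inspect step. The column names (`name`, `city`, `salary`) and the in-memory rows are assumptions modeled on the preview above; in the course you would load `employees.csv` instead.

```python
import pandas as pd

# Hypothetical sample modeled on the raw preview; column names are assumed.
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Carlos", "Devi", "Esme", "Frank", "Greta"],
    "city": ["Bangalore", "Mumbai", None, "bangalore", "BLR", "Delhi", "Bombay"],
    "salary": [85000, 92000, 110000, 10000000, 70000, None, 98000],
})

print(df.shape)   # (number of rows, number of columns)
print(df.dtypes)  # salary is stored as float because one value is missing
print(df.head())  # first five rows
```

Note how the dtype already hints at a problem: an integer-looking salary column stored as float usually means a null is hiding in it.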

Missing

Which columns have nulls? How many? Where? Decide: drop, fill, or flag.

df.isnull().sum()
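The three options (drop, fill, flag) might look like this in pandas. This is a sketch on a hypothetical three-row frame; filling with the median is one common choice, not the only defensible one.

```python
import pandas as pd

# Hypothetical rows; column names are assumed for illustration.
df = pd.DataFrame({
    "name": ["Carlos", "Frank", "Alice"],
    "city": [None, "Delhi", "Bangalore"],
    "salary": [110000.0, None, 85000.0],
})

# How many nulls per column?
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Flag: record which salaries were missing before changing anything.
df["salary_was_missing"] = df["salary"].isnull()

# Fill: replace missing salaries with the column median.
df["salary"] = df["salary"].fillna(df["salary"].median())

# Drop: remove rows that still have no city.
dropped = df.dropna(subset=["city"])
print(dropped)
```

Flagging before filling keeps the information that a value was imputed, which a model can sometimes use.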

Outliers

Numbers way outside the normal range. Usually data entry errors. Sometimes real anomalies — both matter.

df.describe(), boxplot
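One way to turn the boxplot into a number is the 1.5 × IQR rule that boxplot whiskers conventionally use. The sketch below applies it to the salaries from the raw preview; the threshold is a convention, not a law, so treat flagged rows as candidates to investigate.

```python
import pandas as pd

# Salaries from the raw preview (the missing one is left out).
salary = pd.Series([85000, 92000, 110000, 10000000, 70000, 98000])

q1 = salary.quantile(0.25)
q3 = salary.quantile(0.75)
iqr = q3 - q1

# Anything beyond 1.5 * IQR from the quartiles is flagged for review.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = salary[(salary < lower) | (salary > upper)]

print(outliers.tolist())
```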

Duplicates

Exact or fuzzy. Even one duplicated row can poison your evaluation. Find them, decide what to do.

df.duplicated()
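A sketch of exact versus fuzzy duplicates. The `email` column and the sample rows are assumptions (the summary later counts duplicate emails); here "fuzzy" only means ignoring case and stray whitespace, which is the simplest form.

```python
import pandas as pd

# Hypothetical rows; the email column is assumed.
df = pd.DataFrame({
    "name": ["Alice", "Alice", "Bob", "Bob K."],
    "email": ["alice@example.com", "alice@example.com",
              "bob@example.com", "BOB@example.com "],
})

# Exact duplicates: every column matches an earlier row.
exact_count = int(df.duplicated().sum())

# Fuzzy duplicates: compare on a normalized key instead of the raw text.
email_key = df["email"].str.strip().str.lower()
by_email_count = int(email_key.duplicated().sum())

# Keep the first row seen for each normalized email.
deduped = df[~email_key.duplicated()]

print(exact_count, by_email_count, len(deduped))
```

Which row to keep (first, last, most complete) is a decision about your data, not something pandas can make for you.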

Categorical chaos

"bangalore", "Bangalore", "BLR", "Bombay" — your model sees four different cities. Fix it.

df['col'].value_counts()
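A sketch of one way to collapse the variants: normalize casing and whitespace first, then map known aliases to a canonical name. The alias table is an assumption you would build by reading the `value_counts()` output yourself.

```python
import pandas as pd

# City values as they appear in the raw preview.
cities = pd.Series(["Bangalore", "Mumbai", "bangalore", "BLR", "Delhi", "Bombay"])

# Hand-built alias table (an assumption; extend it for your own data).
aliases = {"blr": "bangalore", "bombay": "mumbai"}

normalized = (
    cities.str.strip()   # remove stray whitespace
          .str.lower()   # make casing consistent
          .replace(aliases)  # map known aliases to one spelling
          .str.title()   # restore display casing
)

print(normalized.value_counts())
```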

Part 02 · Hands on

Live dataset.
Five real investigations.

The explorer below contains a 200-row employee dataset with realistic problems baked in. Click through each investigation. See the issues. Apply the fix. Watch the dataset get cleaner.


You cleaned a real dataset.

Here's what changed. This exact workflow is what data scientists do at the start of every project.

// Before

Rows: 200
Missing values: —
Outliers (salary): —
Duplicate emails: —
City spelling variants: —
Data quality score: 42%

// After

Rows: —
Missing values: 0
Outliers (salary): 0
Duplicate emails: 0
City spelling variants: —
Data quality score: 100%

Part 03 · The lying mean

Why summary stats
lie on dirty data.

A single uncaught outlier can completely break the picture you have of your data. Here's exactly how — using a tiny example everyone has seen, but few have felt.

7 employees · monthly salary

Alice · ₹65,000
Bob · ₹72,000
Carlos · ₹68,000
Devi · ₹75,000
Esme · ₹70,000
Frank · ₹66,000
Greta · ₹10,000,000

Mean (average)

₹1,488,000

A salary nobody on the team earns. The mean is supposed to be "the typical salary at this company", but here it's pure fiction. One typo multiplied the average roughly 21x: the other six salaries average about ₹69,333.

Median (middle value)

₹70,000

The middle value when sorted. The outlier sits at the far end of the sorted list, so the value in the middle is still an ordinary salary. This is what "typical" actually looks like.

The rule.

Use the mean when your data is clean and roughly symmetric. Use the median when you suspect outliers — and you should always suspect outliers. The first thing data scientists do with any salary, price, or count column is check whether the mean and median agree. If they don't, something is hiding.
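The check described in the rule can be reproduced with the seven salaries above. This is a small sketch; how large a gap counts as "disagreeing" is a judgment call, so the code only prints the ratio.

```python
import pandas as pd

# The seven monthly salaries from the example, in rupees.
salaries = pd.Series([65000, 72000, 68000, 75000, 70000, 66000, 10000000])

mean = salaries.mean()
median = salaries.median()
ratio = mean / median

print(f"mean:   {mean:,.0f}")
print(f"median: {median:,.0f}")
print(f"mean is {ratio:.1f}x the median")

# Same comparison with the suspicious row left out.
clean_mean = salaries[salaries < 1000000].mean()
print(f"mean without the outlier: {clean_mean:,.0f}")
```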

Part 04 · The smell test

7 data smells.
If you see one, investigate.

Senior data scientists develop a sixth sense for when data is off. Here are the seven smells that should trigger it. Print this list.

Mean very different from median

Strong signal of outliers or a skewed distribution. Visualize before computing anything else.

Check with: df.describe() — compare mean & 50%

"100% accuracy" on a real model

A score that looks too good to be true usually is. You've likely got data leakage (a column secretly contains the answer).

Check by: examining feature importance · removing high-importance features and retesting
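As a cheaper first screen before looking at feature importance, you can check whether any numeric column is almost perfectly correlated with the target. This is a different technique from the check above and a sketch only: the column names, the data, and the 0.95 cutoff are all assumptions, and it will miss leakage through non-numeric or non-linear relationships.

```python
import pandas as pd

# Hypothetical dataset; exit_bonus_paid is a made-up leaky column
# that is only known after the outcome has happened.
df = pd.DataFrame({
    "tenure_years":    [1, 7, 3, 2, 8, 4],
    "exit_bonus_paid": [1, 0, 1, 1, 0, 0],
    "left_company":    [1, 0, 1, 1, 0, 0],  # target
})

target = "left_company"

# Absolute correlation of every feature with the target.
corr = df.corr()[target].drop(target).abs().sort_values(ascending=False)

# Assumed cutoff: anything this close to the target deserves a hard look.
suspects = corr[corr > 0.95].index.tolist()

print(corr)
print("possible leakage:", suspects)
```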

A column with mostly nulls

If >50% of a column is missing, ask: is filling it honest? Often, dropping the column entirely is the cleaner choice.

Check with: df.isnull().mean() (fraction missing per column; multiply by 100 for %)
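A sketch of the >50% check. `df.isnull().mean()` returns a fraction between 0 and 1 per column, so the threshold is written as 0.5. The column names and rows are hypothetical.

```python
import pandas as pd

# Hypothetical frame with one mostly-empty column.
df = pd.DataFrame({
    "salary": [85000.0, None, 98000.0, 70000.0],
    "middle_name": [None, None, None, "K."],
})

# Fraction of missing values per column (0.0 to 1.0).
missing_fraction = df.isnull().mean()

# Columns where more than half the values are missing.
mostly_null = missing_fraction[missing_fraction > 0.5].index.tolist()

print(missing_fraction)
print("candidates to drop:", mostly_null)
```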

A categorical column with too many unique values

"City" with 4,000 unique values for 5,000 rows usually means typos, casing inconsistency, or both.

Check with: df['col'].nunique() & df['col'].value_counts()
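One quick way to tell how much of the variety is just casing: compare the unique count before and after lowercasing. The sample values below are hypothetical.

```python
import pandas as pd

# Hypothetical city column with casing variants.
city = pd.Series(["Bangalore", "bangalore", "BANGALORE", "Mumbai", "mumbai", "Delhi"])

raw_unique = city.nunique()
lowered_unique = city.str.lower().nunique()

# Share of rows that carry a distinct value; close to 1.0 is suspicious
# for a column that should have only a handful of categories.
unique_ratio = raw_unique / len(city)

print(raw_unique, lowered_unique, unique_ratio)
```

If the count barely drops after lowercasing, the extra values are more likely typos or aliases than casing.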

Suspicious round numbers everywhere

If every salary ends in 000 or every age is a multiple of 5, the data was bucketed or guessed. Treat with appropriate skepticism.

Check with: (df['col'] % 1000 == 0).mean() (share of values ending in 000; a share close to 1 = a problem)
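A sketch of the round-number check for the "every salary ends in 000" case, which means testing divisibility by 1000. The salaries are hypothetical, and what share counts as "too many" depends on the column.

```python
import pandas as pd

# Hypothetical salaries: five round values and one precise one.
salary = pd.Series([85000, 92000, 110000, 70000, 98000, 67250])

# True where the value is an exact multiple of 1000.
is_round = salary % 1000 == 0

# The mean of a boolean Series is the share of True values.
share_round = is_round.mean()

print(f"{share_round:.0%} of salaries end in 000")
```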

Impossible values

Negative ages. Heights of 12 meters. Dates before the company existed. Always sanity-check column ranges against physical reality.

Check with: df.describe() — look at min & max
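Once you know the plausible range for a column, the min/max eyeball check can be made explicit. The bounds below are assumptions chosen for illustration; pick limits that match your own domain.

```python
import pandas as pd

# Hypothetical rows with one impossible value per column.
df = pd.DataFrame({
    "age": [34, -2, 51, 29],
    "height_m": [1.70, 1.82, 12.0, 1.65],
})

# Assumed plausible (low, high) bounds per column, inclusive.
limits = {"age": (0, 120), "height_m": (0.3, 2.8)}

# Row labels whose value falls outside the allowed range.
# Missing values also fail between(), so they are reported too.
bad_rows = {
    col: df.index[~df[col].between(low, high)].tolist()
    for col, (low, high) in limits.items()
}

print(bad_rows)
```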

A column you don't understand

If you can't explain in one sentence what a column represents, do not put it into a model. Ask whoever made the data first.

Check with: actual humans, not docs

Course 2 · Module 02 complete

You just did 60% of
every real ML project
in one module.

No exaggeration. Cleaning data is the bulk of practical machine learning. You now know the five moves and have applied them on a real dataset. The "modeling" part starts in Module 03.

Up next · Course 2 · Module 03

Supervised Learning I — Regression

Your first real ML model. You'll drag points on a scatter plot and watch a regression line fit them in real time. Then train a model on real data using scikit-learn. Predictions, not just descriptions.

Continue to Module 03
