AI Skill Course · Course 2 · Intermediate
Module 02 of 12
Course 2 · Module 02 · 80 minutes

Real data is
a lie.
Until you clean it.

Textbook datasets are spotless. Real ones are full of missing values, typos, duplicates, and outliers that lie to your face. In the next 80 minutes, you'll clean a real messy dataset using the 5 moves every ML practitioner uses on day one.

You'll clean: 200 messy rows
You'll learn: 5 cleaning moves
You'll see: Real before/after
employees.csv · raw

name     city        salary
Alice    Bangalore   85,000
Bob      Mumbai      92,000
Carlos   null        110,000
Devi     bangalore   10,000,000
Esme     BLR         70,000
Frank    Delhi       null
Greta    Bombay      98,000

3 cities are all "Bangalore" · 7 issues
Part 01 · The map

Every messy dataset needs
the same five fixes.

There's a reason ML practitioners get bored cleaning data — it's always the same five moves. Memorize this list. Then apply it in the explorer below.

01
Inspect

Shape, types, sample rows. What's actually in front of you? Most beginners skip this and pay later.

df.head(), df.shape, df.dtypes
02
Missing

Which columns have nulls? How many? Where? Decide: drop, fill, or flag.

df.isnull().sum()
03
Outliers

Numbers way outside the normal range. Usually data entry errors. Sometimes real anomalies — both matter.

df.describe(), boxplot
04
Duplicates

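Exact or fuzzy. Even one duplicated row can poison your evaluation: if a copy lands in both the training and test sets, the model is scored on data it has already seen. Find them, decide what to do.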

df.duplicated()
05
Categorical chaos

"bangalore", "Bangalore", "BLR": your model sees three different cities where there is only one. "Bombay" and "Mumbai" have the same problem. Fix it.

df['col'].value_counts()
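
As a rough sketch, the five moves can be run against the seven preview rows from employees.csv. The eighth row (a repeat of Bob), the alias table CITY_ALIASES, and the 1.5 × IQR outlier fence are assumptions added for illustration; the course explorer may apply different rules.

```python
import pandas as pd

# The seven preview rows from employees.csv, plus one hypothetical repeat
# of Bob (not in the original preview) so the duplicate check has a hit.
raw = pd.DataFrame({
    "name": ["Alice", "Bob", "Carlos", "Devi", "Esme", "Frank", "Greta", "Bob"],
    "city": ["Bangalore", "Mumbai", None, "bangalore", "BLR", "Delhi", "Bombay", "Mumbai"],
    "salary": [85_000, 92_000, 110_000, 10_000_000, 70_000, None, 98_000, 92_000],
})

# 01 Inspect: shape, types, sample rows.
print(raw.shape)
print(raw.dtypes)
print(raw.head())

# 02 Missing: count nulls per column.
missing_per_column = raw.isnull().sum()
print(missing_per_column)

# 03 Outliers: flag salaries outside the 1.5 * IQR fences (one common rule
# of thumb, assumed here). NaN salaries compare as False and are not flagged.
q1, q3 = raw["salary"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (raw["salary"] < q1 - 1.5 * iqr) | (raw["salary"] > q3 + 1.5 * iqr)
print(raw.loc[is_outlier, ["name", "salary"]])

# 04 Duplicates: rows identical to an earlier row.
is_duplicate = raw.duplicated()
print(raw[is_duplicate])

# 05 Categorical chaos: count the spellings, then map aliases to one name.
print(raw["city"].value_counts(dropna=False))
CITY_ALIASES = {"blr": "bangalore", "bombay": "mumbai"}  # hypothetical alias table

clean = raw.drop_duplicates().copy()
clean["salary_outlier"] = is_outlier.loc[clean.index]  # flag for review, don't delete
normalized = clean["city"].str.strip().str.lower()
clean["city"] = normalized.replace(CITY_ALIASES).str.title()
print(clean)
```

Flagging the outlier instead of deleting it keeps the decision (typo or real anomaly) visible to whoever reviews the data next.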
Part 02 · Hands on

Live dataset.
Five real investigations.

The explorer below contains a 200-row employee dataset with realistic problems baked in. Click through each investigation. See the issues. Apply the fix. Watch the dataset get cleaner.


You cleaned a real dataset.

Here's what changed. This exact workflow is what data scientists do at the start of every project.

// Before
Rows: 200
Missing values
Outliers (salary)
Duplicate emails
City spelling variants
Data quality score: 42%

// After
Rows
Missing values: 0
Outliers (salary): 0
Duplicate emails: 0
City spelling variants
Data quality score: 100%
Part 03 · The lying mean

Why summary stats
lie on dirty data.

A single uncaught outlier can completely break the picture you have of your data. Here's exactly how — using a tiny example everyone has seen, but few have felt.

7 employees · monthly salary

Alice    ₹65,000
Bob      ₹72,000
Carlos   ₹68,000
Devi     ₹75,000
Esme     ₹70,000
Frank    ₹66,000
Greta    ₹10,000,000
Mean (average): ₹1,488,000
A salary nobody on the team earns. The mean is supposed to be "the typical salary at this company", but here it is pure fiction. One typo multiplied the average by about 21x: the other six salaries average roughly ₹69,333.
Median (middle value): ₹70,000
The middle value when sorted. The outlier lands at the far end of the sorted list, so the middle does not move. This is what "typical" actually looks like.
The rule.

Use the mean when your data is clean and roughly symmetric. Use the median when you suspect outliers — and you should always suspect outliers. The first thing data scientists do with any salary, price, or count column is check whether the mean and median agree. If they don't, something is hiding.
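
A minimal sketch of that mean-versus-median check, using the seven salaries listed above. The SUSPICION_RATIO threshold is a made-up rule of thumb for illustration, not a standard value.

```python
import pandas as pd

salaries = pd.Series(
    [65_000, 72_000, 68_000, 75_000, 70_000, 66_000, 10_000_000],
    index=["Alice", "Bob", "Carlos", "Devi", "Esme", "Frank", "Greta"],
    name="monthly_salary",
)

mean_salary = salaries.mean()
median_salary = salaries.median()
print(f"mean   = {mean_salary:,.0f}")    # mean   = 1,488,000
print(f"median = {median_salary:,.0f}")  # median = 70,000

# Hypothetical rule of thumb: investigate when the mean is more than
# 1.5x the median (the threshold is an assumption, not a standard).
SUSPICION_RATIO = 1.5
needs_a_look = mean_salary / median_salary > SUSPICION_RATIO
print(needs_a_look)  # True

# Drop Greta's entry and the two statistics agree again.
without_greta = salaries.drop("Greta")
print(f"{without_greta.mean():,.0f} vs {without_greta.median():,.0f}")  # 69,333 vs 69,000
```

A ratio works better than a fixed difference here because it does not depend on the currency or scale of the column.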

Part 04 · The smell test

7 data smells.
If you see one, investigate.

Senior data scientists develop a sixth sense for when data is off. Here are the seven smells that should trigger it. Print this list.

01
Mean very different from median

Strong signal of outliers or a skewed distribution. Visualize before computing anything else.

Check with: df.describe() — compare mean & 50%
02
"100% accuracy" on a real model

A perfect score on real data is too good to be true. You've likely got data leakage (a column secretly contains the answer).

Check by: examining feature importance · removing high-importance features and retesting
03
A column with mostly nulls

If >50% of a column is missing, ask: is filling it honest? Often, dropping the column entirely is the cleaner choice.

Check with: df.isnull().mean(), which gives the fraction missing per column (multiply by 100 for a percentage)
04
A categorical column with too many unique values

"City" with 4,000 unique values for 5,000 rows usually means typos, casing inconsistency, or both.

Check with: df['col'].nunique() & df['col'].value_counts()
05
Suspicious round numbers everywhere

If every salary ends in 000 or every age is a multiple of 5, the data was bucketed or guessed. Treat with appropriate skepticism.

Check with: (df['col'] % 1000 == 0).mean(), the share of values ending in 000 (use % 5 for ages); a share close to 1 is a problem
06
Impossible values

Negative ages. Heights of 12 meters. Dates before the company existed. Always sanity-check column ranges against physical reality.

Check with: df.describe() — look at min & max
07
A column you don't understand

If you can't explain in one sentence what a column represents, do not put it into a model. Ask whoever made the data first.

Check with: actual humans, not docs
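
Several of these smells (01, 03, 04, 05 and 06) can be checked in a few lines. The sketch below runs them on a tiny made-up table; the column names, the values, and the 50% and 0-to-120 cutoffs are assumptions chosen for illustration, so adjust them to your own data.

```python
import pandas as pd

# Hypothetical six-row table with one smell planted in each column.
df = pd.DataFrame({
    "age": [34, 45, -3, 29, 51, 38],
    "salary": [65_000, 72_000, 68_000, 75_000, 70_000, 10_000_000],
    "bonus": [None, None, None, 5_000, None, None],
    "city": ["Bangalore", "bangalore", "BLR", "Mumbai", "Bombay", "Delhi"],
})

# Smell 01: mean far from median (describe() reports the median as "50%").
stats = df["salary"].describe()
mean_to_median = stats["mean"] / stats["50%"]
print(f"salary mean/median ratio: {mean_to_median:.1f}")

# Smell 03: columns where more than half the values are missing.
null_fraction = df.isnull().mean()
mostly_null = null_fraction[null_fraction > 0.5].index.tolist()
print("mostly null:", mostly_null)

# Smell 04: share of distinct values in a categorical column.
unique_ratio = df["city"].nunique() / len(df)
print(f"city unique ratio: {unique_ratio:.2f}")

# Smell 05: share of salaries that end in 000.
round_share = (df["salary"] % 1000 == 0).mean()
print(f"salaries ending in 000: {round_share:.0%}")

# Smell 06: ages outside a plausible human range (assumed 0 to 120).
impossible_ages = df[(df["age"] < 0) | (df["age"] > 120)]
print(impossible_ages)
```

None of these checks proves a problem by itself. Each one is only a prompt to open the column and look.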
Part 05 · Knowledge check

Five questions on what
you just cleaned.

Aim for 4/5. Wrong answers explain themselves.

Course 2 · Module 02 complete

You just did 60% of
every real ML project
in one module.

No exaggeration. Cleaning data is the bulk of practical machine learning. You now know the five moves and have applied them on a real dataset. The "modeling" part starts in Module 03.

Up next · Course 2 · Module 03

Supervised Learning I — Regression

Your first real ML model. You'll drag points on a scatter plot and watch a regression line fit them in real time. Then train a model on real data using scikit-learn. Predictions, not just descriptions.
