Textbook datasets are spotless. Real ones are full of missing values, typos, duplicates, and outliers that lie to your face. In the next 80 minutes, you'll clean a realistically messy dataset using the five moves every ML practitioner uses on day one.
There's a reason ML practitioners get bored cleaning data: it's always the same five moves. Memorize this list. Then apply it in the explorer below.
1. Look first. Shape, types, sample rows. What's actually in front of you? Most beginners skip this and pay later.
2. Find missing values. Which columns have nulls? How many? Where? Decide: drop, fill, or flag.
3. Spot outliers. These are numbers way outside the normal range. Usually they are data entry errors, sometimes real anomalies. Both matter.
4. Hunt duplicates, exact or fuzzy. Even one duplicated row can poison your evaluation, because the same record can land in both the training and the test split. Find them, decide what to do.
5. Standardize categories. "bangalore", "Bangalore", "BLR", "Bengaluru": your model sees four different cities. Fix it.
The explorer below contains a 200-row employee dataset with realistic problems baked in. Click through each investigation. See the issues. Apply the fix. Watch the dataset get cleaner.
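For readers who want to see the five moves as code, here is a minimal pandas sketch. It does not use the explorer's 200-row dataset: the `raw` table, its column names, the 16 to 100 age range, and the `aliases` mapping are all assumptions invented for illustration, and `five_moves` is a hypothetical helper rather than anything from the course.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data: every name, column, and value here is invented.
raw = pd.DataFrame({
    "name":   ["Asha", "Ben", "Ben", "Chen", "Dita", "Eli"],
    "city":   ["bangalore", "Bangalore", "Bangalore", "BLR", "Pune", "pune "],
    "age":    [29, 41, 41, np.nan, 35, 350],
    "salary": [52000, 61000, 61000, 58000, np.nan, 60000],
})


def five_moves(df: pd.DataFrame) -> pd.DataFrame:
    """Apply one possible version of each move and return a cleaned copy."""
    out = df.copy()

    # Move 1: look first (shape, dtypes, sample rows).
    print(out.shape)
    print(out.dtypes)
    print(out.head())

    # Move 2: count nulls per column, then flag and fill the salary column.
    print(out.isnull().sum())
    out["salary_missing"] = out["salary"].isnull()
    out["salary"] = out["salary"].fillna(out["salary"].median())

    # Move 3: treat ages outside an assumed plausible range as entry errors.
    suspect = out["age"].notna() & ~out["age"].between(16, 100)
    out["age_suspect"] = suspect
    out.loc[suspect, "age"] = np.nan

    # Move 4: drop exact duplicate rows (fuzzy matching is not covered here).
    out = out.drop_duplicates().reset_index(drop=True)

    # Move 5: normalize casing and whitespace, then map known aliases.
    aliases = {"blr": "bangalore"}  # assumed alias table
    out["city"] = (
        out["city"].str.strip().str.lower().replace(aliases).str.title()
    )
    return out


clean = five_moves(raw)
print(clean)
```

The sketch follows the order of the list. In practice you might prefer to drop duplicates before computing the median used for filling, so that repeated rows do not skew it. Flagging with `salary_missing` and `age_suspect` keeps a record of what was changed, which is one way to honor the "drop, fill, or flag" decision.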
Here's what changed. This exact workflow is what data scientists do at the start of every project.
Senior data scientists develop a sixth sense for when data is off. Here are the seven smells that should trigger it. Print this list.
A mean far from the median is a strong signal of outliers or a skewed distribution. Visualize before computing anything else.
Check with: df.describe() and compare mean with 50% (the median)
Your model's score looks too good to be true, and it probably is. You've likely got data leakage (a column secretly contains the answer).
Check by: examining feature importance, then removing high-importance features and retesting
If >50% of a column is missing, ask: is filling it honest? Often, dropping the column entirely is the cleaner choice.
Check with: df.isnull().mean(), which gives the fraction missing per column
"City" with 4,000 unique values for 5,000 rows usually means typos, casing inconsistency, or both.
Check with: df['col'].nunique() and df['col'].value_counts()
If every salary ends in 000 or every age is a multiple of 5, the data was bucketed or guessed. Treat with appropriate skepticism.
Check with: (df['col'] % 100 == 0).mean(), where a share close to 1.0 is a problem
Negative ages. Heights of 12 meters. Dates before the company existed. Always sanity-check column ranges against physical reality.
Check with: df.describe() and look at min and max
If you can't explain in one sentence what a column represents, do not put it into a model. Ask whoever made the data first.
Check with: actual humans, not docs
Aim for 4/5. Wrong answers explain themselves.
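The pandas checks listed for the smells above can be gathered into one helper. This is a rough sketch, not a standard API: `smell_report`, its `valid_ranges` and `round_to` parameters, and the sample table are all hypothetical names and values chosen for illustration. It covers only the smells that reduce to a one-line pandas check. Leakage and unexplained columns still need a model experiment and a conversation, respectively.

```python
import numpy as np
import pandas as pd


def smell_report(df, valid_ranges=None, round_to=100):
    """Collect simple data-smell checks into a dict (hypothetical helper).

    valid_ranges maps a column name to an assumed (low, high) plausible range.
    """
    valid_ranges = valid_ranges or {}
    numeric = df.select_dtypes(include="number")
    stats = numeric.describe()
    return {
        # Mean and median side by side, one row per numeric column.
        "mean_vs_median": stats.loc[["mean", "50%"]].T,
        # Fraction of missing values per column.
        "missing_fraction": df.isnull().mean(),
        # Unique values divided by row count, for non-numeric columns.
        "unique_ratio": df.select_dtypes(exclude="number").nunique() / len(df),
        # Share of values that are exact multiples of round_to.
        "round_share": (numeric % round_to == 0).mean(),
        # Count of non-null values outside the supplied plausible range.
        "out_of_range": {
            col: int((df[col].notna() & ~df[col].between(low, high)).sum())
            for col, (low, high) in valid_ranges.items()
        },
    }


# Invented sample data, only to show the shape of the output.
sample = pd.DataFrame({
    "city":   ["bangalore", "Bangalore", "BLR", "Pune"],
    "age":    [30, 35, -4, 40],
    "salary": [50000, 60000, 70000, np.nan],
    "bonus":  [np.nan, np.nan, np.nan, 1200.0],
})

report = smell_report(sample, valid_ranges={"age": (0, 120)})
for name, result in report.items():
    print(name)
    print(result)
```

Returning a dict of raw numbers rather than pass/fail verdicts is a deliberate choice here: thresholds such as "more than 50% missing" depend on the dataset, so the judgment is left to the reader.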
No exaggeration. Cleaning data is the bulk of practical machine learning. You now know the five moves and have applied them to a realistically messy dataset. The "modeling" part starts in Module 03.