Course 2 · Module 10 · 70 minutes

99% accurate doesn't mean good.

A model that predicts "no fraud" for every transaction in a dataset that's 99% non-fraud will be 99% accurate. It will also be completely useless. Evaluation, meaning choosing the right metrics for your problem, is what separates ML you can ship from ML that fools you. Many of the field's worst mistakes come from skipping this step.

You'll drag: A live threshold

You'll watch: Precision vs recall trade

You'll learn: When to use which metric


Part 01 · The accuracy trap

Three models. One misleading winner.

Scenario: fraud detection.

You have 10,000 credit card transactions. Only 1% (100 of them) are actually fraud. You build three models and report their accuracy. Which one would you ship?

Catastrophic

Model A: "Predict no"

"It's never fraud." Said about every transaction.

Accuracy 99.0%

Recall (fraud caught) 0%

Precision undefined (nothing flagged)

Fraud caught 0 / 100

99% accurate, totally useless. Catches zero fraud.

Random

Model B: Coin flip

50/50 random guess on each transaction.

Accuracy 50.0%

Recall (fraud caught) 50%

Precision 1%

Fraud caught 50 / 100

Catches about half the fraud but flags roughly 4,950 legitimate transactions.

The winner

Model C: Real classifier

Trained logistic regression on transaction features.

Accuracy 96.0%

Recall (fraud caught) 85%

Precision 19%

Fraud caught 85 / 100

Lower accuracy than A. Vastly more useful.

The lesson: Model A has the highest accuracy but catches zero fraud. Accuracy alone is misleading when classes are imbalanced. You need precision (of what I flagged, how much was real?) and recall (of all real fraud, how much did I catch?). Both matter, and they trade off, which is exactly what you're about to feel below.
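The three scorecards can be reproduced from raw confusion counts. The sketch below is illustrative only: Model B uses the expected counts of a fair coin flip, and Model C's false-positive count (362) is an assumption chosen to match the stated 19% precision, not a figure from the lesson.

```python
def metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall) from confusion counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Precision is undefined when nothing is flagged (Model A), so return None.
    precision = tp / (tp + fp) if (tp + fp) > 0 else None
    recall = tp / (tp + fn) if (tp + fn) > 0 else None
    return accuracy, precision, recall


# 10,000 transactions, 100 of them fraud.
models = {
    "A: predict no": dict(tp=0, fp=0, fn=100, tn=9900),
    # Expected counts for a fair coin flip.
    "B: coin flip": dict(tp=50, fp=4950, fn=50, tn=4950),
    # fp=362 is an assumed value consistent with 19% precision.
    "C: classifier": dict(tp=85, fp=362, fn=15, tn=9538),
}

for name, counts in models.items():
    acc, prec, rec = metrics(**counts)
    prec_text = "undefined" if prec is None else f"{prec:.1%}"
    print(f"{name:15s} accuracy={acc:.1%} precision={prec_text} recall={rec:.1%}")
```

Sorting this table by accuracy puts Model A first; sorting by recall puts it last. The same counts give opposite rankings depending on the metric you read.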

Part 02 · Hands on · The threshold lever

Drag the threshold. Watch everything change.

Below is a realistic classifier: 200 examples, each with a true label and a predicted probability. The gold line is the classification threshold (default 0.5). Drag it left or right and watch the confusion matrix shift, precision and recall trade off, and your ROC point move. The core loop of tuning a classifier, in one panel.

How to read it.

Cyan bars = actual positives (e.g. fraud). Rose bars = actual negatives. The gold line is your threshold: anything to the right of it is classified as positive. Drag the line: as it moves left, you predict positive more often (more recall, usually less precision); as it moves right, you predict positive less often (usually more precision, less recall). The ROC curve below shows every possible threshold's tradeoff.
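What the drag does can be sketched in a few lines. The ten labels and probabilities below are hypothetical toy data, not the panel's 200 examples. Treating a score exactly equal to the threshold as positive is also an assumption.

```python
def confusion(labels, probs, threshold):
    """Count (TP, FP, FN, TN), calling a score positive when it is >= threshold."""
    tp = fp = fn = tn = 0
    for y, p in zip(labels, probs):
        predicted_positive = p >= threshold
        if predicted_positive and y == 1:
            tp += 1
        elif predicted_positive and y == 0:
            fp += 1
        elif not predicted_positive and y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


# Hypothetical toy data: 1 = actual positive, 0 = actual negative.
labels = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]
probs = [0.95, 0.85, 0.70, 0.60, 0.55, 0.45, 0.40, 0.30, 0.20, 0.10]

for threshold in (0.3, 0.5, 0.8):
    tp, fp, fn, tn = confusion(labels, probs, threshold)
    precision = tp / (tp + fp) if (tp + fp) > 0 else float("nan")
    recall = tp / (tp + fn)
    print(f"threshold={threshold}: TP={tp} FP={fp} FN={fn} TN={tn} "
          f"precision={precision:.3f} recall={recall:.3f}")
```

On this toy data, raising the threshold from 0.3 to 0.5 to 0.8 moves precision from 0.625 to 0.8 to 1.0 while recall falls from 1.0 to 0.8 to 0.4.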

[Interactive panel: predicted probability distribution for 200 examples with a draggable gold threshold line, plus the ROC curve with your current point in gold and the AUC.]

Confusion matrix (rows = actual class, columns = predicted class)

                   Predicted positive                    Predicted negative
Actual positive    TP · true positives · caught          FN · false negatives · missed
Actual negative    FP · false positives · false alarm    TN · true negatives · correctly cleared

Live metrics

Accuracy: (TP+TN)/total

Precision: TP/(TP+FP)

Recall: TP/(TP+FN)

F1: 2·P·R/(P+R)
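The ROC curve and its AUC come from the same sweep: each threshold gives one (false positive rate, true positive rate) point, and AUC is the area under those points. The sketch below uses hypothetical toy data and the trapezoidal rule. The helper names `roc_points` and `auc` are illustrative and not taken from the panel's implementation.

```python
def roc_points(labels, probs):
    """Return (FPR, TPR) points, one per distinct score, strictest threshold first."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing is flagged
    for threshold in sorted(set(probs), reverse=True):
        tp = sum(1 for y, p in zip(labels, probs) if p >= threshold and y == 1)
        fp = sum(1 for y, p in zip(labels, probs) if p >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC points by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical toy data: 1 = actual positive, 0 = actual negative.
labels = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]
probs = [0.95, 0.85, 0.70, 0.60, 0.55, 0.45, 0.40, 0.30, 0.20, 0.10]

points = roc_points(labels, probs)
print(f"AUC = {auc(points):.2f}")
```

AUC summarizes ranking quality across all thresholds, so it does not change when you drag the line; only your current point on the curve moves. If scikit-learn is available, `sklearn.metrics.roc_curve` and `sklearn.metrics.roc_auc_score` cover the same ground.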

Part 03 · The four metrics

Each metric answers a different question.

There is no "best" metric. There's only the right metric for your problem. Memorize what each one is actually asking — that's the practical skill.

Big picture

Accuracy

"Of all my predictions, what fraction were right?"

(TP + TN) / total

Use when: classes are roughly balanced AND the cost of false positives ≈ cost of false negatives. Almost never in practice.

When flagging

Precision

"Of everything I flagged as positive, how much really was?"

TP / (TP + FP)

Use when: false positives are expensive. Examples: spam filter (don't trash real email), targeted ads (don't waste budget), drug recommendations.

When catching

Recall

"Of all real positives, how many did I actually catch?"

TP / (TP + FN)

Use when: false negatives are expensive. Examples: cancer screening (don't miss disease), fraud detection (don't miss fraud), security alerts.

When balanced

F1 Score

"How am I doing overall, given that I care about both precision and recall?"

2 · P · R / (P + R)

Use when: you can't decide between precision and recall (and they matter roughly equally). F1 is the harmonic mean of precision and recall, so it is pulled toward whichever of the two is lower.
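To see why the harmonic mean matters, compare it with a plain average on Model C's numbers from Part 01 (precision 19%, recall 85%). This is a small worked example; the function name `f1_score` here is a local helper, not a library call.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


precision, recall = 0.19, 0.85  # Model C from Part 01

arithmetic_mean = (precision + recall) / 2
f1 = f1_score(precision, recall)

print(f"arithmetic mean = {arithmetic_mean:.3f}")
print(f"F1              = {f1:.3f}")
```

The plain average comes out at 0.52, which looks respectable. F1 comes out near 0.31, much closer to the weak precision, which is the behavior you want from a single summary number.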

Part 04 · Real-world calls

For each problem, there's one right metric.

Three classic ML problems. Three different metrics to optimize. Get this wrong in production and the model fails — even at "99% accuracy".

🏥

// Scenario 01

Cancer screening

A blood test flags people for further testing. False positives (healthy person flagged) cause anxiety + an extra test. False negatives (cancer missed) can be fatal. Wildly asymmetric costs.

// Optimize for

Recall (above 95% if possible)

📧

// Scenario 02

Spam filter

An email gets flagged as spam. False positive (real email → spam folder) means you miss your friend's invitation. False negative (spam → inbox) means you see one more piece of junk. Way more costly to lose real mail.

// Optimize for

Precision (high, like 99%+)

🎬

// Scenario 03

Movie recommendation

The system suggests a movie. Showing a bad recommendation costs ~nothing (user scrolls past). Missing a great one costs some engagement. Roughly symmetric, slight tilt toward not annoying users.

// Optimize for

F1 or top-K precision
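One way to turn "optimize for recall" into a procedure is to fix a recall target first, then take the strictest threshold that still meets it and read off the precision you are left with. The sketch below assumes hypothetical toy data with at least one actual positive; `pick_threshold` is an illustrative helper, not a library function.

```python
def pick_threshold(labels, probs, min_recall):
    """Return (threshold, precision, recall) for the highest threshold whose
    recall is at least min_recall, or None if no threshold qualifies.
    Assumes labels contains at least one positive."""
    positives = sum(labels)
    for threshold in sorted(set(probs), reverse=True):
        tp = sum(1 for y, p in zip(labels, probs) if p >= threshold and y == 1)
        fp = sum(1 for y, p in zip(labels, probs) if p >= threshold and y == 0)
        recall = tp / positives
        if recall >= min_recall:
            # At least one score is >= threshold here, so tp + fp > 0.
            return threshold, tp / (tp + fp), recall
    return None


# Hypothetical toy data: 1 = actual positive, 0 = actual negative.
labels = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]
probs = [0.95, 0.85, 0.70, 0.60, 0.55, 0.45, 0.40, 0.30, 0.20, 0.10]

threshold, precision, recall = pick_threshold(labels, probs, min_recall=0.95)
print(f"threshold={threshold} precision={precision:.3f} recall={recall:.3f}")
```

The same pattern works in the other direction for the spam filter: fix a precision floor and take the loosest threshold that still clears it. In practice you would pick the threshold on a validation set, not on the data you report results on.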

Course 2 · Module 10 complete

You can now tell good models from lucky ones.

You felt the precision/recall tradeoff with your own fingers on the threshold. You know that "99% accurate" is often a trap. You can pick the right metric for the problem at hand. That's what separates ML practitioners from ML enthusiasts — and what saves products from quietly failing in production.

Up next · Course 2 · Module 11

Deploy It! — From Notebook to Production

Models in notebooks don't help anyone. Module 11 covers the bridge to production: wrapping your model in a FastAPI service, deploying to Hugging Face Spaces or Streamlit Cloud, and the basics of monitoring once it's live.

