AI Skill Course · Course 2 · Intermediate
Module 10 of 12 · 70 minutes

99% accurate
doesn't mean
good.

A model that predicts "no fraud" for every transaction in a dataset that's 99% non-fraud will be 99% accurate. It will also be completely useless. Evaluation, meaning choosing the right metrics for your problem, is what separates ML you can ship from ML that fools you. Many of the field's worst mistakes come from skipping this step.

You'll drag: a live threshold
You'll watch: the precision vs recall trade-off
You'll learn: when to use which metric
The Confusion Matrix (four cells · everything starts here)

             predicted +   predicted −
actual +     TP 42         FN 8
actual −     FP 15         TN 135

Part 01 · The accuracy trap

Three models.
One misleading winner.

Scenario: fraud detection.

You have 10,000 credit card transactions. Only 1% (100 of them) are actually fraud. You build three models and report their accuracy. Which one would you ship?

Catastrophic

Model A: "Predict no"

"It's never fraud." Said about every transaction.

Accuracy 99.0%
Recall (fraud caught) 0%
Precision undefined (nothing flagged, so TP + FP = 0)
Fraud caught 0 / 100
99% accurate, totally useless. Catches zero fraud.
Random

Model B: Coin flip

50/50 random guess on each transaction.

Accuracy 50.0%
Recall (fraud caught) 50%
Precision 1%
Fraud caught 50 / 100
Catches half the fraud but flags about 4,950 legitimate transactions.
The winner

Model C: Real classifier

Trained logistic regression on transaction features.

Accuracy 96.0%
Recall (fraud caught) 85%
Precision 19%
Fraud caught 85 / 100
Lower accuracy than A. Vastly more useful.

The lesson: Model A has the highest accuracy but catches zero fraud. Accuracy alone is meaningless when classes are imbalanced. You need precision (of what I flagged, how much was real?) and recall (of all real fraud, how much did I catch?). Both matter, and they trade off — which is exactly what you're about to feel below.
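The three scorecards can be reproduced from raw counts. The sketch below is illustrative only: the function name `summarize` is made up for this example, Model B uses the expected counts of a fair coin, and Model C's false-positive count (385) is an assumption, since the text does not state it. That value reproduces the 96.0% accuracy and gives a precision of about 18%, close to the 19% quoted.

```python
def summarize(tp, fp, fn, tn):
    """Return (accuracy, precision, recall); a ratio is None when undefined."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) > 0 else None  # nothing flagged
    recall = tp / (tp + fn) if (tp + fn) > 0 else None      # no real positives
    return accuracy, precision, recall


# 10,000 transactions, 100 of them fraud (from the scenario above).
models = {
    "A: predict no": dict(tp=0, fp=0, fn=100, tn=9900),
    # Expected counts for a fair coin; any single run would vary slightly.
    "B: coin flip": dict(tp=50, fp=4950, fn=50, tn=4950),
    # ASSUMPTION: fp=385 is not given in the text; chosen to match 96.0% accuracy.
    "C: real classifier": dict(tp=85, fp=385, fn=15, tn=9515),
}

for name, counts in models.items():
    accuracy, precision, recall = summarize(**counts)
    precision_text = "undefined" if precision is None else f"{precision:.1%}"
    print(f"{name}: accuracy={accuracy:.1%} "
          f"precision={precision_text} recall={recall:.1%}")
```

Model A's precision comes out as undefined rather than 0% because it never flags anything, so there is nothing to divide by.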

Part 02 · Hands on · The threshold lever

Drag the threshold.
Watch everything change.

Below is a realistic classifier: 200 examples, each with a true label and a predicted probability. The gold line is the classification threshold (default 0.5). Drag it left or right — watch the confusion matrix shift, watch precision and recall trade off, watch your ROC point move. The full lifecycle of "tuning a classifier" in one panel.

How to read it.

Cyan bars = actual positives (e.g. fraud). Rose bars = actual negatives. The gold line is your threshold — anything to the right of it is classified as positive. Drag the line: as it moves left, you predict positive more often (more recall, less precision); as it moves right, you predict positive less often (more precision, less recall). The ROC curve below shows every possible threshold's tradeoff.
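What the threshold line does can be sketched in a few lines of code. This is a minimal illustration, not the panel's actual implementation: the ten labels and probabilities are hypothetical toy data, the name `confusion_at` is invented here, and treating a score exactly equal to the threshold as positive (`>=`) is an assumption.

```python
def confusion_at(threshold, y_true, y_prob):
    """Count (TP, FP, FN, TN) when scores >= threshold are classified positive."""
    tp = fp = fn = tn = 0
    for label, prob in zip(y_true, y_prob):
        predicted_positive = prob >= threshold  # ASSUMPTION: ties count as positive
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive and label == 0:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


# Hypothetical toy data: 5 actual positives (1) and 5 actual negatives (0).
y_true = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]
y_prob = [0.95, 0.85, 0.70, 0.55, 0.60, 0.35, 0.40, 0.30, 0.20, 0.05]

for threshold in (0.3, 0.5, 0.7):
    tp, fp, fn, tn = confusion_at(threshold, y_true, y_prob)
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    print(f"threshold={threshold:.1f} TP={tp} FP={fp} FN={fn} TN={tn} "
          f"precision={precision:.3f} recall={recall:.3f}")
```

On this toy data, raising the threshold from 0.3 to 0.5 to 0.7 moves precision from 0.625 to 0.8 to 1.0 while recall falls from 1.0 to 0.8 to 0.6, which is the same trade described above.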

[Chart: Predicted probability distribution · 200 examples (60 actual positive, 140 actual negative) · x-axis: predicted probability, y-axis: count · draggable threshold line, default 0.50]
[Chart: ROC curve · x-axis: false positive rate, y-axis: true positive rate, both from 0 to 1 · current threshold point in gold · AUC shown live]
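An ROC curve is just the (false positive rate, true positive rate) pair at every threshold, and AUC is the area under those points. The sketch below shows one simple way to compute both. It assumes hypothetical toy data, invented helper names (`roc_points`, `auc`), at least one positive and one negative example, and the trapezoidal rule for the area; it is not necessarily how the panel computes its numbers.

```python
def roc_points(y_true, y_prob):
    """Return (FPR, TPR) pairs from (0, 0) to (1, 1), one per distinct score."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing flagged
    # Lower the threshold one distinct score at a time, highest first.
    for threshold in sorted(set(y_prob), reverse=True):
        tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
        fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC points using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical toy data: 5 actual positives (1) and 5 actual negatives (0).
y_true = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]
y_prob = [0.95, 0.85, 0.70, 0.55, 0.60, 0.35, 0.40, 0.30, 0.20, 0.05]

curve = roc_points(y_true, y_prob)
print(f"AUC = {auc(curve):.2f}")
```

For this toy data the area is 0.88, which matches the other common reading of AUC: in 22 of the 25 positive/negative pairs, the positive example received the higher score.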
Confusion matrix

             predicted +                          predicted −
actual +     TP · true positives · caught         FN · false negatives · missed
actual −     FP · false positives · false alarm   TN · true negatives · correctly cleared
Live metrics
Accuracy = (TP + TN) / total
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 · P · R / (P + R)
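The four formulas translate directly into code. The sketch below applies them to the example matrix from the top of this module (TP 42, FN 8, FP 15, TN 135). The function names are hypothetical, and returning `None` for an undefined ratio is a design choice made for this example rather than a standard convention.

```python
def safe_div(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def classification_metrics(tp, fn, fp, tn):
    """Compute accuracy, precision, recall and F1 from the four matrix cells."""
    accuracy = safe_div(tp + tn, tp + fn + fp + tn)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    if precision is None or recall is None or precision + recall == 0:
        f1 = None  # F1 is undefined when either input is undefined or both are 0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Cells from the example confusion matrix at the top of the module.
result = classification_metrics(tp=42, fn=8, fp=15, tn=135)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

For those cells, accuracy is 0.885, precision is about 0.737, recall is 0.84, and F1 is about 0.785.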
Part 03 · The four metrics

Each metric answers
a different question.

There is no "best" metric. There's only the right metric for your problem. Memorize what each one is actually asking — that's the practical skill.

Big picture

Accuracy

"Of all my predictions, what fraction were right?"

(TP + TN) / total
Use when: classes are roughly balanced and a false positive costs about as much as a false negative. Both conditions rarely hold together in practice.
When flagging

Precision

"Of everything I flagged as positive, how much really was?"

TP / (TP + FP)
Use when: false positives are expensive. Examples: spam filter (don't trash real email), targeted ads (don't waste budget), drug recommendations.
When catching

Recall

"Of all real positives, how many did I actually catch?"

TP / (TP + FN)
Use when: false negatives are expensive. Examples: cancer screening (don't miss disease), fraud detection (don't miss fraud), security alerts.
When balanced

F1 Score

"How am I doing overall, given that I care about both precision and recall?"

2 · P · R / (P + R)
Use when: precision and recall matter roughly equally and you want a single number. F1 is the harmonic mean of the two, so a low value on either one drags the score down sharply.
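To see why the harmonic mean matters, compare F1 with a plain average on the precision and recall figures quoted for Models B and C in Part 01. This is a small illustrative sketch; the function names are made up for this example, and the balanced pair (0.85, 0.85) is a hypothetical reference point.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def arithmetic_mean(precision, recall):
    """Plain average, shown only for comparison."""
    return (precision + recall) / 2


examples = {
    "balanced (hypothetical)": (0.85, 0.85),
    "Model C": (0.19, 0.85),
    "Model B": (0.01, 0.50),
}

for name, (precision, recall) in examples.items():
    print(f"{name}: F1={f1_score(precision, recall):.3f} "
          f"average={arithmetic_mean(precision, recall):.3f}")
```

For Model C the plain average is 0.52, but F1 is only about 0.31, because the 19% precision pulls the harmonic mean toward the weaker number.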
Part 04 · Real-world calls

For each problem,
there's one right metric.

Three classic ML problems. Three different metrics to optimize. Get this wrong in production and the model fails — even at "99% accuracy".

🏥
// Scenario 01

Cancer screening

A blood test flags people for further testing. A false positive (healthy person flagged) causes anxiety and an extra test. A false negative (cancer missed) can be fatal. The costs are wildly asymmetric.

// Optimize for
Recall (above 95% if possible)
📧
// Scenario 02

Spam filter

An email gets flagged as spam. A false positive (real email → spam folder) means you miss your friend's invitation. A false negative (spam → inbox) means you see one more piece of junk. Losing real mail is far more costly.

// Optimize for
Precision (high, like 99%+)
🎬
// Scenario 03

Movie recommendation

The system suggests a movie. Showing a bad recommendation costs almost nothing (the user scrolls past). Missing a great one costs some engagement. The costs are roughly symmetric, with a slight tilt toward not annoying users.

// Optimize for
F1 or top-K precision
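In practice, "optimize for recall" or "optimize for precision" often becomes "pick the threshold that meets a minimum on one metric and is best on the other". The sketch below shows one way to do that. Everything in it is an assumption for illustration: the toy data, the function names, and the targets (0.95 recall for a screening-style problem, 0.99 precision for a spam-style problem). A real project would choose the threshold on held-out validation data.

```python
def precision_recall_at(threshold, y_true, y_prob):
    """Return (precision, recall) when scores >= threshold are flagged positive."""
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    fn = sum(1 for y, p in zip(y_true, y_prob) if p < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def best_threshold(y_true, y_prob, constrain, minimum):
    """Among thresholds where the constrained metric >= minimum, maximise the other.

    `constrain` is "recall" or "precision". Returns (threshold, precision, recall),
    or None if no candidate threshold meets the minimum.
    """
    best = None
    for threshold in sorted(set(y_prob)):  # every distinct score is a candidate
        precision, recall = precision_recall_at(threshold, y_true, y_prob)
        if constrain == "recall":
            constrained, other = recall, precision
        else:
            constrained, other = precision, recall
        if constrained >= minimum and (best is None or other > best[0]):
            best = (other, threshold, precision, recall)
    return None if best is None else best[1:]


# Hypothetical toy data: 5 actual positives (1) and 5 actual negatives (0).
y_true = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]
y_prob = [0.95, 0.85, 0.70, 0.55, 0.60, 0.35, 0.40, 0.30, 0.20, 0.05]

screening = best_threshold(y_true, y_prob, constrain="recall", minimum=0.95)
spam = best_threshold(y_true, y_prob, constrain="precision", minimum=0.99)
print("recall-first choice (threshold, precision, recall):", screening)
print("precision-first choice (threshold, precision, recall):", spam)
```

On this toy data the recall-first rule settles on a low threshold (0.35) and accepts some false alarms, while the precision-first rule settles on a high threshold (0.70) and accepts missed positives: the same model, two different operating points.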
Part 05 · Knowledge check

Five questions on what
you just measured.

Aim for 4/5. Wrong answers explain themselves.

Course 2 · Module 10 complete

You can now tell
good models from lucky ones.

You felt the precision/recall tradeoff with your own fingers on the threshold. You know that "99% accurate" is often a trap. You can pick the right metric for the problem at hand. That's what separates ML practitioners from ML enthusiasts — and what saves products from quietly failing in production.

Up next · Course 2 · Module 11

Deploy It! — From Notebook to Production

Models in notebooks don't help anyone. Module 11 covers the bridge to production: wrapping your model in a FastAPI service, deploying to Hugging Face Spaces or Streamlit Cloud, and the basics of monitoring once it's live.
