A model that predicts "no fraud" for every transaction in a dataset that's 99% non-fraud will be 99% accurate. It will also be completely useless. Evaluation — choosing the right metrics for your problem — is what separates ML you can ship from ML that fools you. Most of the field's worst mistakes come from skipping this module.
See the trap
You have 10,000 credit card transactions. Only 1% (100 of them) are actually fraud. You build three models and report their accuracy. Which one would you ship?
Model A: "It's never fraud." Said about every transaction.
Model B: 50/50 random guess on each transaction.
Model C: Trained logistic regression on transaction features.
The lesson: Model A has the highest accuracy but catches zero fraud. Accuracy alone is meaningless when classes are imbalanced. You need precision (of what I flagged, how much was real?) and recall (of all real fraud, how much did I catch?). Both matter, and they trade off — which is exactly what you're about to feel below.
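To make the trap concrete, here is a minimal sketch that works out the numbers for the "never fraud" model and the coin-flip model on the 10,000-transaction dataset described above. The helper name `metrics` is made up for illustration, the coin-flip counts are expected values rather than one random run, and precision is reported as 0.0 when nothing is flagged (a common convention, not the only one). The logistic regression is left out because its predictions are not given here.

```python
def metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Convention (an assumption): report 0.0 when the denominator is empty.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


N, FRAUD = 10_000, 100   # dataset from the text: 1% fraud
LEGIT = N - FRAUD        # 9,900 legitimate transactions

# "It's never fraud": nothing is flagged, so every fraud case is missed.
never_fraud = metrics(tp=0, fp=0, fn=FRAUD, tn=LEGIT)

# Coin flip: on average, half of each class gets flagged.
coin_flip = metrics(tp=FRAUD // 2, fp=LEGIT // 2, fn=FRAUD // 2, tn=LEGIT // 2)

for name, (acc, prec, rec) in [("never fraud", never_fraud), ("coin flip", coin_flip)]:
    print(f"{name:12s} accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f}")
```

The "never fraud" model scores 0.99 accuracy with 0.00 recall. The coin flip scores only 0.50 accuracy but catches half the fraud, at a precision of 0.01. Neither number alone tells you which model to ship.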
Below is a realistic classifier: 200 examples, each with a true label and a predicted probability. The gold line is the classification threshold (default 0.5). Drag it left or right — watch the confusion matrix shift, watch precision and recall trade off, watch your ROC point move. The full lifecycle of "tuning a classifier" in one panel.
Cyan bars = actual positives (e.g. fraud). Rose bars = actual negatives. The gold line is your threshold — anything to the right of it is classified as positive. Drag the line: as it moves left, you predict positive more often (more recall, less precision); as it moves right, you predict positive less often (more precision, less recall). The ROC curve below shows every possible threshold's tradeoff.
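If you want to reproduce the threshold experiment offline, the sketch below does the same thing on a tiny hand-made dataset. The 12 labels and scores are invented for illustration (they are not the 200 examples in the panel), and the function names are hypothetical. A score at or above the threshold counts as a positive prediction.

```python
# Toy data (an assumption for illustration): 1 = actual positive, 0 = actual negative.
LABELS = [1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
SCORES = [0.95, 0.85, 0.75, 0.70, 0.60, 0.55, 0.45, 0.40, 0.30, 0.25, 0.15, 0.05]


def confusion_at(threshold, labels, scores):
    """Count (tp, fp, fn, tn) when scores >= threshold are predicted positive."""
    tp = fp = fn = tn = 0
    for y, s in zip(labels, scores):
        predicted_positive = s >= threshold
        if predicted_positive and y == 1:
            tp += 1
        elif predicted_positive and y == 0:
            fp += 1
        elif y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def summary_at(threshold, labels, scores):
    """Return precision, recall, and the ROC point (fpr, tpr) at one threshold."""
    tp, fp, fn, tn = confusion_at(threshold, labels, scores)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0   # recall is also the TPR
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return {"precision": precision, "recall": recall, "roc_point": (fpr, recall)}


for t in (0.2, 0.5, 0.8):
    print(t, confusion_at(t, LABELS, SCORES), summary_at(t, LABELS, SCORES))
```

On this toy data, a threshold of 0.2 gives counts (6, 4, 0, 2), 0.5 gives (5, 1, 1, 5), and 0.8 gives (2, 0, 4, 6). Moving the threshold left raises recall and lowers precision, and moving it right does the opposite. Collecting the `roc_point` for every threshold traces out the ROC curve.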
There is no "best" metric. There's only the right metric for your problem. Memorize what each one is actually asking — that's the practical skill.
"Of all my predictions, what fraction were right?"
"Of everything I flagged as positive, how much really was?"
"Of all real positives, how many did I actually catch?"
"How am I doing overall, given that I care about both precision and recall?"
Three classic ML problems. Three different metrics to optimize. Get this wrong in production and the model fails — even at "99% accuracy".
A blood test flags people for further testing. False positives (healthy person flagged) cause anxiety and an extra test. False negatives (cancer missed) can be fatal. The costs are wildly asymmetric.
An email gets flagged as spam. False positive (real email → spam folder) means you miss your friend's invitation. False negative (spam → inbox) means you see one more piece of junk. Way more costly to lose real mail.
The system suggests a movie. Showing a bad recommendation costs ~nothing (user scrolls past). Missing a great one costs some engagement. Roughly symmetric, slight tilt toward not annoying users.
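One way to turn the three scenarios above into a decision is to put a rough price on each kind of error and pick the threshold with the lowest total cost. The sketch below does this on an invented 12-example dataset. Every number in it is an assumption for illustration: the scores, the candidate thresholds, and especially the cost ratios, which in practice come from the business or clinical context rather than from the model.

```python
# Toy data (an assumption): 1 = actual positive, 0 = actual negative.
LABELS = [1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
SCORES = [0.95, 0.85, 0.75, 0.70, 0.60, 0.55, 0.45, 0.40, 0.30, 0.25, 0.15, 0.05]
CANDIDATES = [0.2, 0.5, 0.8]   # hypothetical thresholds to compare


def errors_at(threshold):
    """Return (false positives, false negatives) at the given threshold."""
    fp = sum(1 for y, s in zip(LABELS, SCORES) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(LABELS, SCORES) if s < threshold and y == 1)
    return fp, fn


def total_cost(threshold, cost_fp, cost_fn):
    """Weighted error count: each false positive and false negative has a price."""
    fp, fn = errors_at(threshold)
    return cost_fp * fp + cost_fn * fn


def best_threshold(cost_fp, cost_fn):
    """Pick the candidate threshold with the lowest total cost."""
    return min(CANDIDATES, key=lambda t: total_cost(t, cost_fp, cost_fn))


# Made-up (cost_fp, cost_fn) pairs that mirror the asymmetries described above.
SCENARIOS = {
    "cancer screening": (1.0, 50.0),   # a missed case is far worse than a false alarm
    "spam filter": (20.0, 1.0),        # losing real mail is far worse than seeing junk
    "movie recommender": (1.2, 1.0),   # roughly symmetric, slight tilt against bad picks
}

for name, (cost_fp, cost_fn) in SCENARIOS.items():
    print(f"{name:18s} -> threshold {best_threshold(cost_fp, cost_fn)}")
```

With these assumed costs, the screening scenario picks the lowest threshold (0.2, favoring recall), the spam scenario picks the highest (0.8, favoring precision), and the recommender lands in the middle (0.5). The same classifier gets three different operating points because the cost of each mistake differs.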
You felt the precision/recall tradeoff with your own fingers on the threshold. You know that "99% accurate" is often a trap. You can pick the right metric for the problem at hand. That's what separates ML practitioners from ML enthusiasts — and what saves products from quietly failing in production.