Until now, your models took numbers in. Computer vision takes pixels — millions of them — and turns them into meaning. "There's a cat." "That's a stop sign." "Tumor detected." This module loads a real pre-trained model in your browser and runs it on your webcam. Live.
See the machine see
Until 2012, computers couldn't reliably tell cats from dogs in photos. The problem isn't the pixels themselves; it's what they depict. Here are three reasons vision is harder than predicting from numbers.
A modest 224×224 RGB photo has 150,528 numbers. Even a small neural network needs to figure out how those interact — and how to ignore lighting, angle, background.
A patch of pixels that looks like an "ear" means more when it sits near a "head" than near a "foot". Position and spatial relationships are everything. Tabular ML treats its columns as unordered features and ignores this; vision can't.
The same cat looks different from any angle, in any lighting, at any distance, partly occluded by a chair. A robust model must learn that "cat-ness" survives all of these.
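To make the first reason concrete, the input size of the 224×224 RGB photo mentioned above can be checked with a few lines of arithmetic. This is only a sketch; the 1000-unit dense layer is a hypothetical comparison and not something this module's model is known to use.

```python
# Raw input size of one image: height x width x colour channels.
height, width, channels = 224, 224, 3  # RGB has three channels
values_per_image = height * width * channels
print(values_per_image)  # 150528

# Hypothetical comparison: one fully connected layer with 1000 units
# would need a separate weight for every (input value, unit) pair.
dense_units = 1000
dense_weights = values_per_image * dense_units
print(dense_weights)  # 150528000
```

Connecting every input value to every unit gets expensive quickly, which is one motivation for the small shared filters described later in this module.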
We're loading MobileNet — a 14MB pre-trained convolutional network that can recognize 1000 categories. Once loaded, point your webcam at things (or upload a photo) and watch the predictions update in real time. No server. Everything runs locally.
Wait for the model to load (~14MB download, one-time). Then choose Webcam or Upload. Point at your face, a pet, a coffee mug, or a houseplant and see what the model thinks it is. Predictions show the top 5 categories with confidence percentages. Note: MobileNet was designed by Google to run on phones and embedded devices, and compact models of this kind are common in on-device camera features.
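The "top-5 with confidence percentages" display can be sketched as follows, assuming the network's last layer emits one raw score per category and that a softmax turns those scores into probabilities. The labels and scores here are made up for illustration (the real model has 1000 categories), and `top_k` is a hypothetical helper name, not part of any library.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(labels, scores, k=5):
    """Return the k most probable (label, percent) pairs, highest first."""
    probs = softmax(scores)
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return [(label, round(100 * p, 1)) for label, p in ranked[:k]]

# Hypothetical labels and raw scores for a single frame.
labels = ["coffee mug", "cup", "teapot", "pitcher", "espresso", "vase", "bowl"]
scores = [6.1, 4.3, 2.0, 1.2, 3.7, 0.4, 0.9]

for label, percent in top_k(labels, scores):
    print(f"{label}: {percent}%")
```

Because softmax spreads 100% across all categories, a high percentage means "most likely among the categories the model knows", not "certainly correct".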
A Convolutional Neural Network is just four ideas: convolution, activation, pooling, and stacking them many times. Once you understand them, ResNet, MobileNet, EfficientNet, and YOLO are all variations on this theme.
A tiny filter (3×3) slides across the image. At each position, it detects whether a specific pattern (edge, color blob, texture) is present.
Each filter's output is passed through a non-linearity (usually ReLU: max(0, x)). This is what makes the network able to learn complex patterns instead of just linear ones.
Pooling shrinks the image. It keeps the strongest signals (max-pooling) and throws away precise pixel position, which builds invariance to small shifts and reduces compute load.
Repeat conv → activate → pool many times. Early layers detect edges; deeper layers detect "wheel" or "eye"; final layers classify ("car" or "face").
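The activate and pool steps above can be sketched in plain Python. This is a minimal illustration, assuming a small 4×4 grid of made-up numbers stands in for the output of one convolution filter; real networks run the same operations on much larger arrays with optimized libraries.

```python
def relu(feature_map):
    """Replace every negative value with 0, i.e. max(0, x)."""
    return [[max(0, v) for v in row] for row in feature_map]

def max_pool_2x2(feature_map):
    """Keep the largest value in each non-overlapping 2x2 block."""
    pooled = []
    for r in range(0, len(feature_map) - 1, 2):
        row = []
        for c in range(0, len(feature_map[0]) - 1, 2):
            block = [
                feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1],
            ]
            row.append(max(block))
        pooled.append(row)
    return pooled

# Hypothetical 4x4 output of a single convolution filter.
conv_output = [
    [ 3, -1,  0,  2],
    [-2,  5, -4,  1],
    [ 0, -3,  7, -6],
    [ 1,  2, -1,  4],
]

activated = relu(conv_output)
pooled = max_pool_2x2(activated)
print(pooled)  # [[5, 2], [2, 7]]
```

The 4×4 grid becomes 2×2: each output value says "the strongest response somewhere in this 2×2 region", which is why small shifts in the input change the result very little.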
CNNs apply small filters (kernels) to the image. Here's what such filters do when applied to a synthetic image you can swap. Click each filter to see how a 3×3 kernel changes the picture. Edge detection is typical of what a CNN learns to do in its first layer.
Each "filter" below is a 3×3 grid of numbers. To apply it, slide the grid across every pixel of the image, multiply each covered pixel by the grid number on top of it, and sum the nine products. The result is a new image that highlights whatever pattern that filter is tuned for. Real CNNs learn the values of many thousands of such filters automatically during training.
Sharp transitions in pixel brightness (edges of objects) become bright; smooth areas become black. First layers of trained CNNs typically learn filters much like this one.
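The slide, multiply, and sum procedure can be written out directly. The sketch below assumes no padding (so the output is two pixels smaller in each direction) and uses a common hand-written Laplacian-style edge kernel; the demo's actual kernel values are not given in this module, so these are illustrative.

```python
def convolve_3x3(image, kernel):
    """Slide a 3x3 kernel over a 2-D image, multiplying and summing.

    No padding and no kernel flip, which is how deep-learning
    libraries usually define "convolution".
    """
    rows, cols = len(image), len(image[0])
    output = []
    for r in range(rows - 2):
        out_row = []
        for c in range(cols - 2):
            total = 0
            for i in range(3):
                for j in range(3):
                    total += image[r + i][c + j] * kernel[i][j]
            out_row.append(total)
        output.append(out_row)
    return output

# Synthetic grayscale image: dark left half (0), bright right half (9).
image = [[0, 0, 0, 9, 9, 9] for _ in range(5)]

# Illustrative edge kernel; its values sum to 0, so flat areas give 0.
edge_kernel = [
    [-1, -1, -1],
    [-1,  8, -1],
    [-1, -1, -1],
]

for row in convolve_3x3(image, edge_kernel):
    print(row)  # each row: [0, -27, 27, 0]
```

The output is zero wherever the image is flat and non-zero only where dark meets bright. To display it as a picture, the values would still need to be mapped into the visible brightness range, for example by taking absolute values or clamping.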
A single filter detects one feature type. A CNN layer applies dozens or hundreds of filters in parallel — so each pixel in the output represents "how much of each pattern is at this location."
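A rough way to see what "dozens or hundreds of filters in parallel" means is to compute the output shape and the number of learned values for one layer. The sketch assumes stride 1 and padding that preserves height and width, and `conv_layer_summary` is a hypothetical helper written for this example; the choice of 32 filters is illustrative.

```python
def conv_layer_summary(height, width, in_channels, num_filters, kernel_size=3):
    """Output shape and learnable parameter count for one conv layer.

    Assumes stride 1 and padding that keeps height and width unchanged.
    """
    # Each filter spans every input channel, e.g. 3 x 3 x 3 for RGB.
    weights_per_filter = kernel_size * kernel_size * in_channels
    params = num_filters * (weights_per_filter + 1)  # +1 bias per filter
    output_shape = (height, width, num_filters)
    return output_shape, params

shape, params = conv_layer_summary(224, 224, in_channels=3, num_filters=32)
print(shape)   # (224, 224, 32)
print(params)  # 896
```

Every location in the output now holds 32 numbers, one per filter, yet the layer learns fewer than a thousand values because each filter is reused at every position.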
Stack layers, and the patterns get more abstract. Layer 1: edges. Layer 5: textures. Layer 15: "this is a wheel". Layer 25: "this is a car."
Training a CNN from zero can require thousands of GPU-hours and millions of labeled images. Almost nobody does this for a new task. Instead, take a pre-trained model (like MobileNet, trained on about 1.28 million labeled ImageNet images) and fine-tune it on your specific task. This is "transfer learning", the most important practical skill in modern CV.
A pre-trained model like MobileNet already knows what edges, textures, shapes, eyes, wheels, and fur look like, because it was trained on about 1.28 million labeled ImageNet photos spanning 1000 categories. To make it recognize your 5 specific products, you don't re-teach it everything. You just replace the last layer (the "classifier") with one for your 5 classes, and fine-tune it on a few hundred of your own labeled images. That's transfer learning. It's how most "AI for retail" / "AI for medical imaging" / "AI for security" products get built today.
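The "replace the last layer" idea can be sketched in miniature. Everything here is a toy under stated assumptions: `BACKBONE` is a hypothetical stand-in for a frozen pre-trained network (a real one would be MobileNet without its classifier), the 4-value "images" and 3 classes are invented, and only the new classifier layer is trained.

```python
import math

# Hypothetical frozen "backbone": fixed weights that are never updated.
BACKBONE = [
    [1.0, -1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, -1.0],
    [1.0, 1.0, -1.0, -1.0],
]

def extract_features(pixels):
    """Frozen feature extractor: a linear map followed by ReLU."""
    return [max(0.0, sum(w * p for w, p in zip(row, pixels))) for row in BACKBONE]

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def head_scores(x, weights, biases):
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]

def train_head(features, labels, num_classes, epochs=200, lr=0.5):
    """Train only the new classifier with softmax cross-entropy and SGD."""
    weights = [[0.0] * len(features[0]) for _ in range(num_classes)]
    biases = [0.0] * num_classes
    for _ in range(epochs):
        for x, y in zip(features, labels):
            probs = softmax(head_scores(x, weights, biases))
            for k in range(num_classes):
                grad = probs[k] - (1.0 if k == y else 0.0)
                biases[k] -= lr * grad
                for j, value in enumerate(x):
                    weights[k][j] -= lr * grad * value
    return weights, biases

def predict(pixels, weights, biases):
    scores = head_scores(extract_features(pixels), weights, biases)
    return scores.index(max(scores))

# Invented 4-value "images" for three made-up product classes.
images = [
    [1.0, 0.0, 0.5, 0.5], [0.9, 0.1, 0.5, 0.5],  # class 0
    [0.5, 0.5, 1.0, 0.0], [0.5, 0.5, 0.9, 0.1],  # class 1
    [0.9, 0.9, 0.1, 0.1], [0.8, 0.8, 0.2, 0.2],  # class 2
]
labels = [0, 0, 1, 1, 2, 2]

# Features come from the frozen backbone; only the head learns.
features = [extract_features(img) for img in images]
weights, biases = train_head(features, labels, num_classes=3)
print([predict(img, weights, biases) for img in images])
```

The backbone's numbers never change during training; only the small new layer on top is fitted. That is why a few hundred labeled images can be enough in practice, although real fine-tuning often also adjusts some backbone layers at a low learning rate.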
You saw what the first layer sees. You loaded MobileNet and classified live video. You know that almost nobody trains a vision model from scratch anymore, and you know why. Computer vision stops being magic. It's just a stack of convolutions trained on a lot of pictures.