Course 2 · Module 07 · 70 minutes

When the
machine
can see.

Until now, your models took numbers in. Computer vision takes pixels — millions of them — and turns them into meaning. "There's a cat." "That's a stop sign." "Tumor detected." This module loads a real pre-trained model in your browser and runs it on your webcam. Live.

You'll classify: Live webcam frames
You'll see: How CNN filters work
You'll use: TensorFlow.js + MobileNet

Part 01 · The challenge

Why teaching a machine
to see was very hard.

Until about 2012, computers couldn't reliably tell cats from dogs in photos. The hard part isn't reading the pixels; it's working out what they depict. Here are three reasons vision is harder than predicting from a table of numbers.

// Challenge 01

Dimensionality explosion

A modest 224×224 RGB photo has 150,528 numbers. Even a small neural network needs to figure out how those interact — and how to ignore lighting, angle, background.

By the numbers:
224 × 224 × 3 = 150,528 inputs
One dense layer of 1,000 units ≈ 150M weights
And that is just one layer
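The arithmetic above is easy to check yourself. This is a minimal sketch using only the numbers quoted in this section; the variable names are ours.

```python
# Count the inputs in one RGB image and the weights in one dense layer on top of it.
height, width, channels = 224, 224, 3
inputs = height * width * channels

hidden_units = 1000
dense_weights = inputs * hidden_units  # one weight per (input, unit) pair

print(f"inputs per image: {inputs:,}")
print(f"dense weights:    {dense_weights:,}")
```

The exact product is 150,528,000, which the text rounds to 150M.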

// Challenge 02

Spatial structure matters

A patch of pixels that looks like an "ear" means more when it sits near a "head" than near a "foot". Position and relationships are everything. A typical tabular model treats its inputs as unordered columns; a vision model can't.

The CNN trick:
Small filters slide across the image,
looking for local patterns.
Position-aware, parameter-efficient.
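To see what "parameter-efficient" means, compare one convolutional layer with the dense layer from Challenge 01. This is a rough sketch: the choice of 32 filters is an assumption for illustration, and `conv_params` is a helper name we made up.

```python
def conv_params(kernel_size, in_channels, num_filters):
    """Weights plus one bias per filter for a square convolution kernel."""
    weights = kernel_size * kernel_size * in_channels * num_filters
    return weights + num_filters

# Assumed layer: 32 filters of size 3x3 over an RGB (3-channel) input.
conv = conv_params(kernel_size=3, in_channels=3, num_filters=32)

# Dense layer from Challenge 01: every input connected to each of 1,000 units.
dense = 224 * 224 * 3 * 1000

print(f"conv layer:  {conv:,} parameters")
print(f"dense layer: {dense:,} weights")
```

The convolutional layer stays small because the same few weights are reused at every position in the image.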

// Challenge 03

Infinite variation

The same cat looks different from any angle, in any lighting, at any distance, partly occluded by a chair. A robust model must learn that "cat-ness" survives all of these.

Solution: huge training sets
The full ImageNet collection has about 14M labeled images.
Pre-trained models learn from large subsets of it.
You reuse that. (See Part 05.)

Part 02 · Hands on · Live AI

A real CNN.
In your browser. Now.

We're loading MobileNet — a 14MB pre-trained convolutional network that can recognize 1000 categories. Once loaded, point your webcam at things (or upload a photo) and watch the predictions update in real time. No server. Everything runs locally.

How it works.

Wait for the model to load (~14MB download, one-time). Then choose Webcam or Upload. Point at a pet, a coffee mug, a houseplant, or your own face, and see what the model thinks it is. Predictions show the top 5 classes with confidence percentages. Note: MobileNet was designed for on-device vision on phones and other low-power hardware, which is why models of this kind sit behind many phone-camera AI features.
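Where do "top-5 with confidence percentages" come from? Assuming the classifier ends in a softmax (the usual arrangement for ImageNet classifiers), the model produces one raw score per category, softmax turns the scores into probabilities, and the interface shows the five largest. The labels and scores below are made up for illustration; they are not real MobileNet output.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(labels, probs, k=5):
    """Return the k (label, probability) pairs with the highest probability."""
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical labels and scores, invented for this example.
labels = ["coffee mug", "cup", "teapot", "pitcher", "vase", "water bottle"]
logits = [4.0, 2.5, 1.0, 0.5, 0.2, -1.0]

for label, p in top_k(labels, softmax(logits)):
    print(f"{label:<14}{p * 100:5.1f}%")
```

A real run would have 1000 scores instead of six, but the ranking step is the same.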

● All processing is local. Nothing leaves your device. The model runs entirely in this browser tab.

Part 03 · How CNNs see

Four operations.
That's all a CNN does.

A Convolutional Neural Network is just these four things stacked many times. Once you understand them, ResNet, MobileNet, EfficientNet, YOLO — they're all variations on this theme.

Convolution

A tiny filter (3×3) slides across the image. At each position, it detects whether a specific pattern (edge, color blob, texture) is present.

// dozens to hundreds of filters per layer

Activation

Each filter's output is passed through a non-linearity (usually ReLU: max(0, x)). This is what lets the network learn complex patterns instead of only linear ones.

// ReLU is the standard choice
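ReLU is small enough to write in one line. Here is a minimal sketch applied to a made-up 2×2 feature map: negative responses are clipped to zero, positive ones pass through unchanged.

```python
def relu(x):
    """Rectified linear unit: max(0, x)."""
    return max(0.0, x)

# A made-up 2x2 feature map with positive and negative filter responses.
feature_map = [[-2.0, 0.5],
               [3.0, -0.1]]

activated = [[relu(v) for v in row] for row in feature_map]
print(activated)  # [[0.0, 0.5], [3.0, 0.0]]
```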

Pooling

Shrinks the image. Keeps the strongest signals (max-pooling), throws away precise pixel position. Builds invariance to small shifts and reduces compute load.

// usually 2×2 max pool, halves size
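A 2×2 max pool can be sketched in a few lines. The feature map below is made up; the function keeps the largest value in each non-overlapping 2×2 block, so a 4×4 input becomes 2×2.

```python
def max_pool_2x2(image):
    """Downsample a 2D list by taking the max of each non-overlapping 2x2 block."""
    rows, cols = len(image), len(image[0])
    return [
        [
            max(image[r][c], image[r][c + 1], image[r + 1][c], image[r + 1][c + 1])
            for c in range(0, cols - 1, 2)
        ]
        for r in range(0, rows - 1, 2)
    ]

feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]

pooled = max_pool_2x2(feature_map)
print(pooled)  # [[4, 2], [2, 8]]
```

Notice that the 4 in the top-left block could sit in any of that block's four cells and the output would not change. That is the "invariance to small shifts" described above.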

Stack & classify

Repeat conv → activate → pool many times. Early layers detect edges; deeper layers detect "wheel" or "eye"; final layers classify ("car" or "face").

// MobileNet has ~28 such layers
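Stacking also shrinks the picture step by step. As a rough sketch, assume a 224×224 input and five stages that each halve the side length (the number of stages is our assumption for illustration, not a figure from this module):

```python
def spatial_sizes(start, stages):
    """Side length of a square feature map after each halving stage."""
    sizes = [start]
    for _ in range(stages):
        sizes.append(sizes[-1] // 2)
    return sizes

print(spatial_sizes(224, 5))  # [224, 112, 56, 28, 14, 7]
```

By the last stage, each position in the 7×7 grid summarizes a large region of the original photo, which is why deep layers can respond to whole objects rather than edges.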

Part 04 · Hands on · Convolution

See what the
first layer sees.

CNNs apply small filters (kernels) to the image. Here you can see what such filters do on a synthetic image you can swap. Click each filter to see how a 3×3 kernel changes the picture. Edge detectors are among the patterns CNNs typically learn in their first layer.

The CNN's first step.

Each "filter" below is a 3×3 grid of numbers. To apply it, center the grid on each pixel of the image in turn, multiply each grid number by the pixel beneath it, and sum the nine products. The result is a new image that highlights whatever pattern that filter is tuned for. Real CNNs learn thousands of such filters automatically, which adds up to millions of weights.

Sharp transitions in pixel brightness (edges of objects) become bright; smooth areas become black. This is the kind of filter a CNN's first layer typically learns.
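Here is the sliding operation written out, as a minimal sketch. The kernel is a common Laplacian-style edge detector chosen for illustration, and the image is a made-up grid with a dark left half and a bright right half. Like most deep learning libraries, the code does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
def convolve2d(image, kernel):
    """Slide a square kernel over a 2D list (no padding, stride 1)."""
    k = len(kernel)
    out_rows = len(image) - k + 1
    out_cols = len(image[0]) - k + 1
    output = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            total = 0
            for i in range(k):
                for j in range(k):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output

# Synthetic 5x6 image: dark left half (0), bright right half (9).
image = [[0, 0, 0, 9, 9, 9] for _ in range(5)]

# Assumed edge kernel: the weights sum to zero, so flat regions give 0.
edge_kernel = [
    [-1, -1, -1],
    [-1,  8, -1],
    [-1, -1, -1],
]

response = convolve2d(image, edge_kernel)
for row in response:
    print([abs(v) for v in row])  # each row prints [0, 27, 27, 0]
```

The flat dark and flat bright regions produce 0, while the two columns next to the brightness jump produce a strong response.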

// The takeaway

A single filter detects one feature type. A CNN layer applies dozens or hundreds of filters in parallel — so each pixel in the output represents "how much of each pattern is at this location."
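To make "how much of each pattern is at this location" concrete, here is a small sketch with two assumed filters applied to the same made-up image. The filter names and values are ours, chosen for illustration.

```python
def apply_filter(image, kernel):
    """Slide a square kernel over a 2D list (no padding, stride 1)."""
    k = len(kernel)
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(k) for j in range(k))
            for c in range(len(image[0]) - k + 1)
        ]
        for r in range(len(image) - k + 1)
    ]

# Two assumed filters: one tuned to vertical edges, one to horizontal edges.
filters = {
    "vertical_edge": [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]],
    "horizontal_edge": [[-1, -1, -1], [0, 0, 0], [1, 1, 1]],
}

# Made-up 4x4 image containing a vertical edge (dark left, bright right).
image = [[0, 0, 9, 9] for _ in range(4)]

feature_maps = {name: apply_filter(image, kernel) for name, kernel in filters.items()}

# One number per filter at the top-left output location.
features_at_origin = {name: fmap[0][0] for name, fmap in feature_maps.items()}
print(features_at_origin)  # {'vertical_edge': 27, 'horizontal_edge': 0}
```

The vertical-edge filter fires and the horizontal-edge filter stays silent. A real layer does the same thing with many more filters, so each location carries a long list of pattern scores.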

Stack layers, and the patterns get more abstract. Layer 1: edges. Layer 5: textures. Layer 15: "this is a wheel". Layer 25: "this is a car."

Part 05 · The cheat code

Why you'll never train
a CNN from scratch.

Training a CNN from zero can require thousands of GPU-hours and millions of labeled images, so few teams do it. Instead, take a pre-trained model (like MobileNet, pre-trained on the roughly 1.28 million images of the ImageNet-1k benchmark) and fine-tune it on your specific task. This is "transfer learning", one of the most important practical skills in modern CV.

From scratch

The hard way

Training data: Millions of labeled images
GPU time: Weeks
Cost: $5,000+ in cloud GPUs
Expertise: Deep ML knowledge
When used: Almost never anymore

Transfer learning

The smart way

Training data: 100-1000 labeled images
GPU time: Minutes to hours
Cost: Often free (Colab)
Expertise: A few lines of code
When used: The large majority of production CV

The intuition

A pre-trained model like MobileNet already knows what edges, textures, shapes, eyes, wheels, and fur look like, because it was pre-trained on more than a million labeled ImageNet photos. To make it recognize your 5 specific products, you don't re-teach it everything. You replace the last layer (the "classifier") with one for your 5 classes, and fine-tune it on a few hundred of your own labeled images. That's transfer learning. It's how many "AI for retail" / "AI for medical imaging" / "AI for security" products get built today.
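The idea of "keep the backbone, train a new last layer" can be sketched without any ML library. Everything here is a toy under stated assumptions: `frozen_features` is a made-up stand-in for a pre-trained network, the dataset is six invented 2D points with three classes (not images, and not five products), and only the new classifier head is updated.

```python
import math

def frozen_features(x):
    """Stand-in for a frozen pre-trained backbone: a fixed, untrainable mapping."""
    a, b = x
    return [a, b, a * b, 1.0]  # the constant 1.0 acts as a bias input

def softmax(scores):
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict(weights, x):
    f = frozen_features(x)
    return softmax([sum(w * v for w, v in zip(row, f)) for row in weights])

def head_loss(weights, data):
    """Mean cross-entropy of the classifier head on the frozen features."""
    return -sum(math.log(predict(weights, x)[label]) for x, label in data) / len(data)

def train_head(data, num_classes, steps=200, lr=0.1):
    """Full-batch gradient descent on the head only; the backbone never changes."""
    dim = len(frozen_features(data[0][0]))
    weights = [[0.0] * dim for _ in range(num_classes)]
    for _ in range(steps):
        grads = [[0.0] * dim for _ in range(num_classes)]
        for x, label in data:
            f = frozen_features(x)
            probs = predict(weights, x)
            for k in range(num_classes):
                error = probs[k] - (1.0 if k == label else 0.0)
                for d in range(dim):
                    grads[k][d] += error * f[d] / len(data)
        for k in range(num_classes):
            for d in range(dim):
                weights[k][d] -= lr * grads[k][d]
    return weights

# Tiny invented dataset: (point, class label).
data = [((1.0, 0.0), 0), ((0.9, 0.1), 0),
        ((0.0, 1.0), 1), ((0.1, 0.9), 1),
        ((-1.0, -1.0), 2), ((-0.9, -0.8), 2)]

initial_loss = head_loss([[0.0] * 4 for _ in range(3)], data)
trained = train_head(data, num_classes=3)
final_loss = head_loss(trained, data)
print(f"loss before: {initial_loss:.3f}, after: {final_loss:.3f}")
```

In a real project the backbone would be MobileNet and the head would be trained by a library such as TensorFlow.js, but the division of labor is the same: the expensive part is reused, and only a small classifier is learned from your data.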

Course 2 · Module 07 complete

You ran a real CNN
in your browser.

You saw what the first layer sees. You loaded MobileNet and classified live video. You know that almost nobody trains CV models from scratch anymore, and you know why. Computer vision has stopped being magic. It's a stack of convolutions trained on a lot of pictures.

Up next · Course 2 · Module 08

NLP Foundations

From pixels to words. You'll explore how machines turn language into numbers — tokenization, embeddings, semantic similarity — and visualize a 3D word embedding space where related words cluster naturally.
