Course 1 · Module 05 · 42 minutes

AI that sees, hears, & creates.

Three real AI models. Running in your browser. Right now. Look into your webcam — the AI will see you. Talk into your mic — it'll transcribe you. Pick a prompt — watch it generate from pure noise. Welcome to the part of the course that feels like magic.

You'll run · 3 live models

In your · Browser

No installs · Zero

See the magic

Vision

// Computer vision

Speech & audio

// Speech-to-text · noise cancel

Generation

// Images · text · sound · video

Part 01 · AI that sees

How a machine can see
without ever having eyes.

Every image is just a grid of numbers (brightness values). Computer vision is the art of finding patterns in those numbers — patterns that mean "cat" or "stop sign" or "tumor." Three levels of skill:

Classification

// What is it?

"This image contains a cat." One label for the whole image. The simplest form of vision AI — and where the modern boom started in 2012.

Detection

// What & where?

"There's a cat here, a dog there, a couch in the back." Multiple objects, each with a box around it. This is what the model below does.

Segmentation

// Which pixels?

"These exact pixels are cat-pixels. These are sofa-pixels." Used in medical imaging, self-driving cars, and any task that needs precise shape.

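To make the three levels concrete, here is a minimal sketch on a tiny 4x4 image. It is a toy, not how a trained network works: the brightness threshold `BRIGHT` and the label names are assumptions made up for illustration, and a real model learns its patterns from data instead of using a fixed cutoff. What it does show is the shape of each answer: one label, one box, or one yes/no per pixel.

```python
# A tiny 4x4 grayscale "image": each number is a brightness
# from 0 (black) to 255 (white).
image = [
    [0,   0,   0, 0],
    [0, 200, 220, 0],
    [0, 210, 230, 0],
    [0,   0,   0, 0],
]

BRIGHT = 128  # hypothetical cutoff between "object" and "background"

# Segmentation: one yes/no answer for every pixel.
mask = [[pixel > BRIGHT for pixel in row] for row in image]

# Detection: the smallest box that contains every "object" pixel.
rows = [r for r, row in enumerate(mask) if any(row)]
cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
# Box format is (left, top, right, bottom); None when nothing was found.
box = (min(cols), min(rows), max(cols), max(rows)) if rows else None

# Classification: a single label for the whole image.
label = "bright object" if rows else "empty"

print(label)  # bright object
print(box)    # (1, 1, 2, 2)
```
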
🌟 Try it: live object detection


Real object-detection AI runs in your browser. Click below and point your camera at anything. It'll find people, objects, and animals across 80 categories, learned from about 330,000 labeled images.

or scroll down to upload an image instead

Detections

Nothing yet — start the camera

FPS: — Total: 0

📷 Upload image instead

What's happening under the hood

That's a real convolutional neural network, COCO-SSD, running entirely on your device. It was trained on COCO, a dataset of about 330,000 images originally published by Microsoft. No data leaves your browser. The model is ~10MB, downloaded once. After that, it processes each frame in roughly 50 milliseconds, depending on your device.
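
As a rough sketch of what happens to each frame, the snippet below turns the quoted 50 ms frame time into a frame rate and filters one frame's detections by confidence. The detection tuples, the `MIN_SCORE` cutoff, and the helper names are hypothetical, chosen only for illustration; they are not the demo's actual code.

```python
FRAME_TIME_MS = 50  # per-frame processing time quoted above


def frames_per_second(frame_time_ms: float) -> float:
    """Convert the time spent on one frame into frames per second."""
    return 1000.0 / frame_time_ms


# Hypothetical raw output for one frame: (label, confidence, box).
raw_detections = [
    ("person", 0.92, (34, 20, 180, 310)),
    ("cup", 0.61, (200, 150, 240, 210)),
    ("dog", 0.18, (5, 5, 60, 40)),
]

MIN_SCORE = 0.5  # assumed cutoff; low-confidence guesses are dropped


def keep_confident(detections, min_score=MIN_SCORE):
    """Keep only detections the model is reasonably sure about."""
    return [d for d in detections if d[1] >= min_score]


shown = keep_confident(raw_detections)
print(frames_per_second(FRAME_TIME_MS))    # 20.0
print([label for label, _, _ in shown])    # ['person', 'cup']
```
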

Part 02 · AI that hears

From sound waves to written words,
in real time.

Every sound is a wave. Speech-to-text turns those waves into text by breaking them into tiny chunks, identifying phonemes (the building blocks of speech), and stringing them into words. Try it yourself.

Tap the mic and start talking — your words will appear here as you speak...

Your browser doesn't fully support the Web Speech API. Try Chrome or Edge for the best experience.
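
The "tiny chunks" step described above can be sketched in a few lines. The sample rate, chunk length, and step size are assumed values picked for illustration, not the settings of the speech engine in this demo, and `split_into_frames` is a hypothetical helper name.

```python
import math

SAMPLE_RATE = 16_000  # samples per second (assumed)
FRAME_MS = 25         # length of each chunk in milliseconds (assumed)
HOP_MS = 10           # gap between chunk starts in milliseconds (assumed)


def split_into_frames(samples, sample_rate=SAMPLE_RATE,
                      frame_ms=FRAME_MS, hop_ms=HOP_MS):
    """Cut a waveform into short, overlapping chunks."""
    frame_len = sample_rate * frame_ms // 1000  # 400 samples
    hop_len = sample_rate * hop_ms // 1000      # 160 samples
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop_len):
        frames.append(samples[start:start + frame_len])
    return frames


# One second of a pure 440 Hz tone stands in for recorded speech.
wave = [math.sin(2 * math.pi * 440 * n / SAMPLE_RATE)
        for n in range(SAMPLE_RATE)]

frames = split_into_frames(wave)
print(len(frames), len(frames[0]))  # 98 400
```

Each chunk is short enough that the sound inside it barely changes, which is what lets a model label it with a single phoneme guess.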

Two AI models, working together

First, an acoustic model converts the raw audio waveform into phonemes (the building blocks of speech, like "k" or "ah"). Then a language model (yes, from the same family as ChatGPT) turns those phonemes into actual words, using context to decide between "I see" and "icy."
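
Here is a toy version of that second step: two transcriptions that sound alike are scored by how often their word pairs occur, and the more plausible one wins. The counts in `bigram_counts` are invented for illustration, and real language models are far larger and learned from data, so treat this as a sketch of the idea only.

```python
import math

# Hypothetical counts of how often each word pair appears in some text.
bigram_counts = {
    ("i", "see"): 50,
    ("see", "what"): 30,
    ("what", "you"): 40,
    ("you", "mean"): 35,
    ("icy", "what"): 1,
    ("the", "road"): 45,
    ("road", "is"): 25,
    ("is", "icy"): 20,
    ("is", "i"): 1,
}


def score(words):
    """Average log count over adjacent word pairs; higher is more plausible."""
    pairs = list(zip(words, words[1:]))
    if not pairs:
        return 0.0
    # Adding 1 keeps unseen pairs from producing log(0).
    return sum(math.log(bigram_counts.get(p, 0) + 1) for p in pairs) / len(pairs)


def pick(candidates):
    """Choose the candidate transcription with the highest score."""
    return max(candidates, key=score)


first = pick([["i", "see", "what", "you", "mean"],
              ["icy", "what", "you", "mean"]])
second = pick([["the", "road", "is", "i", "see"],
               ["the", "road", "is", "icy"]])

print(" ".join(first))   # i see what you mean
print(" ".join(second))  # the road is icy
```
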

Part 03 · AI that creates

From pure noise
to a beautiful image.

Tools like DALL-E and Midjourney don't paint. They denoise. They start with a square of random pixels and gradually subtract noise until an image appears. Sounds impossible. Watch:

Step 0 / 30 · Pure noise

"a sunset over mountains"

Just static.

Pure random noise. No structure. No meaning. This is where every generated image starts.

Denoising progress 0 / 30

Try a different prompt

"sunset over mountains" "ocean horizon at dusk" "misty forest morning"

This is genuinely how diffusion models work. Real models (Stable Diffusion, DALL-E, Midjourney) follow the same recipe: start with noise, then denoise step by step, guided by your text prompt. The math is heavier (each "denoising step" is a full neural network pass), but the concept is the same as what you just saw. The "creativity" is really a learned ability to predict which part of a noisy image is noise, and remove it.
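
The loop itself is short enough to sketch. In this toy, `toy_denoiser` is a stand-in for the neural network and it cheats by comparing against a known `target`; a real model has no target and must predict the noise from the noisy image and the prompt. The 5-pixel "image" and all names here are assumptions for illustration.

```python
import random

STEPS = 30  # same number of steps as the demo above


def toy_denoiser(noisy, target):
    """Stand-in for the network: returns the noise it 'sees' in the image.

    A real model predicts this from the noisy image and the prompt.
    This toy cheats by subtracting a known target.
    """
    return [n - t for n, t in zip(noisy, target)]


def generate(target, steps=STEPS, seed=0):
    rng = random.Random(seed)
    # Step 0: pure random noise, no structure.
    image = [rng.gauss(0.0, 1.0) for _ in target]
    for step in range(steps):
        predicted_noise = toy_denoiser(image, target)
        remaining = steps - step
        # Remove an equal share of the remaining noise at every step.
        image = [x - e / remaining for x, e in zip(image, predicted_noise)]
    return image


# Hypothetical 5-pixel brightness gradient standing in for the prompt's image.
target = [0.1, 0.4, 0.9, 0.4, 0.1]
result = generate(target)
error = max(abs(r - t) for r, t in zip(result, target))
print(error < 1e-9)  # True
```

Dividing by the number of remaining steps makes the noise shrink evenly, so the final step removes whatever is left.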

Module 05 complete

You just ran three AIs
that were impossible a decade ago.

Real object detection. Real speech recognition. A real diffusion process. All inside a single tab. That tells you something about where we are — and where this is going.

Up next · Module 06

When AI gets it wrong

Bias, hallucinations, deepfakes, and the limits of today's AI — with a "Real or AI?" guessing game where most people score worse than chance.

Continue to Module 06
