Three real AI models. Running in your browser. Right now. Look into your webcam — the AI will see you. Talk into your mic — it'll transcribe you. Pick a prompt — watch it generate from pure noise. Welcome to the part of the course that feels like magic.
// Computer vision
// Speech-to-text · noise cancel
// Images · text · sound · video
Every image is just a grid of numbers (brightness values). Computer vision is the art of finding patterns in those numbers — patterns that mean "cat" or "stop sign" or "tumor." Three levels of skill:
Classification: "This image contains a cat." One label for the whole image. The simplest form of vision AI, and where the modern boom started in 2012.
Detection: "There's a cat here, a dog there, a couch in the back." Multiple objects, each with a box around it. This is what the model below does.
Segmentation: "These exact pixels are cat-pixels. These are sofa-pixels." Used in medical imaging, self-driving cars, and any task that needs precise shape.
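To make the three levels concrete, here is a minimal sketch of what each one outputs for the same tiny image. It is a toy, not how a real vision model works: the 6x6 image is made up, and a simple brightness cutoff (the hypothetical `THRESHOLD`) stands in for the patterns a trained network would learn. The function names `segment`, `detect`, and `classify` are illustrative only.

```python
# Toy 6x6 grayscale image: 0 = black, 255 = white.
# The bright blob stands in for "the object"; all values are made up.
image = [
    [10,  12,  11,  10, 13, 12],
    [11, 200, 210,  12, 10, 11],
    [12, 220, 230, 215, 11, 10],
    [10, 205, 225, 210, 12, 13],
    [13,  11,  12,  10, 11, 12],
    [12,  10,  13,  11, 10, 11],
]

THRESHOLD = 128  # hypothetical cutoff between "background" and "object"


def segment(img):
    """Segmentation: one yes/no decision for every pixel."""
    return [[1 if value > THRESHOLD else 0 for value in row] for row in img]


def detect(mask):
    """Detection: the tightest box (left, top, right, bottom) around the object."""
    coords = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    if not coords:
        return None
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return (min(xs), min(ys), max(xs), max(ys))


def classify(mask):
    """Classification: a single label for the whole image."""
    return "object" if any(any(row) for row in mask) else "background"


mask = segment(image)   # per-pixel answer
box = detect(mask)      # one box per object (here: just one)
label = classify(mask)  # one label for the whole image

print(label)
print(box)
for row in mask:
    print(row)
```

Notice how the output grows with each level: one word, then four numbers per object, then a full grid the same size as the image. That is roughly why segmentation is the most demanding of the three.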
Real object-detection AI runs in your browser. Click below and point your camera at anything. It can find people, objects, and animals across 80 categories, using a model trained on millions of images.
Every sound is a wave. Speech-to-text turns those waves into text by breaking them into tiny chunks, identifying phonemes (the building blocks of speech), and stringing them into words. Try it yourself.
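The "tiny chunks" step can be sketched in a few lines. This is only the first stage of the pipeline, under stated assumptions: a 16 kHz sample rate and 25 ms chunks taken every 10 ms are common choices for speech, but they are assumed here, not taken from the demo. The signal is synthetic (silence followed by a tone), and measuring energy per chunk is a simple stand-in for the much richer features a real recognizer extracts before it gets anywhere near phonemes.

```python
import math

SAMPLE_RATE = 16_000        # samples per second (assumed)
FRAME_MS, HOP_MS = 25, 10   # chunk length and step between chunks (assumed)


def make_tone(freq_hz, n_samples, amplitude):
    """A pure sine wave, standing in for recorded sound."""
    return [
        amplitude * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE)
        for i in range(n_samples)
    ]


def split_into_frames(samples, frame_len, hop_len):
    """Cut the wave into overlapping chunks ("frames")."""
    return [
        samples[start:start + frame_len]
        for start in range(0, len(samples) - frame_len + 1, hop_len)
    ]


def energy(frame):
    """Average squared amplitude: near zero for silence, larger for sound."""
    return sum(s * s for s in frame) / len(frame)


tenth_of_a_second = SAMPLE_RATE // 10
signal = [0.0] * tenth_of_a_second + make_tone(440, tenth_of_a_second, 0.5)

frame_len = SAMPLE_RATE * FRAME_MS // 1000  # 400 samples per chunk
hop_len = SAMPLE_RATE * HOP_MS // 1000      # a new chunk every 160 samples
frames = split_into_frames(signal, frame_len, hop_len)
energies = [energy(f) for f in frames]

for i, e in enumerate(energies):
    print(f"frame {i:2d}  energy {e:.4f}")
```

The chunks overlap on purpose: a sound that straddles the edge of one chunk sits comfortably inside the next, so nothing gets cut in half.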
Tools like DALL-E and Midjourney don't paint. They denoise. They start with a square of random pixels and gradually subtract noise until an image appears. Sounds impossible. Watch:
Pure random noise. No structure. No meaning. This is where every generated image starts.
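The "start from noise, subtract noise, repeat" loop can be sketched as a toy. One loud caveat: in a real image generator, a trained neural network guesses the noise at each step, guided by your prompt. Here the hypothetical `predict_noise` function cheats by looking at the answer, and the step count and step size are arbitrary. So this shows the shape of the loop only, not how DALL-E or Midjourney are actually implemented.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# A tiny 1-D "image" the process should arrive at. In a real system nobody
# hands the model the target; a trained network predicts the noise instead.
target = [0.0, 0.2, 0.9, 1.0, 0.9, 0.2, 0.0, 0.0]

STEPS = 20
STEP_SIZE = 0.2  # fraction of the predicted noise removed per step (assumed)


def predict_noise(current, goal):
    """Hypothetical stand-in for the neural network: returns the exact noise."""
    return [c - g for c, g in zip(current, goal)]


def error(current, goal):
    """Average distance between the current values and the target."""
    return sum(abs(c - g) for c, g in zip(current, goal)) / len(goal)


# Step 0: pure random noise, no structure at all.
x = [random.gauss(0.0, 1.0) for _ in target]
errors = [error(x, target)]

for _ in range(STEPS):
    noise = predict_noise(x, target)
    # Remove only a little of the predicted noise each time.
    x = [xi - STEP_SIZE * ni for xi, ni in zip(x, noise)]
    errors.append(error(x, target))

print(f"error at start: {errors[0]:.3f}")
print(f"error at end:   {errors[-1]:.3f}")
```

Taking many small steps instead of one big jump matters in real systems, because the network's noise guess is imperfect and each step gives it a chance to correct course.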
Wrong answers explain themselves. Aim for 4/5 to move on.
Real object detection. Real speech recognition. A real diffusion process. All inside a single tab. That tells you something about where we are — and where this is going.