AI Skill Course · Course 3 · Expert
Module 05 of 14
Course 3 · Module 05 · 90 minutes

From pure noise.
To photographs.

Diffusion models generate images by learning to reverse a noising process. Train a neural network to predict the noise in a noisy image. Then start from pure static and ask it to denoise, step by step, until something coherent emerges. The math is surprisingly simple. The results power Stable Diffusion, DALL-E 3, Midjourney, and Flux. This module unpacks how that hallucination-from-static actually works.

You'll watch: Images dissolve to noise
You'll feel: Reverse generation
You'll see: Why prompts steer
// reverse diffusion · 5 steps shown: t=1000 → t=750 → t=500 → t=250 → t=0 (pure noise → denoising → clean image)
Part 01 · Two processes

One pair of opposites.
That's the whole idea.

A diffusion model has a forward process you never run at inference, and a reverse process you always run. The forward process is fixed (no learning). The reverse process is the entire model.

// Forward process · fixed · no learning

Add noise.

x₀ → x₁ → x₂ → ... → x_T

Take a clean image. Add a tiny amount of Gaussian noise. Repeat ~1000 times. By step T, the image is pure Gaussian noise — every pixel independent of every other.

This process has no learnable parameters. It's just a fixed math operation. We do it during training, not inference. Its purpose: generate training examples of (clean image, noisy image, amount of noise added).

x_t = √α̅_t · x₀ + √(1−α̅_t) · ε
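
The closed-form equation above means you never have to loop through 1000 small noising steps: you can jump straight to any timestep. A minimal sketch in plain Python, assuming a hypothetical 4-pixel "image" scaled to [-1, 1]; the function name `q_sample` and the sample values are illustrative, not from the course.

```python
import math
import random

def q_sample(x0, alpha_bar_t, rng):
    """Noise x0 directly to timestep t: x_t = sqrt(a)*x0 + sqrt(1-a)*eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]      # fresh standard-normal noise
    signal = math.sqrt(alpha_bar_t)              # signal multiplier
    noise = math.sqrt(1.0 - alpha_bar_t)         # noise multiplier
    x_t = [signal * p + noise * e for p, e in zip(x0, eps)]
    return x_t, eps

rng = random.Random(0)
x0 = [0.5, -0.25, 1.0, 0.0]                      # hypothetical 4-pixel image

x_clean, _ = q_sample(x0, 1.0, rng)              # alpha_bar = 1: untouched image
x_half, _ = q_sample(x0, 0.5, rng)               # alpha_bar = 0.5: equal parts signal and noise variance
x_noise, eps = q_sample(x0, 0.0, rng)            # alpha_bar = 0: pure noise

print(x_clean)
print(x_half)
print(x_noise)
```

The function also returns the noise it used, because that noise is exactly the training target discussed later in the module.
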
// Reverse process · learned · the whole model

Remove noise.

x_T → x_{T-1} → ... → x₁ → x₀

Train a neural network to predict the noise in any noisy image. At inference, start with pure noise and ask the network "what noise is in this?" Subtract a fraction of that prediction. Repeat.

After ~50-1000 steps, you have a coherent image. The network never sees the clean target during inference — it just gets really good at "this is what the noise looks like" → subtract → iterate.

predict ε with network · then x_{t-1} = denoise step using predicted ε
Part 02 · Hands on · Forward diffusion

Watch an image
dissolve.

Pick a subject. Drag the timestep slider. As t grows from 0 to 1000, more and more Gaussian noise gets blended in. The math uses the actual cosine schedule from Improved DDPM (Nichol & Dhariwal, 2021).

What you're seeing.

Each pixel at timestep t is √α̅_t · original + √(1−α̅_t) · gaussian_noise. At t=0 the image is untouched. At t=1000 the original is gone — you're looking at pure standard-normal noise. The noise pattern itself is deterministic per subject — you're literally watching the same noise sample get amplified.
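
The slider readouts can be reproduced in a few lines. This is a sketch of the cosine schedule as published in Improved DDPM, assuming the paper's offset s = 0.008 and T = 1000; the function names are illustrative, and the demo on this page may clip or round differently.

```python
import math

def cosine_alpha_bar(t, T=1000, s=0.008):
    """alpha_bar_t = f(t) / f(0), with f(t) = cos^2(((t/T + s) / (1 + s)) * pi/2)."""
    f = lambda u: math.cos((u / T + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(0)

def readout(t):
    """Return the four values the slider panel displays for timestep t."""
    a = cosine_alpha_bar(t)
    snr = a / (1 - a) if a < 1 else float("inf")   # signal-to-noise ratio
    return a, math.sqrt(a), math.sqrt(1 - a), snr

for t in (0, 250, 500, 750, 1000):
    a, sig, noi, snr = readout(t)
    print(f"t={t:4d}  alpha_bar={a:.4f}  signal={sig:.4f}  noise={noi:.4f}  SNR={snr:.4g}")
```

At every timestep the squared signal and noise multipliers sum to 1, which is why the blended image keeps roughly unit variance all the way from photo to static.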

// Pick a subject · drag the timestep · readouts at t = 0 / 1000
α̅_t (signal kept): 1.0000
√α̅_t (signal multiplier): 1.0000
√(1−α̅_t) (noise multiplier): 0.0000
SNR (signal-to-noise): no value shown
The noise schedule: The cosine schedule destroys signal more gradually, so the image stays recognizable through the early and middle steps (the model has plenty to learn from). The linear schedule reaches near-pure noise well before t=1000, so it wastes capacity at the very-noisy end where everything looks similar anyway.
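
To put rough numbers on the schedule comparison, here is a sketch that computes α̅_t for both schedules. The linear ramp assumes the β range from the original DDPM paper (1e-4 to 0.02 over 1000 steps), which this page does not state; function names are illustrative.

```python
import math

T = 1000

def cosine_alpha_bar(t, s=0.008):
    """Cosine schedule from Improved DDPM (offset s is the paper's value)."""
    f = lambda u: math.cos((u / T + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(0)

def linear_alpha_bar(t, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_i) over the first t steps of a linear beta ramp."""
    a = 1.0
    for i in range(t):
        beta = beta_start + (beta_end - beta_start) * i / (T - 1)
        a *= 1.0 - beta
    return a

# Fraction of signal variance kept at a few timesteps, for each schedule
for t in (250, 500, 750, 1000):
    print(f"t={t:4d}  linear={linear_alpha_bar(t):.4f}  cosine={cosine_alpha_bar(t):.4f}")
```

Under these assumptions the linear schedule has already dropped below 1% signal by t=750, while the cosine schedule still retains a visible fraction there.
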
Part 03 · Hands on · Reverse diffusion

Watch an image
emerge.

Now flip it. Start from pure Gaussian noise and run the denoising sequence in reverse. Hit play to animate the full 50-step DDIM sampling that real models use. This is what every image generator does at inference.

What's happening.

At each step the model would predict the noise present at the current timestep, subtract a portion of it, and pass the result forward. We're simulating that here: starting at t=1000 (pure noise) and stepping down to t=0 in 50 steps. In a real model, every one of these steps is one forward pass through a U-Net.
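
The loop described above can be sketched with a deterministic DDIM update. There is no trained network here: `oracle_eps` is a hypothetical stand-in that cheats by knowing the target image, much like the simulation on this page. The clamp on α̅_t and the 4-pixel target are also assumptions made for the sketch.

```python
import math
import random

def alpha_bar(t, T=1000, s=0.008):
    f = lambda u: math.cos((u / T + s) / (1 + s) * math.pi / 2) ** 2
    return max(f(t) / f(0), 1e-5)   # clamp (assumption): avoids dividing by ~0 at t=T

def oracle_eps(x_t, t, target):
    """Stand-in for the U-Net: solves the forward equation for eps using the known target."""
    a = alpha_bar(t)
    return [(x - math.sqrt(a) * y) / math.sqrt(1 - a) for x, y in zip(x_t, target)]

def ddim_step(x_t, eps_hat, t, t_prev):
    """One deterministic DDIM step (eta = 0) from timestep t down to t_prev."""
    a_t, a_prev = alpha_bar(t), alpha_bar(t_prev)
    # Estimate the clean image implied by the predicted noise...
    x0_pred = [(x - math.sqrt(1 - a_t) * e) / math.sqrt(a_t) for x, e in zip(x_t, eps_hat)]
    # ...then re-noise that estimate to the lower noise level of t_prev.
    return [math.sqrt(a_prev) * p + math.sqrt(1 - a_prev) * e for p, e in zip(x0_pred, eps_hat)]

rng = random.Random(0)
target = [0.5, -0.25, 1.0, 0.0]               # hypothetical 4-pixel target image
x = [rng.gauss(0.0, 1.0) for _ in target]     # start from pure Gaussian noise
timesteps = list(range(1000, -1, -20))        # 1000, 980, ..., 0: 50 steps

for t, t_prev in zip(timesteps[:-1], timesteps[1:]):
    x = ddim_step(x, oracle_eps(x, t, target), t, t_prev)

print([round(v, 6) for v in x])               # matches target up to floating-point error
```

With a perfect noise predictor the loop lands on the target exactly. A real network only approximates ε, which is why it needs many small corrections rather than one big jump.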

// Pick the target image · sample · readouts at step 0 / 50
noise (step 0) → image (step 50)
current timestep t: 1000
α̅_t (signal recovered): 0.0000
total inference time (real model): ~2.5s on H100
DDIM sampling: DDPM (original) needed all 1000 steps. DDIM (2021) reformulated sampling as a non-Markovian process, so the same trained model can generate in far fewer steps. 50 steps became standard. Modern samplers (DPM++, Euler-A) get down to ~20-25.
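
Fewer steps simply means visiting a subset of the 1000 training timesteps. A minimal sketch, assuming evenly spaced timesteps; real samplers use several different spacing conventions, and the function name is illustrative.

```python
def sampling_timesteps(num_steps, T=1000):
    """Evenly spaced timesteps from T down toward 0, one per network call."""
    return [round(i * T / num_steps) for i in range(num_steps, 0, -1)]

ddim_50 = sampling_timesteps(50)
fast_20 = sampling_timesteps(20)

print(ddim_50[:3], "...", ddim_50[-1])   # [1000, 980, 960] ... 20
print(len(ddim_50), len(fast_20))        # 50 20
```

Each entry is the timestep the network is evaluated at; the step after the last entry lands on t=0, the clean image.
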
Part 04 · The training trick

Don't predict the image.
Predict the noise.

// training loop · per step
clean x₀ · from dataset
→ random t, ε · t ~ U(0,T), ε ~ N(0,I)
→ noisy x_t · add forward noise
→ U-Net predicts ε given (x_t, t) · ~1B parameters · convolutional · self-attention
→ predicted ε̂ · network's best guess
→ L = ‖ε − ε̂‖² · mean-squared-error · backprop · repeat

The brilliant inversion.

Naïve approach: train a network to map a noisy image straight to a clean image in one shot. Doesn't work well. The mapping is one-to-many (many clean images could explain the same noisy patch), so the network learns blurry averages.

Smart approach: train the network to predict the noise itself. At training time we know exactly what noise was added (we generated it). So this is a clean supervised regression problem — predict ε from (x_t, t).

At inference, we subtract a fraction of the predicted noise at each step. The cleverness: the same network runs at every step, given only the current noisy image and the timestep t. The network learns "what does noise at timestep 500 look like?" and applies that knowledge stepwise.

One small architecture, used 50-1000 times, makes pictures. The training objective fits on one line: L = ‖ε − ε̂(x_t, t)‖². That's it. That's the whole thing.
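
One training step, minus the backprop, can be sketched in plain Python. The `model` argument is a hypothetical callable standing in for the U-Net, and `zero_model` is an untrained placeholder; a real implementation would use an autodiff framework to update weights from this loss. The cosine schedule and its constants are assumed.

```python
import math
import random

def alpha_bar(t, T=1000, s=0.008):
    f = lambda u: math.cos((u / T + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(0)

def training_loss(x0, model, rng, T=1000):
    """Sample (t, eps), build x_t with the forward equation, score the noise prediction."""
    t = rng.randint(1, T)                                   # random timestep
    eps = [rng.gauss(0.0, 1.0) for _ in x0]                 # the noise we add (and know)
    a = alpha_bar(t)
    x_t = [math.sqrt(a) * p + math.sqrt(1 - a) * e for p, e in zip(x0, eps)]
    eps_hat = model(x_t, t)                                 # network's best guess
    return sum((e - h) ** 2 for e, h in zip(eps, eps_hat)) / len(x0)   # mean squared error

def zero_model(x_t, t):
    """Untrained placeholder: always predicts zero noise."""
    return [0.0] * len(x_t)

x0 = [0.5, -0.25, 1.0, 0.0]                                 # hypothetical 4-pixel image
loss = training_loss(x0, zero_model, random.Random(0))
print(loss)
```

Note what is absent: the loss never compares against the clean image directly. The target is the noise sample, which the training code generated itself.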

Part 05 · Latent diffusion

Why Stable Diffusion
runs on your laptop.

The bottleneck of pixel-space diffusion.

Doing diffusion directly in pixel space is brutally expensive. A 512×512 RGB image has ~786,000 dimensions (512 × 512 × 3). The U-Net has to operate on that for 50+ steps. The original DALL-E 2 did this, and needed a data center to run.

The 2022 breakthrough (Rombach et al., "High-Resolution Image Synthesis with Latent Diffusion Models"): diffuse in a compressed latent space instead of pixel space. A pretrained autoencoder squashes images down to a 64×64×4 latent, about 48× smaller. Do all the diffusion there. Decode at the very end.

Suddenly: diffusion fits in 8GB of VRAM. Stable Diffusion runs on a consumer GPU. Image generation went from cloud-only to laptop-feasible overnight.
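
The compression figure is easy to check by hand. A quick arithmetic sketch, assuming the 512×512×3 image and 64×64×4 latent shapes listed in this part's pipeline figure; variable names are illustrative.

```python
# Dimensions the U-Net would have to process per step in each space
pixel_dims = 512 * 512 * 3      # RGB image
latent_dims = 64 * 64 * 4       # VAE latent

print(pixel_dims)                # 786432
print(latent_dims)               # 16384
print(pixel_dims / latent_dims)  # 48.0
print(512 // 64)                 # 8: spatial downsampling factor per side
```

Every one of the ~50 denoising steps runs on the smaller tensor, so the saving applies to the whole sampling loop, not just once.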

Crucially, this also gave us text conditioning: the text encoder (CLIP, or T5 in newer models) injects prompt embeddings into the U-Net via cross-attention at each step. The prompt "guides" the denoising in the latent space.

// latent diffusion pipeline
image · 512×512×3 · ~786k dims
→ VAE encode
→ latent · 64×64×4 · ~16k dims · 48× smaller
→ U-Net (noise pred) · diffusion happens here · 50× iterations · text (CLIP/T5) enters via cross-attention
→ VAE decode
→ image · 512×512×3 · final output
Part 06 · Hands on · Classifier-free guidance

How prompts steer
the denoising.

The model can run with or without the prompt. CFG is the trick of running it both ways and extrapolating in the direction the prompt points. CFG scale controls how aggressively. Below: feel the tradeoff.

The math.

At each denoising step: ε̂_guided = ε̂(uncond) + scale · (ε̂(cond) − ε̂(uncond)). When scale=1, you get standard conditional generation. When scale=0, the prompt is ignored. When scale > 1, you push the denoising more strongly in the prompt direction. Drag the slider to see the regimes.
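
The guidance formula is a one-liner once both predictions exist. A minimal sketch on plain lists, using made-up two-element noise predictions in place of real network outputs; `cfg_combine` is an illustrative name.

```python
def cfg_combine(eps_uncond, eps_cond, scale):
    """Classifier-free guidance: start at the unconditional guess, move toward the conditional one."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

eps_uncond = [0.0, 1.0]   # hypothetical prediction without the prompt
eps_cond = [1.0, 3.0]     # hypothetical prediction with the prompt

print(cfg_combine(eps_uncond, eps_cond, 0.0))   # [0.0, 1.0]  prompt ignored
print(cfg_combine(eps_uncond, eps_cond, 1.0))   # [1.0, 3.0]  plain conditional
print(cfg_combine(eps_uncond, eps_cond, 7.5))   # [7.5, 16.0] extrapolated past the conditional guess
```

In practice this means two network evaluations per denoising step, one with the prompt and one without, which is part of why guided sampling costs more than unguided sampling.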

// the prompt
"a serene mountain lake at golden hour, photorealistic"
CFG scale 7.0
// CFG 1.0 – 3.0
Creative drift

Model follows the prompt only loosely. Result: generally aesthetic but prone to drifting off-topic. Might be any landscape, or even no landscape. Used intentionally for "wild card" generation.

// CFG 5.0 – 9.0
Sweet spot

Model balances prompt-following with sample diversity. 7.5 is the default for Stable Diffusion. Mountains, lake, golden light: all present, all looking natural. A sensible starting point for most prompts.

// CFG 12.0 – 20.0
Burned / saturated

Model is forced to maximize prompt adherence at every step. Result: oversaturated colors, hard edges, "fried" look. Every pixel screams "mountain lake!" Often artifact-heavy. Use sparingly.

Part 07 · Diffusion in the wild

The models you've already
generated with.

Most major image generators since 2022 are diffusion models or close relatives. They differ in latent space size, text encoder, sampling algorithm, and post-training, but the core math is largely the same.

Stability AI · 2022

Stable Diffusion

// the open-source flagship
Architecture: Latent diffusion
U-Net size: ~860M params
Text encoder: CLIP (SD 1.5) → CLIP + T5 (SD3+)
Sampling: DDIM, DPM++, Euler-A
Runs on: consumer GPU (8GB+)

The model that made image generation accessible. Released open-weight in August 2022, it immediately spawned an entire ecosystem (LoRAs, ControlNets, ComfyUI). SDXL and SD3 are direct descendants; Flux comes from several of the same researchers.

OpenAI · 2023

DALL-E 3

// the prompt-fidelity leader
Architecture: Latent diffusion
Key innovation: Highly detailed prompt rewrites
Text encoder: Heavy (likely T5-XXL)
Training data: Re-captioned by a purpose-built image captioner
Runs on: OpenAI servers only

The leap from DALL-E 2 wasn't just a bigger model. It was training on synthetic captions written by a purpose-built image captioner instead of relying on the short, noisy captions scraped from the web. Captions became far more detailed, so the model learned much finer-grained text-to-image alignment.

Black Forest Labs · 2024

FLUX.1

// the new open-source frontier
Architecture: Rectified-flow transformer
Size: 12B params
Sampling: ~20 steps (flow matching)
Text encoder: T5-XXL + CLIP
Variants: [dev], [schnell], [pro]

From researchers who built the original Stable Diffusion models. Rectified flow is a successor to diffusion: same idea, straighter trajectories from noise to image. Took the open-weight quality crown in late 2024.

Midjourney · 2022+

Midjourney

// the aesthetic specialist
Architecture: Undisclosed (widely assumed to be latent diffusion)
Key edge: Post-training on aesthetic data
Versions: v1 → v6.1
Interface: Discord, web
Runs on: Midjourney servers

The aesthetics moat. Midjourney has not published its architecture, but it is widely assumed to be diffusion-based like its competitors, and tuned extensively on preferences for "looks good". The result: outputs that feel curated by default, where SD outputs feel raw. Closer to a personality-tuned model than a base one.

Part 08 · Knowledge check

Five questions on what
you just denoised.

Aim for 4/5. Wrong answers explain themselves.

Course 3 · Module 05 complete

Static is just
an image waiting.

You watched a real cosine-schedule forward diffusion dissolve an image into pure Gaussian noise. You watched DDIM sampling reverse it. You understand the brilliant inversion at the heart of it — predict the noise, not the image. You know why Stable Diffusion fits on a laptop (latent space) and how prompts steer denoising (CFG). The entire image-generation pipeline, demystified.

Up next · Course 3 · Module 06

Multimodal Models

This module covered image generation only. But what about models that handle text AND images AND audio AND video, all in one architecture? CLIP, LLaVA, Flamingo, GPT-4V, Gemini. The trick: encode every modality into a shared embedding space, then let cross-attention do the rest. Interactive: explore a multimodal embedding space.
