Diffusion models generate images by running a noise-adding process in reverse. Train a neural network to predict the noise in a noisy image. Then start from pure static and ask it to denoise, step by step, until something coherent emerges. The math is surprisingly simple. The results power Stable Diffusion, DALL-E 3, Midjourney, and Flux. This module unpacks how that hallucination-from-static actually works.
A diffusion model has a forward process you never run at inference, and a reverse process you always run. The forward process is fixed (no learning). The reverse process is the entire model.
Take a clean image. Add a tiny amount of Gaussian noise. Repeat ~1000 times. By step T, the image is pure Gaussian noise — every pixel independent of every other.
This process has no learnable parameters. It's just a fixed math operation. We do it during training, not inference. Its purpose: generate training examples of (clean image, noisy image, amount of noise added).
Train a neural network to predict the noise in any noisy image. At inference, start with pure noise and ask the network "what noise is in this?" Subtract a fraction of that prediction. Repeat.
After ~50-1000 steps, you have a coherent image. The network never sees the clean target during inference — it just gets really good at "this is what the noise looks like" → subtract → iterate.
Pick a subject. Drag the timestep slider. As t grows from 0 to 1000, more and more Gaussian noise gets blended in. The math uses the actual cosine schedule from Improved DDPM (Nichol & Dhariwal, 2021).
Each pixel at timestep t is √α̅_t · original + √(1−α̅_t) · gaussian_noise. At t=0 the image is untouched. At t=1000 the original is gone — you're looking at pure standard-normal noise. The noise pattern itself is deterministic per subject — you're literally watching the same noise sample get amplified.
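The per-pixel formula above can be sketched in a few lines. This is a minimal illustration, not the module's actual widget code: it assumes the continuous cosine formula with offset s = 0.008, uses plain Python lists in place of image tensors, and the names alpha_bar, q_sample, and the 4-pixel "image" are made up for the example. The paper's implementation also clips the per-step noise, so its final α̅ is small rather than essentially zero.

```python
import math
import random

T = 1000  # number of diffusion steps, as in the text

def alpha_bar(t, s=0.008):
    """Cumulative signal fraction under the cosine schedule, scaled so alpha_bar(0) == 1."""
    f = lambda u: math.cos((u / T + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(0)

def q_sample(x0, t, noise):
    """Closed-form forward process: sqrt(abar_t) * x0 + sqrt(1 - abar_t) * noise, per pixel."""
    a = alpha_bar(t)
    return [math.sqrt(a) * x + math.sqrt(1 - a) * n for x, n in zip(x0, noise)]

rng = random.Random(0)                  # fixed seed, so the same noise sample is reused at every t
x0 = [0.5, -0.25, 1.0, 0.0]             # hypothetical 4-pixel image scaled to [-1, 1]
noise = [rng.gauss(0, 1) for _ in x0]

x_start = q_sample(x0, 0, noise)        # t = 0: image untouched
x_mid = q_sample(x0, 500, noise)        # t = 500: roughly half signal, half noise (by variance)
x_end = q_sample(x0, T, noise)          # t = T: essentially the noise sample itself
```

Because the noise list is drawn once and reused, moving t only changes the blend weights, which is why the same noise pattern appears to get amplified as t grows.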
Now flip it. Start from pure Gaussian noise and run the noising sequence in reverse. Hit play to animate the 50-step DDIM sampling that many real models use. Every diffusion-based image generator does some version of this at inference.
At each step the model would predict the noise present at the current timestep, subtract a portion of it, and pass the result forward. We're simulating that here: starting at t=1000 (pure noise) and stepping down to t=0 in 50 steps. In a real model, every one of these steps is at least one forward pass through the denoising network (a U-Net in Stable Diffusion).
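The stepping loop described above can be sketched as a deterministic DDIM sampler. This is a toy under stated assumptions: oracle_eps is a stand-in for the trained network that cheats by knowing the target image, the small floor on alpha_bar is added here only to avoid dividing by nearly zero at t = 1000, and lists stand in for tensors. A real sampler would call a learned noise predictor instead.

```python
import math
import random

T, STEPS = 1000, 50

def alpha_bar(t, s=0.008, floor=1e-5):
    """Cosine schedule; the floor (an assumption of this sketch) keeps sqrt(abar) away from 0."""
    f = lambda u: math.cos((u / T + s) / (1 + s) * math.pi / 2) ** 2
    return max(f(t) / f(0), floor)

def oracle_eps(x_t, t, target):
    """Hypothetical perfect noise predictor: the exact noise implied by a known target."""
    a = alpha_bar(t)
    return [(x - math.sqrt(a) * y) / math.sqrt(1 - a) for x, y in zip(x_t, target)]

def ddim_sample(predict_eps, x_T, steps=STEPS):
    """Deterministic DDIM (eta = 0): one noise prediction per step, from t = T down to t = 0."""
    x = list(x_T)
    stride = T // steps
    for t in range(T, 0, -stride):
        a_t, a_prev = alpha_bar(t), alpha_bar(t - stride)
        eps = predict_eps(x, t)
        # Estimate the clean image implied by the current noise prediction...
        x0_hat = [(xi - math.sqrt(1 - a_t) * e) / math.sqrt(a_t) for xi, e in zip(x, eps)]
        # ...then re-noise that estimate to the previous (less noisy) timestep.
        x = [math.sqrt(a_prev) * x0 + math.sqrt(1 - a_prev) * e for x0, e in zip(x0_hat, eps)]
    return x

rng = random.Random(1)
image = [0.5, -0.25, 1.0, 0.0]                  # hypothetical target "image"
x_T = [rng.gauss(0, 1) for _ in image]          # start from pure Gaussian noise
calls = []

def predictor(x, t):
    calls.append(t)                             # count forward passes
    return oracle_eps(x, t, image)

sample = ddim_sample(predictor, x_T)
```

With a perfect predictor the loop lands on the target after 50 calls. A trained network is only approximately right at each step, which is why the number of steps and the choice of sampler affect output quality.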
Naïve approach: train a network to map noisy image → clean image. Doesn't work well. The mapping is one-to-many (many clean images could explain the same noisy patch), so the network learns blurry averages.
Smart approach: train the network to predict the noise itself. At training time we know exactly what noise was added (we generated it). So this is a clean supervised regression problem — predict ε from (x_t, t).
At inference, we use the predicted noise to subtract a fraction at a time. The cleverness: same network, every step, conditioned only on the current timestep t. The network learns "what does noise at timestep 500 look like?" and applies that knowledge stepwise.
One small architecture, used 50-1000 times, makes pictures. The training objective fits on one line: L = ‖ε − ε̂(x_t, t)‖². That's it. That's the whole thing.
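To make the one-line objective concrete, here is a minimal sketch of the loss for a single training example. It assumes the cosine schedule with offset 0.008 and uses plain lists in place of tensors. The names training_loss and untrained are hypothetical, and the "network" is any callable taking (x_t, t); the gradient update itself is left out.

```python
import math
import random

T = 1000

def alpha_bar(t, s=0.008):
    """Cosine schedule, scaled so alpha_bar(0) == 1."""
    f = lambda u: math.cos((u / T + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(0)

def training_loss(predict_eps, x0, rng):
    """Squared error between the noise we added and the noise the network predicts."""
    t = rng.randint(1, T)                        # random timestep for this example
    eps = [rng.gauss(0, 1) for _ in x0]          # the noise we add, so we know the answer
    a = alpha_bar(t)
    x_t = [math.sqrt(a) * x + math.sqrt(1 - a) * e for x, e in zip(x0, eps)]
    eps_hat = predict_eps(x_t, t)                # network sees only the noisy image and t
    return sum((e - eh) ** 2 for e, eh in zip(eps, eps_hat))

x0 = [0.5, -0.25, 1.0, 0.0]                      # hypothetical clean "image"
untrained = lambda x_t, t: [0.0] * len(x_t)      # stand-in network that always predicts zero noise
loss = training_loss(untrained, x0, random.Random(0))
```

Training amounts to repeating this over many images and timesteps while adjusting the network to push the loss down.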
Doing diffusion directly on pixel-space images is brutally expensive. A 512×512 RGB image has ~786,000 dimensions (512 × 512 × 3 channels). The U-Net has to operate on that for 50+ steps. The original DALL-E 2 diffused in pixel space and needed a data center to run.
The 2022 breakthrough (Rombach et al., "High-Resolution Image Synthesis with Latent Diffusion Models"): diffuse in a compressed latent space instead of pixel space. A pretrained autoencoder squashes a 512×512×3 image down to a 64×64×4 latent, about 48× smaller. Do all the diffusion there. Decode at the very end.
Suddenly: diffusion fits in 8GB of VRAM. Stable Diffusion runs on a consumer GPU. Image generation went from cloud-only to laptop-feasible overnight.
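The size reduction is easy to check with arithmetic. The sketch below assumes an 8× spatial downsampling factor and 4 latent channels, the commonly cited configuration for Stable Diffusion v1. Other latent diffusion models use different factors, so treat the numbers as one example.

```python
def num_values(height, width, channels):
    """Number of scalar values in an image-shaped tensor."""
    return height * width * channels

DOWNSAMPLE = 8          # assumed autoencoder downsampling factor per spatial axis
LATENT_CHANNELS = 4     # assumed number of latent channels

pixel_dims = num_values(512, 512, 3)
latent_dims = num_values(512 // DOWNSAMPLE, 512 // DOWNSAMPLE, LATENT_CHANNELS)
ratio = pixel_dims / latent_dims

print(pixel_dims, latent_dims, ratio)   # 786432 16384 48.0
```

Every denoising step works on the 16,384-value latent instead of the 786,432-value image, which is where most of the memory and compute savings come from.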
Crucially, this also gave us text conditioning: the text encoder (CLIP, or T5 in newer models) injects prompt embeddings into the U-Net via cross-attention at each step. The prompt "guides" the denoising in the latent space.
The model can run with or without the prompt. Classifier-free guidance (CFG) is the trick of running it both ways and extrapolating in the direction the prompt points. The CFG scale controls how aggressively. Below: feel the tradeoff.
At each denoising step: ε̂_guided = ε̂(uncond) + scale · (ε̂(cond) − ε̂(uncond)). When scale=1, you get standard conditional generation. When scale=0, the prompt is ignored. When scale > 1, you push the denoising more strongly in the prompt direction. Drag the slider to see the regimes.
Low scale: the model is barely listening to the prompt. Result: generally aesthetic but off-topic. Might be any landscape, or even no landscape. Used intentionally for "wild card" generation.
Balanced scale (around 7.5): the model balances prompt-following with sample diversity. 7.5 is a common default for Stable Diffusion. Mountains, lake, golden light: all present, all looking natural. A sensible starting point for most prompts.
High scale: the model is forced to maximize prompt adherence at every step. Result: oversaturated colors, hard edges, "fried" look. Every pixel screams "mountain lake!" Often artifact-heavy. Use sparingly.
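The guidance formula is a one-line blend of two noise predictions. A minimal sketch follows, with lists standing in for tensors. The function name cfg_combine and the toy prediction values are made up for illustration. In a real sampler the two inputs come from two forward passes of the same network, one with the prompt and one without.

```python
def cfg_combine(eps_uncond, eps_cond, scale):
    """Classifier-free guidance: eps_uncond + scale * (eps_cond - eps_uncond), elementwise."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

eps_uncond = [0.0, 1.0]     # hypothetical prediction without the prompt
eps_cond = [1.0, 3.0]       # hypothetical prediction with the prompt

ignored = cfg_combine(eps_uncond, eps_cond, 0.0)    # scale 0: prompt ignored
plain = cfg_combine(eps_uncond, eps_cond, 1.0)      # scale 1: standard conditional prediction
pushed = cfg_combine(eps_uncond, eps_cond, 7.5)     # scale > 1: extrapolate past the conditional
```

At scale 7.5 the result overshoots the conditional prediction along the line from unconditional to conditional, which is the "push" that the regimes above describe.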
Nearly every major image generator released since 2022 is a diffusion model or a close relative. They differ in latent space size, text encoder, sampling algorithm, and post-training, but the core math is the same.
Stable Diffusion: the model that made image generation accessible. Released open-weight in August 2022, it immediately spawned an entire ecosystem (LoRAs, ControlNets, ComfyUI). SDXL, SD3, and Flux are direct descendants.
DALL-E 3: the leap from DALL-E 2 wasn't a bigger model. It was training mostly on synthetic captions written by a dedicated image-captioning model instead of human-written ones. The captions were far more detailed, so the model learned much finer-grained text-to-image alignment.
Flux: from the team that built Stable Diffusion 1, 2, and XL. It uses rectified flow, a successor to diffusion: same idea, straighter trajectories from noise to image. Took the open-source quality crown in late 2024.
Midjourney: the aesthetics moat. Its architecture is not public, but it is widely understood to be a diffusion model like its competitors, tuned extensively on preferences for "looks good". The result: outputs that feel curated by default, where SD outputs feel raw. Closer to a personality-tuned model than a base one.
You watched a real cosine-schedule forward diffusion dissolve an image into pure Gaussian noise. You watched DDIM sampling reverse it. You understand the brilliant inversion at the heart of it — predict the noise, not the image. You know why Stable Diffusion fits on a laptop (latent space) and how prompts steer denoising (CFG). The entire image-generation pipeline, demystified.