Module 1 showed you where attention lives. This module shows you what it does — the actual matrix multiplications behind the famous formula. We'll dissect it step by step with real numbers updating in real time. By the end you'll be able to explain Q, K, V, the √d scaling, and the causal mask to someone else. The single most important equation in modern AI, finally not opaque.
Before the math, the intuition. Each token in attention produces three vectors via three different learned projection matrices. The names (Query, Key, Value) come from an information-retrieval analogy that actually maps cleanly onto what the math does.
Each token's Query represents what kind of information that token wants from other tokens.
Example: the token "it" might have a Query like "find a noun I could refer to."
Each token's Key represents what kind of information it provides to others who might query it.
Example: "cat" might have a Key like "I'm a noun, animate, singular subject."
Each token's Value is the content that gets passed along when another token attends to it.
Q and K decide where to look. V is what comes back from looking there.
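To make the three projections concrete, here is a minimal NumPy sketch. The sizes, the random embeddings, and the names `X`, `W_q`, `W_k`, `W_v` are illustrative assumptions; in a trained model the projection matrices are learned rather than random.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only, not taken from any real model.
n_tokens, d_model, d_head = 3, 4, 4

# One embedding per token (rows). Random placeholders for real embeddings.
X = rng.normal(size=(n_tokens, d_model))

# Three separate projection matrices. Learned in practice, random stand-ins here.
W_q = rng.normal(size=(d_model, d_head))
W_k = rng.normal(size=(d_model, d_head))
W_v = rng.normal(size=(d_model, d_head))

Q = X @ W_q  # what each token is looking for
K = X @ W_k  # what each token offers to others
V = X @ W_v  # the content each token passes along

# scores[i, j] measures how well token i's Query matches token j's Key.
scores = Q @ K.T
print(scores.shape)
```

Every token gets its own row in each of `Q`, `K`, and `V`, which is why a single matrix product compares all queries against all keys at once.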
A real attention computation with 3 tokens and a 4-dim head. Step through the 6 stages of the equation, see the matrices update with actual numbers at each step, and watch parts of the formula light up as we use them.
The example: 3 tokens ("The", "cat", "sat") with 4-dimensional embeddings. Each matrix below is computed with real arithmetic — you can verify any cell with a calculator. Click Next step → to advance through the equation. The formula bar at top tracks which part of the equation is active.
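For readers following along without the interactive view, the same kind of computation can be sketched in NumPy. The `Q`, `K`, and `V` values below are made-up numbers chosen so the arithmetic is easy to check by hand; they are not the numbers used in the walkthrough, and the helper name `attention_steps` is hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtracting the row max avoids overflow and does not change the result.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_steps(Q, K, V):
    d = Q.shape[-1]
    raw = Q @ K.T              # raw similarity scores, shape (n, n)
    scaled = raw / np.sqrt(d)  # divide by sqrt(d)
    weights = softmax(scaled)  # each row becomes a probability distribution
    out = weights @ V          # weighted average of the value vectors
    return raw, scaled, weights, out

# Hypothetical Q, K, V for "The", "cat", "sat" (3 tokens, 4 dimensions).
Q = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 2.0, 0.0, 1.0],
              [1.0, 1.0, 0.0, 0.0]])
K = np.array([[1.0, 0.0, 0.0, 1.0],
              [0.0, 1.0, 1.0, 0.0],
              [1.0, 1.0, 0.0, 0.0]])
V = np.array([[1.0, 2.0, 3.0, 4.0],
              [5.0, 6.0, 7.0, 8.0],
              [9.0, 10.0, 11.0, 12.0]])

raw, scaled, weights, out = attention_steps(Q, K, V)
print(raw)      # every cell is a dot product of one Q row with one K row
print(weights)  # rows sum to 1
print(out)
```

With these numbers the first query scores every key equally, so its attention weights are uniform and its output row is simply the average of the three value vectors.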
The paper that introduced scaled dot-product attention, "Attention Is All You Need", included one subtly important detail: dividing the scores by √d before the softmax. Skip this step and training becomes unreliable at large head dimensions. Below: see exactly what happens.
Imagine a single query attending to 8 keys, each a random vector of dimension d. As d grows, the raw dot products Q·K grow too: with unit-variance components their variance is ≈ d, so their typical magnitude scales like √d. Without scaling, softmax becomes nearly one-hot, gradients vanish, and the model fails to learn. Dividing by √d brings the variance back to ≈ 1. Pick a d below and watch the difference.
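The effect can be reproduced with a short simulation. This is a sketch under the assumption that query and key components are independent standard normal draws; the function names `dot_product_std` and `max_attention_weight` are hypothetical helpers, not part of any library.

```python
import numpy as np

def softmax(x):
    x = x - x.max()  # numerical stability
    e = np.exp(x)
    return e / e.sum()

def dot_product_std(d, n_samples=20000, seed=0):
    # Empirical standard deviation of q.k for random unit-variance vectors.
    rng = np.random.default_rng(seed)
    q = rng.normal(size=(n_samples, d))
    k = rng.normal(size=(n_samples, d))
    return (q * k).sum(axis=1).std()

def max_attention_weight(d, scale, n_keys=8, seed=0):
    # Largest softmax weight when one random query attends to n_keys random keys.
    rng = np.random.default_rng(seed)
    q = rng.normal(size=d)
    K = rng.normal(size=(n_keys, d))
    scores = K @ q
    if scale:
        scores = scores / np.sqrt(d)
    return softmax(scores).max()

for d in (4, 64, 1024):
    print(
        f"d={d:5d}  std(q.k)={dot_product_std(d):7.2f}  "
        f"max weight unscaled={max_attention_weight(d, scale=False):.3f}  "
        f"scaled={max_attention_weight(d, scale=True):.3f}"
    )
```

The standard deviation column should track √d, and for the same random draw the unscaled maximum weight is never smaller than the scaled one, because dividing by √d flattens the distribution.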
A decoder-only model (GPT, Claude, Llama) generates one token at a time. During training, we feed the whole sentence in at once for parallelism, but each position must only see itself and the tokens that come before it. Otherwise the model would just "look at the answer" while predicting it. The fix: a causal mask.
Right before the softmax, set every entry above the diagonal of the score matrix (the strict upper triangle) to -∞. After softmax, those positions become 0, meaning "don't attend to future tokens." Toggle below to see the effect on the same 5-token sequence.
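The masking step can be sketched in a few lines of NumPy. The all-zero score matrix is a placeholder assumption (every pair of tokens equally similar) so that the effect of the mask is easy to read off; `causal_attention_weights` is a hypothetical helper name.

```python
import numpy as np

def causal_attention_weights(scores):
    n = scores.shape[0]
    # True strictly above the diagonal: position i must not see any j > i.
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    masked = np.where(future, -np.inf, scores)
    # Row-wise softmax. exp(-inf) is 0, so future positions get zero weight.
    masked = masked - masked.max(axis=-1, keepdims=True)
    e = np.exp(masked)
    return e / e.sum(axis=-1, keepdims=True)

# Placeholder scores for a 5-token sequence.
scores = np.zeros((5, 5))
weights = causal_attention_weights(scores)
print(np.round(weights, 2))
```

With equal scores, the first token attends only to itself, the second splits its weight over two positions, and so on down to the last token, which spreads its weight over all five.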
Past the math, here's what to actually remember. These statements about attention recur in every paper, blog post, and engineering decision in modern AI.
Unlike convolutions (fixed filter shapes) or RNNs (fixed time-step routing), attention computes which connections to make on the fly, per input. Different inputs → different attention patterns.
This is why attention is so expressive: the "wiring" of the network changes with the data.
Computing QKᵀ produces an n × n matrix. For a 100k-token context, that's 10 billion attention scores. Per layer. Per head.
This quadratic scaling is the reason long context is hard, and why so much research goes into efficient attention (Flash Attention, sliding window, sparse). We cover this in Module 07.
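A quick back-of-the-envelope calculation shows how fast the score matrix grows. The 2 bytes per score (16-bit floats) and the assumption that the full n × n matrix is stored at once are illustrative choices; efficient implementations avoid materializing it.

```python
def attention_score_count(n_tokens):
    # One score per (query, key) pair.
    return n_tokens * n_tokens

def score_matrix_gib(n_tokens, bytes_per_score=2):
    # Memory for one full n x n score matrix, for one head in one layer.
    # bytes_per_score=2 assumes 16-bit floats (an assumption, not a requirement).
    return attention_score_count(n_tokens) * bytes_per_score / 2**30

for n in (1_000, 10_000, 100_000):
    print(f"n={n:7,d}  scores={attention_score_count(n):15,d}  "
          f"memory={score_matrix_gib(n):8.3f} GiB")
```

Making the context 10 times longer multiplies the number of scores by 100, which is the whole problem in one line.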
It's easy to miss: in self-attention, Q, K, and V all come from the same input. Each token plays both querier and queried. That's why it's called self-attention.
Cross-attention (used in encoder-decoder models, also in diffusion models) is different: Q comes from one source, K and V from another. Same equation, different inputs.
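The difference between the two comes down to which inputs feed the projections. In this sketch the hypothetical `attend` function takes one source for the queries and one for the keys and values; the random weights and the sequence lengths (3 and 5) are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(x_q, x_kv, W_q, W_k, W_v):
    Q = x_q @ W_q    # queries come from the first source
    K = x_kv @ W_k   # keys and values come from the second source
    V = x_kv @ W_v
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V

rng = np.random.default_rng(0)
d_model = 4
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

x = rng.normal(size=(3, d_model))       # e.g. 3 tokens on the querying side
memory = rng.normal(size=(5, d_model))  # e.g. 5 tokens from another source

self_out = attend(x, x, W_q, W_k, W_v)        # self-attention: same input twice
cross_out = attend(x, memory, W_q, W_k, W_v)  # cross-attention: two inputs
print(self_out.shape, cross_out.shape)
```

Both calls return one output row per query token, regardless of how many tokens supply the keys and values.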
Without the √d scaling, models with large head dimensions struggle to train: the softmax saturates, gradients vanish, and learning stalls. You saw this happen above.
This is one of those tiny details that is easy to overlook but critical. Every modern attention implementation includes it. You should too.
You can read softmax(QKᵀ/√d) · V and know what every symbol means. You can explain Q, K, V to someone else with a useful analogy. You felt why √d matters and how causal masking works. That's the entire mechanical foundation of every modern LLM — now unpacked.