Course 3 · Module 02 · 80 minutes

One equation.
That's the whole
mechanism.

Module 1 showed you where attention lives. This module shows you what it does — the actual matrix multiplications behind the famous formula. We'll dissect it step by step with real numbers updating in real time. By the end you'll be able to explain Q, K, V, the √d scaling, and the causal mask to someone else. The single most important equation in modern AI, finally not opaque.

You'll walk through

6 matrix steps

You'll see why

We divide by √d

You'll feel

Causal masking work

Crack it open

// The equation

Attention(Q,K,V) =
softmax(QKᵀ / √d) V

Q · queries

K · keys

V · values

d · head dim

Part 01 · The intuition

Three vectors per token.
Three jobs.

Before the math, the intuition. Each token in attention produces three vectors via three different learned projection matrices. The names — Query, Key, Value — come from an information-retrieval analogy that actually maps cleanly onto what the math does.

// Query

"What am I looking for?"

Each token's Query represents what kind of information that token wants from other tokens.

Example: the token "it" might have a Query like "find a noun I could refer to."

Q = X · W_Q

// Key

"What do I offer?"

Each token's Key represents what kind of information it provides to others who might query it.

Example: "cat" might have a Key like "I'm a noun, animate, singular subject."

K = X · W_K

// Value

"What do I actually share?"

Each token's Value is the content that gets passed along when another token attends to it.

Q and K decide where to look. V is what comes back from looking there.

V = X · W_V
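To make the three projections concrete, here is a minimal sketch in plain Python. The input X and the matrices W_Q, W_K, W_V are hypothetical, hand-picked values chosen so the results are easy to check by eye; in a real model all three matrices are learned.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

# Hypothetical input: 3 tokens, 4-dim embeddings, one row per token.
X = [
    [1.0, 0.0, 1.0, 0.0],  # "The"
    [0.0, 2.0, 0.0, 1.0],  # "cat"
    [1.0, 1.0, 0.0, 0.0],  # "sat"
]

# Hypothetical projection matrices (learned in a real model).
W_Q = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]  # identity
W_K = [[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]  # swaps dims 0<->1 and 2<->3
W_V = [[2, 0, 0, 0], [0, 2, 0, 0], [0, 0, 2, 0], [0, 0, 0, 2]]  # doubles every dim

Q = matmul(X, W_Q)  # same shape as X: one query per token
K = matmul(X, W_K)  # one key per token
V = matmul(X, W_V)  # one value per token

print(K[0])  # [0.0, 1.0, 0.0, 1.0]
print(V[1])  # [0.0, 4.0, 0.0, 2.0]
```

The point to notice: the same X goes in three times, and three different matrices give each token three different views of itself.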

The library analogy

Imagine each token is a person in a library. Their Query is the search slip they fill out. Other tokens' Keys are the index cards on every shelf. The token compares its query to every key — finding the strongest matches — then pulls down those shelves' Values and combines them. The output is a weighted summary of everything that matched. Every token does this for every other token. Simultaneously.

Part 02 · Hands on · Watch the math happen

Walk through the equation.
One step at a time.

A real attention computation with 3 tokens and a 4-dim head. Step through the 6 stages of the equation, see the matrices update with actual numbers at each step, and watch parts of the formula light up as we use them.

How to use it.

The example: 3 tokens ("The", "cat", "sat") with 4-dimensional embeddings. Each matrix below is computed with real arithmetic — you can verify any cell with a calculator. Click Next step → to advance through the equation. The formula bar at top tracks which part of the equation is active.

// Where we are in the equation

Attention(Q,K,V) = softmax(QKᵀ / √d) V

Step 1 of 6
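If you want to replay the walk-through offline, the sketch below runs the whole equation in plain Python. Two assumptions to flag: the text does not list the six stages, so the split used here (input, project, score, scale, softmax, weighted sum) is a guess at them, and the numbers are hypothetical rather than the ones shown on the page. Projections are taken as the identity so that Q = K = V = X and every cell stays calculator-friendly.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

# Stage 1: the input, one row per token (hypothetical values).
X = [
    [1.0, 0.0, 1.0, 0.0],  # "The"
    [0.0, 2.0, 0.0, 1.0],  # "cat"
    [1.0, 1.0, 0.0, 0.0],  # "sat"
]

# Stage 2: project to Q, K, V. Identity projections are assumed here.
Q, K, V = X, X, X

# Stage 3: raw scores QK^T. Row i holds token i's query against every key.
K_T = [list(col) for col in zip(*K)]
scores = matmul(Q, K_T)

# Stage 4: scale by sqrt(d). Here d = 4, so every score is divided by 2.
d = len(Q[0])
scaled = [[s / math.sqrt(d) for s in row] for row in scores]

# Stage 5: softmax each row into attention weights that sum to 1.
weights = [softmax(row) for row in scaled]

# Stage 6: each output row is a weighted sum of the value rows.
output = matmul(weights, V)

print(scores)  # [[2.0, 0.0, 1.0], [0.0, 5.0, 2.0], [1.0, 2.0, 2.0]]
print(scaled)  # [[1.0, 0.0, 0.5], [0.0, 2.5, 1.0], [0.5, 1.0, 1.0]]
```

The output has the same shape as the input (3 tokens by 4 dims), which is what lets attention layers stack.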

Part 03 · Hands on · The √d explainer

Why divide by √d?
Because softmax saturates.

The paper that introduced scaled dot-product attention ("Attention Is All You Need") included one subtly important detail: dividing the scores by √d before the softmax. Skip this step and training becomes unstable at large head dimensions. Below: see exactly what happens.

The setup.

Imagine a single query attending to 8 keys, each a random vector of dimension d. As d grows, the raw dot products Q·K grow in magnitude: with unit-variance components their variance is ≈ d, so their typical size is ≈ √d. Without scaling, the softmax approaches one-hot, gradients vanish, and the model struggles to learn. Pick a d below and watch the difference.
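The same experiment can be sketched in a few lines of plain Python. The helper names (dot_variance, max_prob), the seed, and the trial count are all illustrative choices, and the components are assumed to be independent standard normals, which is a simplification of what real activations look like.

```python
import math
import random

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def dot_variance(d, trials=2000, seed=0):
    """Empirical variance of q.k when all components are standard normal."""
    rng = random.Random(seed)
    dots = [sum(rng.gauss(0, 1) * rng.gauss(0, 1) for _ in range(d))
            for _ in range(trials)]
    mean = sum(dots) / trials
    return sum((x - mean) ** 2 for x in dots) / trials

def max_prob(d, scale, n_keys=8, seed=0):
    """Largest attention weight for one random query against n_keys random keys."""
    rng = random.Random(seed)
    q = [rng.gauss(0, 1) for _ in range(d)]
    keys = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n_keys)]
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    if scale:
        scores = [s / math.sqrt(d) for s in scores]
    return max(softmax(scores))

for d in (4, 64, 256):
    print(d, round(dot_variance(d), 1), max_prob(d, scale=False), max_prob(d, scale=True))
```

Expect the variance column to track d, and the unscaled max probability to climb toward 1 as d grows while the scaled one stays spread out. The unscaled value can never be lower than the scaled one, because dividing by √d only flattens the softmax.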

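[Interactive comparison: pick d (example shown: √d = 4.00) and compare the max probability and entropy of two softmax outputs side by side. Left panel, without √d scaling: softmax(Q · Kᵀ), showing what goes wrong. Right panel, with √d scaling: softmax(Q · Kᵀ / √d), showing what stays right.]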

Part 04 · Hands on · The causal mask

How an LLM
doesn't cheat.

A decoder-only model (GPT, Claude, Llama) generates one token at a time. During training, we feed the whole sentence in at once for parallelism, but each position may only see itself and the tokens that come before it. Otherwise the model would just "look at the answer" while predicting the next token. The fix: a causal mask.

The mechanic.

Right before the softmax, set every score above the diagonal (the strict upper triangle of the score matrix) to -∞. After softmax, those positions become 0, meaning "don't attend to future tokens." Toggle below to see the effect on a 5-token sequence.

Decoder-only mode · each token only sees itself and the tokens before it · this is how every chat LLM is trained

Raw scores

// before mask

Q · Kᵀ / √d — every position knows about every position

+ Mask applied

// upper triangle → -∞

Future positions get a score of -∞, which means they'll be ignored by softmax

After softmax

// final attention weights

Each row sums to 1 over only its visible positions
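The three panels above can be sketched in plain Python. The 5x5 score matrix here is a hypothetical stand-in for Q · Kᵀ / √d, and causal_mask is an illustrative helper name, not a library function.

```python
import math

def causal_mask(scores):
    """Keep entries on or below the diagonal; set entries above it to -inf."""
    n = len(scores)
    return [[scores[i][j] if j <= i else -math.inf for j in range(n)]
            for i in range(n)]

def softmax(row):
    """Numerically stable softmax; math.exp(-inf) evaluates to 0.0."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scaled scores for a 5-token sequence (row = query, column = key).
scores = [[0.5 * (i + j) for j in range(5)] for i in range(5)]

masked = causal_mask(scores)                 # upper triangle -> -inf
weights = [softmax(row) for row in masked]   # final attention weights

print(weights[0])  # [1.0, 0.0, 0.0, 0.0, 0.0]
```

The first token can only attend to itself, so its row is exactly one-hot. Every later row spreads its weight over itself and the positions before it, and still sums to 1.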

Part 05 · Four takeaways

Four insights
worth memorizing.

Past the math, here's what to actually remember. These four statements about attention recur across papers, blog posts, and engineering decisions in modern AI.

// Takeaway 01

Attention is dynamic routing.

Unlike convolutions (fixed filter shapes) or RNNs (fixed time-step routing), attention computes which connections to make on the fly, per input. Different inputs → different attention patterns.

This is why attention is so expressive: the "wiring" of the network changes with the data.

// Self-organizing wiring per input.

// Takeaway 02

Cost is O(n²) in sequence length.

Computing QKᵀ produces an n × n matrix. For a 100k-token context, that's 10 billion attention scores. Per layer. Per head.

This quadratic scaling is the reason long context is hard, and why so much research goes into making attention cheaper (FlashAttention's memory-efficient kernels, sliding-window attention, sparse attention). We cover this in Module 07.
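The arithmetic is worth doing once by hand. A quick sketch, assuming each score is stored as a 2-byte fp16 value and the full n × n matrix is actually materialized (fused kernels such as FlashAttention avoid storing it, so treat the memory figure as a worst case):

```python
def attention_score_count(n_tokens):
    """Number of entries in the n x n score matrix QK^T."""
    return n_tokens * n_tokens

n = 100_000
scores = attention_score_count(n)
bytes_fp16 = scores * 2        # assumption: 2 bytes per score (fp16)
gib = bytes_fp16 / 2**30

print(scores)         # 10000000000
print(round(gib, 1))  # 18.6

# Doubling the context quadruples the work.
print(attention_score_count(2 * n) // scores)  # 4
```

That is roughly 18.6 GiB for one head in one layer, which is why nobody stores that matrix naively at long context.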

// The price of looking everywhere.

// Takeaway 03

Q, K, and V are projections of the same X.

It's easy to miss: in self-attention, Q, K, and V all come from the same input. Each token plays both querier and queried. That's why it's called self-attention.

Cross-attention (used in encoder-decoder models, also in diffusion models) is different: Q comes from one source, K and V from another. Same equation, different inputs.

// Self-attention vs cross-attention is just where Q comes from versus where K and V come from.
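"Same equation, different inputs" is easy to show in code. In this plain-Python sketch, attend is an illustrative helper, memory stands in for a hypothetical second source (for example an encoder's output), and the projections are identity matrices purely to keep the example short.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attend(x_query, x_context, W_Q, W_K, W_V):
    """Q comes from x_query; K and V both come from x_context."""
    Q = matmul(x_query, W_Q)
    K = matmul(x_context, W_K)
    V = matmul(x_context, W_V)
    d = len(Q[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, V), weights

# Identity projections, assumed only for brevity.
I4 = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

X = [[1.0, 0.0, 1.0, 0.0], [0.0, 2.0, 0.0, 1.0], [1.0, 1.0, 0.0, 0.0]]  # 3 tokens
memory = [[0.0, 1.0, 1.0, 0.0], [2.0, 0.0, 0.0, 1.0]]                   # 2 tokens

self_out, self_w = attend(X, X, I4, I4, I4)         # self-attention: 3 x 3 weights
cross_out, cross_w = attend(X, memory, I4, I4, I4)  # cross-attention: 3 x 2 weights
```

The only thing that changed between the two calls is the second argument. The weight matrix goes from 3 × 3 to 3 × 2, but the output still has one row per querying token.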

// Takeaway 04

√d scaling is not optional.

Without it, models with large head dimensions train poorly or not at all. The softmax saturates, gradients vanish, and learning stalls. You saw this happen above.

This is one of those tiny, easy-to-overlook details that turns out to be critical. Every modern attention implementation includes it. Yours should too.
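For the "gradients vanish" step, a short worked equation helps. This is the standard softmax derivative, written with p = softmax(z) for one row of scores z; the symbols p, z, and δ are introduced here for illustration and do not appear elsewhere in the module.

```latex
\frac{\partial p_i}{\partial z_j} = p_i \,(\delta_{ij} - p_j),
\qquad
\delta_{ij} =
\begin{cases}
1 & \text{if } i = j \\
0 & \text{otherwise}
\end{cases}
```

When the softmax saturates toward one-hot, one entry p_k approaches 1 and the rest approach 0. The diagonal term p_k(1 - p_k) then goes to 0, and every other term carries a factor p_i that is already near 0, so the whole derivative shrinks toward zero. Dividing by √d keeps the scores in a range where p stays spread out and these terms stay usefully large.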

// One line of code. The model lives or dies by it.

Course 3 · Module 02 complete

The equation is now
yours.

You can read softmax(QKᵀ/√d) · V and know what every symbol means. You can explain Q, K, V to someone else with a useful analogy. You felt why √d matters and how causal masking works. That's the entire mechanical foundation of every modern LLM — now unpacked.

Up next · Course 3 · Module 03

From Pre-training to RLHF

The architecture is settled. Now: how do you actually train one of these things to be useful? Pre-training objectives, instruction tuning, RLHF, DPO, Constitutional AI. Why a raw pre-trained model is not what you talk to — and what the post-training pipeline does to turn it into Claude or ChatGPT.

Continue to Module 03
