In 2017, a Google paper with a deceptively simple title, "Attention Is All You Need," proposed an architecture called the Transformer. By 2020 it had displaced recurrent networks as the default approach to language modeling. By 2023 it powered GPT-4, Claude, Gemini, and most code assistants, and it had spread into image generation as well. This module covers what the transformer actually is, gear by gear and beam by beam, so you can reason about what comes next.
Before transformers, sequence modeling used RNNs and LSTMs: process one word at a time, carry state forward. It worked but couldn't scale, because each token had to wait for the previous one. Transformers replaced this with attention, which lets every token look at every other token in parallel. That single change made training parallel across the whole sequence, and that is what made scaling up to models like GPT-4 practical.
The original transformer was designed for machine translation. But the architecture turned out to be obscenely general. The same blueprint now handles language, code, images, audio, protein folding, robot control — anything that can be tokenized into a sequence. We're going to dissect that blueprint next.
Below is the actual transformer architecture, simplified to its functional core. Click any colored block to learn what happens inside it. Then hit "Run inference" to watch data flow through, layer by layer.
Read it top to bottom: tokens come in, get embedded, run through N identical transformer blocks (each = attention + feed-forward), then the output head predicts the next token. Most modern LLMs differ only in scale — same skeleton, more layers, wider matrices, more attention heads.
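To make the top-to-bottom reading concrete, here is a minimal sketch of that skeleton in plain Python. It is an illustration under loud assumptions, not how any production model is written: the weights are random, the names (`tiny_transformer`, `init_params`, and the helpers) are made up for this example, and positional information and LayerNorm are left out to keep the flow visible.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Turn a list of scores into probabilities that sum to 1."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def add(a, b):
    """Element-wise sum of two equally shaped matrices (the residual add)."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def causal_attention(x, wq, wk, wv):
    """Single-head attention where token i sees only tokens 0..i."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    d = len(q[0])
    out = []
    for i, qi in enumerate(q):
        scores = [sum(a * b for a, b in zip(qi, k[j])) / math.sqrt(d) for j in range(i + 1)]
        w = softmax(scores)
        out.append([sum(w[j] * v[j][c] for j in range(i + 1)) for c in range(d)])
    return out

def feed_forward(x, w1, w2):
    """Position-wise feed-forward layer with a ReLU in the middle."""
    hidden = [[max(0.0, h) for h in row] for row in matmul(x, w1)]
    return matmul(hidden, w2)

def tiny_transformer(token_ids, params):
    x = [params["embed"][t] for t in token_ids]           # tokens come in, get embedded
    for block in params["blocks"]:                         # N identical blocks
        x = add(x, causal_attention(x, *block["attn"]))    # attention sub-layer
        x = add(x, feed_forward(x, *block["ffn"]))         # feed-forward sub-layer
    logits = matmul([x[-1]], params["head"])[0]            # output head on the last position
    return softmax(logits)                                 # distribution over the next token

def init_params(vocab=10, d_model=8, n_blocks=2, seed=0):
    """Random weights, purely for illustration (hypothetical sizes)."""
    rng = random.Random(seed)
    def rand(rows, cols):
        return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]
    blocks = [{"attn": [rand(d_model, d_model) for _ in range(3)],
               "ffn": [rand(d_model, 4 * d_model), rand(4 * d_model, d_model)]}
              for _ in range(n_blocks)]
    return {"embed": rand(vocab, d_model), "blocks": blocks, "head": rand(d_model, vocab)}
```

Calling `tiny_transformer([1, 2, 3], init_params())` returns one probability per vocabulary entry. Scaling up means raising `n_blocks` and `d_model` (and, in a real model, the number of heads); the loop itself stays the same.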
A transformer doesn't have one attention mechanism — it has many running in parallel, each looking for different patterns. Below: type a sentence (or pick a preset), then switch between four representative attention "heads." You'll see each one paying attention to completely different things. This is the moment the architecture stops being abstract.
The heatmap is an N×N grid where N = number of tokens. Row i, column j shows how much token i attends to token j. Brighter = stronger attention. Each head has been hand-tuned to mimic patterns real transformers actually learn — local context, look-back, anchor tokens, and content-based linking.
The transformer paper combined four specific ideas. Each was clever on its own. Together they made everything else possible.
What self-attention does: for each token, compute a weighted sum over all tokens in the sequence (itself included), where the weights depend on how related each pair is. No recurrence, no time-stepping: every position attends to every position through a handful of matrix multiplications.
Why it changed everything: for the first time, a model could see "this word here in the question" and "that word there at the start of the document" at the same time, with equal effort. Distance stopped mattering.
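The weighted sum can be sketched in a few lines. This is a minimal version using plain Python lists, assuming the query, key, and value vectors have already been produced by the learned projections (those are omitted here). The function name `self_attention` is just for this example; the division by the square root of the key dimension follows the scaled dot-product form from the 2017 paper.

```python
import math

def softmax(row):
    """Turn a list of scores into probabilities that sum to 1."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(q, k, v):
    """Scaled dot-product attention over lists of vectors.

    Returns (output, weights). weights[i][j] is how much token i
    attends to token j, i.e. one N x N grid of attention scores.
    """
    d_k = len(k[0])
    weights = []
    for qi in q:
        # Compare this token's query with every token's key.
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d_k) for kj in k]
        weights.append(softmax(scores))
    # Each output row is a weighted sum of all value vectors.
    output = [[sum(w * vj[c] for w, vj in zip(w_row, v)) for c in range(len(v[0]))]
              for w_row in weights]
    return output, weights
```

Every row of `weights` sums to 1, and nothing in the loop depends on how far apart tokens i and j are, which is the "distance stopped mattering" point in code form.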
What multi-head attention does: instead of one big attention computation, split it into h smaller ones (typically 8 to 64), each with its own Q/K/V projections. Each head can specialize in a different pattern: syntactic, positional, semantic, coreference, and so on.
Why it works: a single head produces just one set of attention weights per token, so it has to blend competing patterns into one distribution. Multi-head lets the model attend to grammatical structure AND content links AND positional cues all at once. You just saw this in the visualizer above.
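Here is a rough sketch of the splitting step, again in plain Python. It assumes the inputs are already projected and simply slices each vector into `n_heads` equal chunks, which is one common way to implement per-head projections. The final output projection that real models apply after concatenation is left out, and all function names are hypothetical.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, k, v):
    """Scaled dot-product attention for one head."""
    d_k = len(k[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d_k) for kj in k]
        w = softmax(scores)
        out.append([sum(wj * vj[c] for wj, vj in zip(w, v)) for c in range(len(v[0]))])
    return out

def split_heads(x, n_heads):
    """Slice every token vector into n_heads equal chunks, one list per head."""
    d_model = len(x[0])
    if d_model % n_heads != 0:
        raise ValueError("d_model must be divisible by n_heads")
    d_head = d_model // n_heads
    return [[row[h * d_head:(h + 1) * d_head] for row in x] for h in range(n_heads)]

def multi_head_attention(q, k, v, n_heads):
    """Run attention independently per head, then concatenate per token."""
    heads = [attend(qh, kh, vh)
             for qh, kh, vh in zip(split_heads(q, n_heads),
                                   split_heads(k, n_heads),
                                   split_heads(v, n_heads))]
    return [sum((head[i] for head in heads), []) for i in range(len(q))]
```

Because each head computes its own softmax over its own slice, two heads can put their weight on completely different tokens for the same query position.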
The problem positional encoding solves: attention has no built-in concept of order. Without position information, "the dog bit the man" and "the man bit the dog" are the same bag of tokens to attention, so order has to be injected somehow.
The fix: add a position-dependent vector to each token embedding. The original paper used sinusoidal functions. Many modern models instead use rotary position embeddings (RoPE), which rotate the query and key vectors by a position-dependent angle rather than adding a vector, and which tend to extrapolate better to long contexts.
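The sinusoidal variant is short enough to write out. This sketch follows the formula from the original paper (sine on even dimensions, cosine on odd ones, with wavelengths set by a base of 10000). The function names are chosen for this example, and RoPE is not shown.

```python
import math

def sinusoidal_positions(n_positions, d_model):
    """Build a table with one position vector per row."""
    table = []
    for pos in range(n_positions):
        row = []
        for i in range(0, d_model, 2):
            # i is the even dimension index; each pair shares one frequency.
            angle = pos / (10000 ** (i / d_model))
            row.append(math.sin(angle))
            row.append(math.cos(angle))
        table.append(row[:d_model])  # trim in case d_model is odd
    return table

def add_positions(embeddings):
    """Add the position vector for slot p to the embedding in slot p."""
    table = sinusoidal_positions(len(embeddings), len(embeddings[0]))
    return [[e + p for e, p in zip(emb, pos)] for emb, pos in zip(embeddings, table)]
```

After `add_positions`, the vector for "dog" in slot 1 is no longer equal to the vector for "dog" in slot 4, so attention can tell the two sentences apart.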
What residual connections do: the output of every sub-layer is added back to its input, so the model can effectively "skip" any layer that doesn't help. They are combined with LayerNorm to keep activations stable.
Why it's essential: without residuals, very deep networks are extremely hard to train, because the gradient shrinks as it passes back through each layer. With them, the gradient has a direct path through the additions, and networks with hundreds of layers become trainable. This is what enables the depth (and therefore much of the capability) we see in modern LLMs.
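A small sketch of the add-back pattern for a single token vector. It uses the "pre-norm" arrangement (normalize, run the sub-layer, then add), which is one common layout; the 2017 paper instead normalized after the addition. The learnable scale and shift that LayerNorm normally carries are omitted, and `residual_block` is a name invented for this example.

```python
import math

def layer_norm(x, eps=1e-5):
    """Rescale one vector to roughly zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def residual_block(x, sublayer):
    """Return x + sublayer(layer_norm(x)).

    If the sub-layer outputs zeros, the input passes through unchanged,
    which is the "skip" behaviour described above.
    """
    update = sublayer(layer_norm(x))
    return [a + b for a, b in zip(x, update)]
```

For example, `residual_block(x, lambda h: [0.0] * len(h))` hands back `x` untouched, while a sub-layer that learns something useful only has to output the correction to add on top.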
The original transformer paper proposed an encoder-decoder design (for translation). Researchers quickly realized you could keep just one half. That choice shaped what kind of model you got.
Encoder-only models (BERT is the best-known example) keep only the encoder stack. Each token can see all other tokens (no causal mask). They are trained via masked language modeling: predict the hidden word from context.
They are not built to generate text token by token. They excel at understanding: classification, embeddings, named entity recognition, semantic search.
Decoder-only models keep only the decoder. They use a causal mask: each token can attend only to itself and the tokens before it. They are trained via next-token prediction: given a prefix, predict the next token, then repeat autoregressively.
This is the dominant architecture for modern LLMs, with the same blueprint at every scale from 1B to 1T+ parameters. Nearly everything you chat with today is decoder-only.
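The causal mask itself is a small piece of code. In this sketch the mask is a grid of booleans, and masked positions get a score of negative infinity before the softmax so their weight comes out as exactly zero. The function names are illustrative.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when token i is allowed to attend to token j."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask_row):
    """Softmax over one row of scores, ignoring positions where the mask is False."""
    masked = [s if keep else -math.inf for s, keep in zip(scores, mask_row)]
    m = max(masked)  # finite, because a token can always see itself
    exps = [math.exp(s - m) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]
```

With three tokens, row 1 of the mask is `[True, True, False]`, so the second token splits its attention between the first token and itself and gives the third token no weight at all. That is what lets the model train on every prefix of a sequence at once without peeking at the answer.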
Encoder-decoder models keep both halves. The encoder processes the input bidirectionally; the decoder generates output one token at a time while attending to the encoder's output. This is the original 2017 design.
They are excellent for sequence-to-sequence tasks where you have a clean input and need to produce a different output: translation, summarization, speech-to-text.
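The "attending to the encoder" step is usually called cross-attention: queries come from the decoder, while keys and values come from the encoder. A minimal sketch, assuming the states are already projected and using made-up names:

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(decoder_states, encoder_states):
    """Each decoder token takes a weighted sum over all encoder tokens.

    Returns (output, weights), where weights has one row per decoder
    token and one column per encoder token.
    """
    d_k = len(encoder_states[0])
    weights = []
    for q in decoder_states:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in encoder_states]
        weights.append(softmax(scores))
    output = [[sum(w * e[c] for w, e in zip(w_row, encoder_states)) for c in range(d_k)]
              for w_row in weights]
    return output, weights
```

No causal mask is needed here, because the whole input is known before generation starts. The mask applies only to the decoder's attention over its own earlier outputs.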
You can name every block in a transformer. You felt how attention heads specialize. You know why decoder-only models dominate chat. You understand parallelism as the actual key innovation — not attention by itself. That's the architectural foundation for everything else in Course 3.