Course 3 · Module 01 · 90 minutes

The architecture
that ate
the world.

In 2017, a paper from Google with a deceptively simple title, Attention Is All You Need, proposed an architecture called the Transformer. By 2020 it had displaced recurrent networks as the default for language modeling. By 2023 it powered GPT-4, Claude, Gemini, and nearly every code assistant, and it was spreading through image generation as well. This module covers what the transformer actually is, gear by gear and beam by beam, so you can reason about what comes next.

You'll explore: The architecture, live

You'll see: Attention patterns emerge

You'll grasp: Why it scaled


Part 01 · The before-and-after

What changed in 2017
was parallelism.

Before transformers, sequence modeling relied on RNNs and LSTMs, which process one token at a time and carry a hidden state forward. That worked but scaled poorly: each step had to wait for the previous one, so training could not be parallelized across a sequence. Transformers replaced recurrence with attention, which lets every token look at every other token in parallel. That single change made it practical to train the far larger models whose scaling behavior led to GPT-4.
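A minimal sketch of that contrast, in dependency-free Python. The update rule in `rnn_states`, the weights, and both function names are illustrative assumptions, not a real LSTM. The point is only that each recurrent state needs the previous one, while each attention-style score depends on just two inputs and so could be computed in any order, or all at once.

```python
def rnn_states(xs, w_in=0.5, w_rec=0.9):
    """Toy recurrence: state t cannot be computed before state t-1."""
    h = 0.0
    states = []
    for x in xs:  # inherently sequential: each step reads the previous h
        h = w_rec * h + w_in * x
        states.append(h)
    return states


def pairwise_scores(xs):
    """Toy attention-style scores: entry (i, j) uses only xs[i] and xs[j]."""
    # No entry depends on any other entry, so hardware can fill the grid in parallel.
    return [[xi * xj for xj in xs] for xi in xs]


xs = [1.0, 2.0, 3.0]
print(rnn_states(xs))
print(pairwise_scores(xs))
```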

// Before · RNN / LSTM

Sequential processing.

Words per pass: One

Parallelism: ~Zero

Long-range links: Lossy

GPU usage: Underutilized

Scaling ceiling: ~1B params

// Reads left-to-right, loses information across long sentences, underuses modern GPUs

→

// After · Transformer

Fully parallel.

Words per pass: All of them

Parallelism: Maximal

Long-range links: Direct

GPU usage: Saturated

Scaling ceiling: 2T+ params

// Every token sees every other token at once · scales to thousands of GPUs

// The paper that started everything

"Attention Is All You Need"

Vaswani et al. · NeurIPS 2017 · ~150,000 citations and counting

The original transformer was designed for machine translation. But the architecture turned out to be obscenely general. The same blueprint now handles language, code, images, audio, protein folding, robot control — anything that can be tokenized into a sequence. We're going to dissect that blueprint next.

Part 02 · Hands on · The blueprint

Click any block.
See what it does.

Below is the actual transformer architecture, simplified to its functional core. Click any colored block to learn what happens inside it. Then hit "Run inference" to watch data flow through, layer by layer.

The blueprint.

Read it top to bottom: tokens come in, get embedded, run through N identical transformer blocks (each = attention + feed-forward), then the output head predicts the next token. Most modern LLMs differ only in scale — same skeleton, more layers, wider matrices, more attention heads.
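The top-to-bottom flow can be sketched as a short pipeline. Everything here is a hypothetical stand-in chosen only to make the skeleton runnable: `toy_embed`, `toy_block`, and `toy_output_head` are not real model components, and a real transformer would use learned weights in each of them.

```python
def toy_embed(token_id, width=4):
    # Hypothetical embedding; a real model looks up a learned vector here.
    return [float(token_id)] * width


def toy_block(x):
    # Stand-in for attention + feed-forward: mix in the average of all
    # token vectors, then add the result back onto the input.
    mean = [sum(col) / len(x) for col in zip(*x)]
    return [[v + 0.1 * m for v, m in zip(row, mean)] for row in x]


def toy_output_head(vec, vocab_size=5):
    # Hypothetical scoring rule: one score per vocabulary entry.
    centre = sum(vec) / len(vec)
    return [-abs(centre - k) for k in range(vocab_size)]


def forward(token_ids, n_blocks=2):
    x = [toy_embed(t) for t in token_ids]  # tokens come in, get embedded
    for _ in range(n_blocks):              # N blocks with the same shape
        x = toy_block(x)
    return toy_output_head(x[-1])          # scores for the next token


scores = forward([1, 2, 3])
print(scores.index(max(scores)))  # index of the highest-scoring toy token
```

Scaling a real model up changes the width of the vectors and the number of blocks, not the shape of this loop.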

Part 03 · Hands on · Attention patterns

Each head learns
a different job.

A transformer doesn't have one attention mechanism — it has many running in parallel, each looking for different patterns. Below: type a sentence (or pick a preset), then switch between four representative attention "heads." You'll see each one paying attention to completely different things. This is the moment the architecture stops being abstract.

How to read it.

The heatmap is an N×N grid where N = number of tokens. Row i, column j shows how much token i attends to token j. Brighter = stronger attention. Each head has been hand-tuned to mimic patterns real transformers actually learn — local context, look-back, anchor tokens, and content-based linking.
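As a rough illustration of how such a grid could be built by hand, here is a sketch of a "local context" head. The scoring rule (negative distance between positions) and the names `local_context_head` and `sharpness` are assumptions made for this example, not the visualizer's actual code.

```python
import math


def softmax(row):
    # Subtract the max for numerical stability before exponentiating.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def local_context_head(n_tokens, sharpness=1.0):
    """Hand-tuned pattern: token i prefers tokens j that sit close to it."""
    return [
        softmax([-sharpness * abs(i - j) for j in range(n_tokens)])
        for i in range(n_tokens)
    ]


tokens = ["the", "cat", "sat", "down"]
grid = local_context_head(len(tokens))
for tok, row in zip(tokens, grid):
    # Row i, column j: how much token i attends to token j.
    print(f"{tok:>5}", " ".join(f"{w:.2f}" for w in row))
```

Each row sums to 1, and the largest value in each row sits on the diagonal, which is the bright band you would expect for a local head.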

Part 04 · Why it works

Four innovations.
That's the whole trick.

The transformer paper combined four specific ideas. Each was clever on its own. Together they made everything else possible.

Innovation 01

Self-attention

softmax(Q · Kᵀ / √d) · V

What it does: for each token, compute a weighted sum over the value vectors of all tokens (itself included), where the weights depend on how related each pair is. No recurrence, no time-stepping: every position can attend to every position in a couple of matrix multiplications.

Why it changed everything: for the first time, a model could see "this word here in the question" and "that word there at the start of the document" at the same time, with equal effort. Distance stopped mattering.

// Distance became free.
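The formula for this innovation can be written out directly. This is a minimal sketch in plain Python lists rather than a tensor library, assuming Q, K, and V are already given as lists of row vectors. A real model would first produce them with learned projections.

```python
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(K[0])
    out = []
    for q in Q:
        # One score per key: dot product of the query with that key, scaled.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors, column by column.
        out.append(
            [sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))]
        )
    return out


Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
for row in attention(Q, K, V):
    print([round(x, 3) for x in row])
```

Note that nothing in `attention` refers to how far apart two tokens are. The score between positions 0 and 1 costs the same as the score between positions 0 and 1000.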

Innovation 02

Multi-head

h heads in parallel

What it does: instead of one big attention computation, split into h smaller ones (typically 8-64), each with its own Q/K/V projections. Each head can specialize in different patterns — syntactic, positional, semantic, coreference, etc.

Why it works: a single head produces one set of attention weights per token, so it can only emphasize one pattern at a time. Multi-head lets the model attend to grammatical structure AND content links AND positional cues all at once. You just saw this in the visualizer above.

// One model, many specialists.
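A simplified sketch of the split-and-concatenate step. One loud assumption: instead of learned Q/K/V projections per head, each head here just takes its own slice of the input vector, and the final output projection is omitted. The names `split_heads` and `multi_head_self_attention` are illustrative.

```python
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    d = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        out.append(
            [sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))]
        )
    return out


def split_heads(X, h):
    """Cut each token vector into h equal slices, one slice per head."""
    if len(X[0]) % h != 0:
        raise ValueError("vector width must be divisible by the number of heads")
    width = len(X[0]) // h
    return [[row[i * width:(i + 1) * width] for row in X] for i in range(h)]


def multi_head_self_attention(X, h):
    # Each head runs its own, smaller attention computation independently.
    heads = [attention(Xh, Xh, Xh) for Xh in split_heads(X, h)]
    out = []
    for t in range(len(X)):
        row = []
        for head in heads:
            row.extend(head[t])  # concatenate the heads back to full width
        out.append(row)
    return out


X = [[1.0, 0.0, 0.0, 2.0], [0.0, 1.0, 2.0, 0.0], [1.0, 1.0, 1.0, 1.0]]
print(multi_head_self_attention(X, h=2))
```

Because each head computes its own weights, two heads can look at the same sentence and settle on entirely different rows of attention.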

Innovation 03

Positional Encoding

sinusoids · RoPE · ALiBi

The problem it solves: attention has no built-in concept of order. Without position information, "the dog bit the man" and "the man bit the dog" contain the same tokens and look identical to attention, so position has to be injected somehow.

The fix: give each token a position-dependent signal. The original paper added a sinusoidal position vector to each token embedding. Many modern models instead use rotary position embeddings (RoPE), which rotate the query and key vectors by a position-dependent angle and tend to handle long contexts better.

// Order matters. Make sure it's encoded.
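A sketch of the sinusoidal scheme from the original paper, where even dimensions use a sine and odd dimensions use a cosine of the same angle. The function names and the tiny two-dimensional embeddings are assumptions for illustration. RoPE and ALiBi work differently and are not shown.

```python
import math


def sinusoidal_encoding(position, d_model):
    """Position vector: sin on even dimensions, cos on odd dimensions."""
    vec = []
    for i in range(0, d_model, 2):
        # Wavelengths grow geometrically with the dimension index.
        angle = position / (10000 ** (i / d_model))
        vec.append(math.sin(angle))
        if i + 1 < d_model:
            vec.append(math.cos(angle))
    return vec


def add_positions(embeddings):
    """Add the position vector for index `pos` to the embedding at `pos`."""
    return [
        [e + p for e, p in zip(vec, sinusoidal_encoding(pos, len(vec)))]
        for pos, vec in enumerate(embeddings)
    ]


# The same hypothetical token embedding placed at two different positions.
same_token_twice = [[1.0, 1.0], [1.0, 1.0]]
for row in add_positions(same_token_twice):
    print([round(x, 3) for x in row])
```

After the addition the two rows differ, so attention can tell the first occurrence of a token from the second.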

Innovation 04

Residual Connections

x → block(x) + x

What it does: the output of every sub-layer is added back to its input, so the model can effectively "skip" any layer that doesn't help. Each sub-layer is paired with LayerNorm to keep activations stable.

Why it's essential: without residuals, very deep networks are extremely hard to train because the gradient shrinks as it passes back through each layer. With them, gradients have a direct path to the early layers, and stacks of a hundred or more layers train reliably. This is what enables the depth (and therefore the capability) we see in modern LLMs.

// Depth without forgetting.
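A minimal sketch of the residual-plus-normalization pattern on a single token vector. It follows the ordering used in the original paper, LayerNorm(x + sublayer(x)). Many later models normalize before the sub-layer instead. The learned scale and shift inside LayerNorm are omitted here, and the function names are illustrative.

```python
import math


def layer_norm(x, eps=1e-5):
    """Rescale a vector to zero mean and roughly unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def residual(sublayer, x):
    """x -> sublayer(x) + x, element by element."""
    return [xi + yi for xi, yi in zip(x, sublayer(x))]


def post_norm_sublayer(sublayer, x):
    # Ordering from the original paper: normalize after the residual add.
    return layer_norm(residual(sublayer, x))


def useless_sublayer(x):
    # A sub-layer that contributes nothing.
    return [0.0] * len(x)


x = [1.0, 2.0, 3.0]
print(residual(useless_sublayer, x))          # input passes through untouched
print(post_norm_sublayer(useless_sublayer, x))
```

The first print shows the "skip" behavior: when a sub-layer outputs zeros, the residual path hands the input on unchanged.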

Part 05 · The family tree

One blueprint.
Three dialects.

The original transformer paper proposed an encoder-decoder design (for translation). Researchers quickly realized you could keep just one half. That choice shaped what kind of model you got.

Encoder-only

BERT-style

// BERT · RoBERTa · DeBERTa · ModernBERT

Keeps only the encoder stack. Each token can see all other tokens (no causal mask). Trained via masked language modeling — predict the hidden word from context.

Not designed to generate text token by token. Excels at understanding: classification, embeddings, named entity recognition, semantic search.

Best for Understanding tasks — search, classification, NER, sentiment, embeddings
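The masked-language-modeling setup can be sketched as a data-preparation step. This is a simplified illustration: the positions to hide are passed in explicitly, whereas real training picks them at random, and the helper name `mask_tokens` is an assumption for this example.

```python
def mask_tokens(tokens, positions, mask_token="[MASK]"):
    """Hide the tokens at `positions` and remember what they were."""
    masked = list(tokens)
    targets = {}
    for p in positions:
        targets[p] = masked[p]   # the word the model must recover
        masked[p] = mask_token   # what the model actually sees
    return masked, targets


masked, targets = mask_tokens(["the", "dog", "barked"], positions=[1])
print(masked)
print(targets)
```

Because the encoder has no causal mask, the model can use words on both sides of the hidden position to make its prediction.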

Decoder-only

GPT-style

// GPT-4 · Claude · Gemini · Llama · Mistral

Keeps only the decoder. Uses a causal mask — each token can only attend to itself and tokens before it. Trained via next-token prediction: given a prefix, predict the next token, autoregressively.

The dominant architecture for modern LLMs. Same blueprint at every scale from 1B to 1T+ parameters. Nearly every chat assistant you use today is decoder-only.

Best for Generation — chat, code, creative writing, reasoning, agents
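The causal mask described for decoder-only models can be sketched as follows. It assumes a square grid of raw attention scores is already available. The function name `causal_attention_weights` is illustrative.

```python
import math


def causal_attention_weights(scores):
    """Softmax each row over columns j <= i only; later columns get zero."""
    n = len(scores)
    weights = []
    for i in range(n):
        visible = scores[i][: i + 1]  # token i sees itself and earlier tokens
        m = max(visible)
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        # Pad with zeros for the future positions that are masked out.
        weights.append([e / total for e in exps] + [0.0] * (n - i - 1))
    return weights


uniform_scores = [[0.0] * 3 for _ in range(3)]
for row in causal_attention_weights(uniform_scores):
    print([round(w, 3) for w in row])
```

The result is lower-triangular: the first token can only attend to itself, and every later token spreads its attention over the prefix that ends at its own position.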

Encoder-Decoder

T5-style

// T5 · BART · Whisper · the original Transformer

Keeps both halves. Encoder processes the input bidirectionally, decoder generates output one token at a time while attending to the encoder. The original 2017 design.

Excellent for sequence-to-sequence tasks where you have a clean input and need to produce a different output: translation, summarization, speech-to-text.

Best for Seq2seq — translation, summarization, speech recognition

Course 3 · Module 01 complete

You now see
what powers all of them.

You can name every block in a transformer. You felt how attention heads specialize. You know why decoder-only models dominate chat. You understand parallelism as the actual key innovation — not attention by itself. That's the architectural foundation for everything else in Course 3.

Up next · Course 3 · Module 02

Attention Deep Dive

We zoom into the single most important sub-block: the attention mechanism itself. Q, K, V matrices. Scaled dot products. Why we divide by √d. Causal masks. Live computation walkthrough where you watch every matrix multiplication happen step by step.

