AI Skill Course · Course 3 · Expert
Module 01 of 14
Course 3 · Module 01 · 90 minutes

The architecture that ate the world.

In 2017, a paper from Google with a deceptively simple title, "Attention Is All You Need", proposed an architecture called the Transformer. By 2020 it had displaced most other approaches to language modeling. By 2023 it powered GPT-4, Claude, Gemini, and the leading code assistants, and it was spreading through image generation too. This module covers what the transformer actually is, gear by gear and beam by beam, so you can reason about what comes next.

You'll explore: The architecture, live
You'll see: Attention patterns emerge
You'll grasp: Why it scaled
[Figure: Attention patterns for "The cat sat on the mat". Each token decides which other tokens to "pay attention to": "sat" looks back at "cat" (subject) · "mat" looks back at "cat" (object link)]
Part 01 · The before-and-after

What changed in 2017 was parallelism.

Before transformers, sequence modeling relied on RNNs and LSTMs, which process one token at a time and carry a hidden state forward. That worked but couldn't scale: each token had to wait for the previous one, so training could not be spread across a GPU's parallel cores. Transformers replaced recurrence with self-attention, which lets every token look at every other token in parallel. That single change made it practical to train at the scale that gave us GPT-4.

// Before · RNN / LSTM

Sequential processing.

Words per pass: One
Parallelism: ~Zero
Long-range links: Lossy
GPU usage: Underutilized
Scaling ceiling: ~1B params
// Reads left-to-right, struggles with long sentences, leaves modern GPUs mostly idle
// After · Transformer

Fully parallel.

Words per pass: All of them
Parallelism: Maximal
Long-range links: Direct
GPU usage: Saturated
Scaling ceiling: 2T+ params
// Every token sees every other token at once · scales to thousands of GPUs
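
To make the before-and-after contrast concrete, here is a toy sketch. All names are hypothetical, the "hidden state" is a single number, and nothing is learned. The recurrent loop has to run in order because step t reads the result of step t-1. The attention-style scores have no such dependency, so every entry of the n×n grid could in principle be computed at the same time.

```python
import math

def rnn_pass(xs, w_h=0.5, w_x=1.0):
    """Toy recurrence: h_t = tanh(w_h * h_{t-1} + w_x * x_t)."""
    h = 0.0
    states = []
    for x in xs:                      # each step must wait for the previous h
        h = math.tanh(w_h * h + w_x * x)
        states.append(h)
    return states

def pairwise_scores(xs):
    """Toy attention scores: score[i][j] = x_i * x_j.

    No entry depends on any other entry, so this is trivially parallel.
    """
    return [[xi * xj for xj in xs] for xi in xs]

xs = [0.1, -0.4, 0.7, 0.2]
states = rnn_pass(xs)           # 4 sequential steps
scores = pairwise_scores(xs)    # 16 independent products
print(len(states), "sequential steps;", len(scores) * len(scores[0]), "independent scores")
```

The sequential version does less arithmetic, but a GPU is built for many independent operations at once, which is what the second function offers.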
// The paper that started everything
"Attention Is All You Need"
Vaswani et al. · NeurIPS 2017 · ~150,000 citations and counting

The original transformer was designed for machine translation. But the architecture turned out to be obscenely general. The same blueprint now handles language, code, images, audio, protein folding, robot control — anything that can be tokenized into a sequence. We're going to dissect that blueprint next.

Part 02 · Hands on · The blueprint

Click any block.
See what it does.

Below is the actual transformer architecture, simplified to its functional core. Click any colored block to learn what happens inside it. Then hit "Run inference" to watch data flow through, layer by layer.

The blueprint.

Read it top to bottom: tokens come in, get embedded, run through N structurally identical transformer blocks (each = attention + feed-forward, with its own weights), then the output head predicts the next token. Most modern LLMs differ mainly in scale: same skeleton, more layers, wider matrices, more attention heads.

[Diagram: the transformer blueprint, top to bottom]
Input tokens: "The cat sat on the mat"
Token Embedding: vocab_size × d_model lookup
+ Positional Encoding: tells the model "where" each token is
N × Transformer Block (e.g. 96 layers in GPT-3), each containing:
  Multi-Head Attention: tokens look at other tokens · the magic
  Add & LayerNorm: residual + normalize
  Feed-Forward Net: expand to 4× width · think · project back
  Add & LayerNorm: residual + normalize
LM Head + Softmax: project to vocab · pick next token
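
One way to see that models built on this skeleton differ mainly in scale is to count weights from the blueprint. The sketch below is a back-of-the-envelope estimate, not an official formula: `approx_params` is a hypothetical helper, it ignores biases, LayerNorm and positional parameters, and it assumes the LM head shares the embedding matrix. The two configurations are the publicly reported shapes of GPT-2 small and GPT-3 175B, used here only as illustrations.

```python
def approx_params(n_layers, d_model, vocab_size, ffn_mult=4):
    """Rough weight count for a decoder-style stack of identical blocks."""
    attention = 4 * d_model * d_model                 # W_Q, W_K, W_V and the output projection
    feed_forward = 2 * ffn_mult * d_model * d_model   # expand to 4x width, project back
    embedding = vocab_size * d_model                  # token lookup table (assumed tied to LM head)
    return n_layers * (attention + feed_forward) + embedding

small = approx_params(n_layers=12, d_model=768, vocab_size=50257)
large = approx_params(n_layers=96, d_model=12288, vocab_size=50257)

print(f"{small / 1e6:.0f}M")   # prints 124M
print(f"{large / 1e9:.1f}B")   # prints 174.6B
```

Same function, same skeleton: only the three numbers change.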

Part 03 · Hands on · Attention patterns

Each head learns a different job.

A transformer doesn't have one attention mechanism — it has many running in parallel, each looking for different patterns. Below: type a sentence (or pick a preset), then switch between four representative attention "heads." You'll see each one paying attention to completely different things. This is the moment the architecture stops being abstract.

How to read it.

The heatmap is an N×N grid where N = number of tokens. Row i, column j shows how much token i attends to token j. Brighter = stronger attention. Each head has been hand-tuned to mimic patterns real transformers actually learn — local context, look-back, anchor tokens, and content-based linking.

attention[row][col] = how much row attends to col
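
For readers without the live visualizer, here is a minimal sketch of how a hand-tuned "local context" head could be built. It is an assumption about one reasonable construction, not the page's actual widget code: the score for a pair falls off with the distance between the two tokens, and a softmax turns each row into weights that sum to 1. `local_context_head` and `sharpness` are hypothetical names.

```python
import math

def softmax(row):
    """Turn a list of scores into positive weights that sum to 1."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def local_context_head(tokens, sharpness=1.0):
    """attention[i][j] is largest when j == i and decays with |i - j|."""
    n = len(tokens)
    return [softmax([-sharpness * abs(i - j) for j in range(n)]) for i in range(n)]

tokens = "The cat sat on the mat".split()
attention = local_context_head(tokens)

# Print the N x N grid: row = the token doing the attending, column = the token attended to.
for tok, row in zip(tokens, attention):
    print(f"{tok:>4} " + " ".join(f"{w:.2f}" for w in row))
```

The brightest cells sit on the diagonal and fade outward, which is the band pattern a local-context head shows in a heatmap.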
Part 04 · Why it works

Four innovations.
That's the whole trick.

The transformer paper combined four specific ideas. Each was clever on its own. Together they made everything else possible.

Innovation 01

Self-attention

softmax(Q · Kᵀ / √d) · V

What it does: for each token, compute a weighted sum over the value vectors of all tokens (itself included), where the weights depend on how related each pair is. No recurrence, no time-stepping: every position attends to every position in a couple of matrix multiplications.

Why it changed everything: for the first time, a model could see "this word here in the question" and "that word there at the start of the document" at the same time, with equal effort. Distance stopped mattering.
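
The formula softmax(Q · Kᵀ / √d) · V can be sketched in a few lines of plain Python. This is a readability-first sketch, not a production implementation: it loops over queries where real code would use one batched matrix product, and the example feeds the same toy matrix in as Q, K and V (identity projections), which is an assumption made only to keep the example small.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(Q, K, V):
    """Scaled dot-product attention over lists of row vectors."""
    d = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors, one output column at a time.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))])
    return outputs, weights

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three toy tokens, d = 2
out, weights = self_attention(X, X, X)

for row in weights:
    print(" ".join(f"{w:.2f}" for w in row))
```

Each row of `weights` is one token's attention distribution. Token 0 puts more weight on tokens 0 and 2, which share its first feature, than on token 1, which does not.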

// Distance became free.
Innovation 02

Multi-head

h heads in parallel

What it does: instead of one big attention computation, split into h smaller ones (typically 8-64), each with its own Q/K/V projections. Each head can specialize in different patterns — syntactic, positional, semantic, coreference, etc.

Why it works: a single head produces just one set of attention weights per token, so it has to blend competing patterns into one distribution. Multi-head lets the model attend to grammatical structure AND content links AND positional cues all at once. You just saw this in the visualizer above.
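
A minimal sketch of the multi-head idea, under loud assumptions: the projection matrices are random rather than trained, the final output projection is omitted, and `multi_head` is a hypothetical helper. What it does show is the mechanics: each head gets its own Q/K/V projections into a smaller space of width d_model / h, runs attention there, and the head outputs are concatenated back to d_model.

```python
import math
import random

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def attend(Q, K, V):
    d = len(K[0])
    K_t = [list(col) for col in zip(*K)]                      # transpose of K
    scores = matmul(Q, K_t)                                   # Q · Kᵀ
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, V), weights

def multi_head(X, n_heads, rng):
    d_model = len(X[0])
    d_head = d_model // n_heads

    def rand_proj():
        # Untrained stand-in for a learned d_model x d_head projection.
        return [[rng.uniform(-1, 1) for _ in range(d_head)] for _ in range(d_model)]

    head_outputs, head_weights = [], []
    for _ in range(n_heads):
        Wq, Wk, Wv = rand_proj(), rand_proj(), rand_proj()
        out, w = attend(matmul(X, Wq), matmul(X, Wk), matmul(X, Wv))
        head_outputs.append(out)
        head_weights.append(w)

    # Concatenate the heads token by token: n_heads * d_head == d_model.
    concat = [sum((head[i] for head in head_outputs), []) for i in range(len(X))]
    return concat, head_weights

rng = random.Random(0)
X = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(4)]   # 4 tokens, d_model = 8
concat, head_weights = multi_head(X, n_heads=2, rng=rng)
print(len(concat), "tokens x", len(concat[0]), "features from", len(head_weights), "heads")
```

Because each head has different projections, the two attention grids in `head_weights` differ even though both heads read the same input.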

// One model, many specialists.
Innovation 03

Positional Encoding

sinusoids · RoPE · ALiBi

The problem it solves: attention has no built-in concept of order. Without position information, "the dog bit the man" and "the man bit the dog" are the same bag of tokens to attention. Position has to be injected somehow.

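The fix: add a position-dependent vector to each token embedding. The original paper used sinusoidal functions. Many modern models instead use rotary position embeddings (RoPE), which rotate the query and key vectors by a position-dependent angle so that attention scores depend on relative position, an approach that tends to hold up better on long contexts. ALiBi takes a third route and adds a distance-based penalty directly to the attention scores.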
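
Here is a small sketch of the sinusoidal scheme from the original paper, applied to the dog/man example. The 4-dimensional embeddings in `TOY_EMBEDDINGS` are made up for illustration, and `encode` is a hypothetical helper. Without position the two sentences contain exactly the same set of vectors. Once the encoding is added, they no longer do.

```python
import math

def sinusoidal_encoding(n_positions, d_model):
    """PE[pos][2k] = sin(pos / 10000^(2k/d_model)); PE[pos][2k+1] = cos of the same angle."""
    table = []
    for pos in range(n_positions):
        row = []
        for even in range(0, d_model, 2):
            angle = pos / (10000 ** (even / d_model))
            row += [math.sin(angle), math.cos(angle)]
        table.append(row[:d_model])
    return table

# Made-up embeddings, one vector per word.
TOY_EMBEDDINGS = {
    "the": [0.1, 0.1, 0.1, 0.1],
    "dog": [1.0, 0.0, 0.0, 0.0],
    "bit": [0.0, 1.0, 0.0, 0.0],
    "man": [0.0, 0.0, 1.0, 0.0],
}

def encode(sentence, use_position=True):
    tokens = sentence.split()
    pe = sinusoidal_encoding(len(tokens), 4)
    rows = []
    for pos, tok in enumerate(tokens):
        emb = TOY_EMBEDDINGS[tok]
        rows.append([e + p for e, p in zip(emb, pe[pos])] if use_position else list(emb))
    return rows

a, b = "the dog bit the man", "the man bit the dog"

same_without = sorted(encode(a, False)) == sorted(encode(b, False))
same_with = sorted(encode(a)) == sorted(encode(b))
print(same_without, same_with)   # prints True False
```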

// Order matters. Make sure it's encoded.
Innovation 04

Residual Connections

x → block(x) + x

What it does: the output of every sub-layer is added back to its input — so the model can "skip" any layer if it doesn't help. Combined with LayerNorm to keep activations stable.

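Why it's essential: without residuals, very deep networks are extremely hard to train because the gradient shrinks (or blows up) as it passes back through layer after layer. With them, the gradient has a direct path through the additions, and networks with hundreds of layers, even 1000+ in published ResNet experiments, become trainable. This is what enables the depth (and therefore the capability) we see in modern LLMs.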
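
Two toy demonstrations of x → block(x) + x, both simplified and using hypothetical helper names. First, a sub-layer that contributes nothing leaves the input untouched, which is the "skip" behavior. Second, a scalar caricature of the gradient: if each sub-layer has a small local derivative g, a plain stack multiplies g together L times, while a residual stack multiplies (1 + g). LayerNorm is shown without its learned scale and shift.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def residual(x, sublayer):
    """x -> sublayer(x) + x"""
    return [xi + yi for xi, yi in zip(x, sublayer(x))]

def add_and_norm(x, sublayer):
    """The 'Add & LayerNorm' step: residual add, then normalize."""
    return layer_norm(residual(x, sublayer))

def useless_block(x):
    return [0.0] * len(x)            # a sub-layer that has learned nothing useful

x = [1.0, 2.0, 3.0, 4.0]
skipped = residual(x, useless_block)  # identical to x: the layer is effectively skipped
normed = add_and_norm(x, useless_block)

# Scalar caricature of backpropagation through L stacked sub-layers.
g, L = 0.01, 100
plain_grad = g ** L                   # vanishes
residual_grad = (1 + g) ** L          # stays on a usable scale
print(skipped == x, plain_grad, round(residual_grad, 2))
```

The second demo is only an intuition pump: real gradients are matrices and depend on the data, but the "+ x" term is what keeps a direct path open.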

// Depth without forgetting.
Part 05 · The family tree

One blueprint.
Three dialects.

The original transformer paper proposed an encoder-decoder design (for translation). Researchers quickly realized you could keep just one half. That choice shaped what kind of model you got.

Encoder-only

BERT-style

// BERT · RoBERTa · DeBERTa · ModernBERT

Keeps only the encoder stack. Each token can see all other tokens (no causal mask). Trained via masked language modeling — predict the hidden word from context.

Not built to generate text token-by-token. Excels at understanding: classification, embeddings, named entity recognition, semantic search.

Best for: Understanding tasks (search, classification, NER, sentiment, embeddings)
Decoder-only

GPT-style

// GPT-4 · Claude · Gemini · Llama · Mistral

Keeps only the decoder. Uses a causal mask — each token can only attend to itself and tokens before it. Trained via next-token prediction: given a prefix, predict the next token, autoregressively.
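
The difference between the encoder's "see everything" attention and the decoder's causal mask comes down to which scores are allowed to survive the softmax. A minimal sketch, assuming all raw scores are equal so that only the mask shapes the result (`causal_mask` and `masked_attention_weights` are hypothetical names):

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]    # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def causal_mask(n):
    """mask[i][j] is True when token i may attend to token j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_attention_weights(scores, mask):
    """Set disallowed scores to -inf before the softmax so they get zero weight."""
    return [
        softmax([s if keep else -math.inf for s, keep in zip(score_row, mask_row)])
        for score_row, mask_row in zip(scores, mask)
    ]

n = 4
scores = [[0.0] * n for _ in range(n)]            # equal raw scores everywhere

decoder_style = masked_attention_weights(scores, causal_mask(n))
encoder_style = masked_attention_weights(scores, [[True] * n for _ in range(n)])

for row in decoder_style:
    print(row)
```

In `decoder_style` the first token can only attend to itself, the second splits its weight over two tokens, and so on down the lower triangle. In `encoder_style` every row spreads its weight over all four tokens.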

The dominant architecture for modern LLMs. Same blueprint at every scale from 1B to 1T+ parameters. Nearly everything you chat with today is decoder-only.

Best for: Generation (chat, code, creative writing, reasoning, agents)
Encoder-Decoder

T5-style

// T5 · BART · Whisper · the original Transformer

Keeps both halves. Encoder processes the input bidirectionally, decoder generates output one token at a time while attending to the encoder. The original 2017 design.

Excellent for sequence-to-sequence tasks where you have a clean input and need to produce a different output: translation, summarization, speech-to-text.

Best for: Seq2seq (translation, summarization, speech recognition)
Part 06 · Knowledge check

Five questions on what you just dissected.

Aim for 4/5. Wrong answers explain themselves.

Course 3 · Module 01 complete

You now see what powers all of them.

You can name every block in a transformer. You felt how attention heads specialize. You know why decoder-only models dominate chat. You understand parallelism as the actual key innovation — not attention by itself. That's the architectural foundation for everything else in Course 3.

Up next · Course 3 · Module 02

Attention Deep Dive

We zoom into the single most important sub-block: the attention mechanism itself. Q, K, V matrices. Scaled dot products. Why we divide by √d. Causal masks. Live computation walkthrough where you watch every matrix multiplication happen step by step.
