AI Skill Course Course 3 · Expert
Module 07 of 14
Course 3 · Module 07 · 95 minutes

Attention is O(n²).
So how is Gemini's
context 2M?

A 1-million-token context window means one trillion attention scores per layer per head. That's enough to crush any GPU on Earth. Yet Gemini does 2M, Claude does 200k, even local models handle 128k. This module unpacks the bag of tricks: Flash Attention, sparse attention patterns, Mamba, RoPE scaling — the engineering that made long context actually work.

You'll see
The quadratic wall
You'll compare
5 attention patterns
You'll grasp
How Mamba escapes O(n²)
Climb the wall
Attention scores per layer per head, by context length: 1K → ~1M · 8K → 64M · 32K → 1B · 128K → 16B · 1M → 1T · 2M → 4T
Doubling context → 4× the scores. This is the wall every long-context model has to climb.
Part 01 · Hands on · The cost explosion

Slide a context length.
Watch the math break.

Pick any sequence length from 1K to 2M tokens. Watch what happens to memory and FLOPs under standard attention vs sliding window vs Mamba-style linear models. The y-axis is logarithmic — and the gaps are still enormous.

What you're measuring.

Memory needed for the attention matrix at BF16 (2 bytes per score), per layer per attention head. A typical model has 32 heads × 32 layers = 1024 of these matrices. The numbers shown are per matrix. Multiply by ~1000 to get full-model memory if all matrices were materialized in HBM at once. H100 GPU has 80 GB HBM. You'll see where each approach hits the wall.
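The arithmetic above can be reproduced in a few lines. This is a minimal sketch, assuming 2 bytes per score and the 32 × 32 matrix count from the text; the helper names and the decimal reading of "80 GB" are illustrative assumptions, not part of the course's widget.

```python
import math

def attention_matrix_bytes(n_tokens: int, bytes_per_score: int = 2) -> int:
    """Bytes for one n x n score matrix (one head, one layer) at BF16."""
    return n_tokens * n_tokens * bytes_per_score

def max_tokens_in_budget(budget_bytes: int, bytes_per_score: int = 2) -> int:
    """Largest n whose single n x n score matrix fits in the budget."""
    return math.isqrt(budget_bytes // bytes_per_score)

H100_HBM_BYTES = 80 * 10**9   # assumption: 80 GB read as decimal gigabytes
MATRICES = 32 * 32            # 32 heads x 32 layers, as in the text

for label, n in [("8K", 8 * 1024), ("128K", 128 * 1024), ("1M", 1024 * 1024)]:
    one = attention_matrix_bytes(n)
    print(f"{label:>5}: {one / 2**30:10.3f} GiB per matrix, "
          f"{one * MATRICES / 2**40:10.3f} TiB for all {MATRICES}")

print("one matrix fills an H100 at n =", max_tokens_in_budget(H100_HBM_BYTES))
```

Under these assumptions a single matrix already fills the whole card at a context of about 200K tokens, before weights, activations, or the other 1023 matrices are counted.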

// Pick a context length (interactive slider, default n = 8K)
Full attention: O(n²) scores
Sliding window (w=4096): O(n·w) scores
Mamba / SSM: O(n·d) compute, fixed-size state
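To compare the three growth rates without the slider, here is a small sketch. The window size comes from the text; the state size d = 16 is a hypothetical value chosen only for illustration, and the sliding-window count is a simple upper bound rather than an exact causal count.

```python
def full_attention_scores(n: int) -> int:
    """Every token scores against every token: n * n."""
    return n * n

def sliding_window_scores(n: int, w: int = 4096) -> int:
    """Each token scores against at most w tokens (upper bound)."""
    return n * min(n, w)

def ssm_state_updates(n: int, d: int = 16) -> int:
    """One update of a d-sized state per token (d = 16 is hypothetical)."""
    return n * d

for label, n in [("8K", 2**13), ("128K", 2**17), ("2M", 2**21)]:
    print(f"{label:>5}  full={full_attention_scores(n):.2e}  "
          f"window={sliding_window_scores(n):.2e}  "
          f"ssm={ssm_state_updates(n):.2e}")
```

The gap between full attention and the sliding window is simply n / w once n exceeds w, which is why it keeps widening as the context grows.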
Part 02 · Flash Attention

Same math.
Smarter memory.

// GPU memory hierarchy
HBM (slow): 80 GB · ~3 TB/s bandwidth. Naïve attention writes n² scores here → memory-bound.
Slow transfers between the two levels.
SRAM (fast): ~256 KB per SM · ~19 TB/s · 6× faster. Flash Attention computes here.
The Flash Attention insight: never write the n² matrix to HBM. Tile into SRAM-sized blocks and compute incrementally. Same output · 4-8× faster · less memory.

The first realization wasn't more math — it was IO.

Standard attention computes QKᵀ, writes the n×n matrix to GPU memory (HBM), reads it back, applies softmax, writes again, multiplies by V, writes the output. The slow part isn't the arithmetic. It's the constant trips to HBM.

Flash Attention (Dao et al. 2022) noticed: the GPU has tiny but extremely fast on-chip SRAM. Tile Q, K, V into blocks small enough to fit there, compute attention block-by-block, accumulate the result. The full n² matrix is never materialized.

The output matches naïve attention up to floating-point rounding, yet it runs up to 4-8× faster and uses ~10× less memory at long contexts, because the extra memory now grows linearly with n instead of quadratically. No accuracy tradeoff. Just smarter use of the hardware.
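The block-by-block accumulation can be sketched in plain Python. This is only an illustration of the running-softmax idea, under the assumption that K and V are processed in small blocks one query at a time; the real kernel also tiles Q and runs inside GPU SRAM, which this sketch does not attempt. All function names are hypothetical.

```python
import math

def naive_attention(Q, K, V):
    """Reference: builds the full row of scores before the softmax."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / total
                    for j in range(len(V[0]))])
    return out

def tiled_attention(Q, K, V, block=2):
    """Visit K/V in blocks, keeping only a running max, sum, and output."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        run_max, run_sum = -math.inf, 0.0
        acc = [0.0] * len(V[0])
        for start in range(0, len(K), block):
            k_blk, v_blk = K[start:start + block], V[start:start + block]
            scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
            new_max = max(run_max, max(scores))
            fix = math.exp(run_max - new_max)  # rescale earlier partial results
            weights = [math.exp(s - new_max) for s in scores]
            run_sum = run_sum * fix + sum(weights)
            acc = [a * fix + sum(w * v[j] for w, v in zip(weights, v_blk))
                   for j, a in enumerate(acc)]
            run_max = new_max
        out.append([a / run_sum for a in acc])
    return out

# Small deterministic inputs: 5 queries, 7 keys/values.
Q = [[math.sin(i + 2 * j) for j in range(4)] for i in range(5)]
K = [[math.cos(i * j + 1) for j in range(4)] for i in range(7)]
V = [[math.sin(3 * i - j) for j in range(3)] for i in range(7)]

reference = naive_attention(Q, K, V)
tiled = tiled_attention(Q, K, V, block=3)
max_diff = max(abs(a - b)
               for row_a, row_b in zip(reference, tiled)
               for a, b in zip(row_a, row_b))
print(f"largest difference between the two: {max_diff:.2e}")
```

The tiled version never holds more than one block of scores at a time, which is the property that lets the real kernel stay inside SRAM.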

Nearly every production transformer since 2023 uses Flash Attention (now in version 3). It's a large part of why 128K context is practical without exotic tricks. Same equation, much better engineering.

Part 03 · Hands on · Sparse attention patterns

If you can't afford every score,
be picky.

Flash Attention helps but doesn't change the O(n²) asymptotics. At extreme lengths, you have to compute fewer scores. Below: 5 sparse patterns used in real models. Each renders as an N×N grid where bright cells = "this token attends to that token." Click any pattern to see the tradeoffs.

Reading the grids.

Each grid is a 24×24 visualization (real models use n in the thousands or millions). Row i, column j highlighted = "token i attends to token j." More cells = more compute. The active-cell count vs full attention (576) tells you the speedup. Click any pattern for the full story.
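One of these grids is easy to rebuild by hand. The sketch below draws a causal sliding-window pattern on the same 24×24 canvas and counts its active cells; the window size of 4 is a hypothetical value for display, not the setting used in the course's widget.

```python
def sliding_window_mask(n: int = 24, window: int = 4):
    """mask[i][j] is True when token i attends to token j (causal, local)."""
    return [[0 <= i - j < window for j in range(n)] for i in range(n)]

def active_cells(mask) -> int:
    """Number of highlighted cells, i.e. scores that must be computed."""
    return sum(cell for row in mask for cell in row)

def render(mask) -> str:
    """Text version of the grid: '#' = attends, '.' = skipped."""
    return "\n".join("".join("#" if cell else "." for cell in row)
                     for row in mask)

mask = sliding_window_mask(24, 4)
print(render(mask))
print(f"active cells: {active_cells(mask)} of {24 * 24}")
```

Setting window equal to n turns the same function into plain causal attention, which makes a convenient baseline when comparing counts.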

Part 04 · The escape hatch · Mamba & SSMs

What if attention
just isn't the answer?

State Space Models replace attention with a recurrent state.

Mamba (2023) and its predecessor S4 don't compute QKᵀ at all. They keep a fixed-size hidden state that gets updated as tokens stream through. Read: n tokens, but only O(n·d) compute — strictly linear in sequence length.

The clever part: Mamba's SSM is selective. Unlike in S4, the recurrence weights depend on the input itself, so the state can "remember" relevant tokens far back. This makes it competitive with attention on many tasks, with reported speedups of roughly 5-10× at long contexts.

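The catch: the recurrence is sequential by nature. Each step depends on the previous one, so it cannot be parallelized over the sequence as directly as attention. Mamba uses a parallel "scan" algorithm to recover that parallelism on GPUs, but it's still trickier to implement than vanilla attention.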
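Why a scan helps can be shown on a scalar recurrence h_t = a_t · h_{t-1} + b_t. The sketch below uses a simple textbook (Hillis-Steele style) scan, not Mamba's actual hardware-aware kernel; the function names are hypothetical. The key fact is that two consecutive steps can be merged into one step of the same form, so merges can be grouped in any order.

```python
def combine(left, right):
    """Merge two consecutive steps (a1, b1) then (a2, b2) into one."""
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)

def sequential_scan(a, b):
    """Plain loop: each step waits for the previous one."""
    h, out = 0.0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out

def parallel_style_scan(a, b):
    """About log2(n) passes; within a pass every position is independent."""
    elems = list(zip(a, b))
    step = 1
    while step < len(elems):
        elems = [combine(elems[i - step], elems[i]) if i >= step else elems[i]
                 for i in range(len(elems))]
        step *= 2
    # With h_0 = 0, the second component of each merged step is h_t.
    return [b_t for _, b_t in elems]

a = [2, 3, 1, 2, 1]
b = [1, 1, 2, 0, 3]
print(sequential_scan(a, b))
print(parallel_style_scan(a, b))
```

On a GPU each pass of the while loop would run across all positions at once, so the sequence is covered in a logarithmic number of passes instead of n dependent steps.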

Hybrid models such as Jamba interleave Mamba layers with attention layers: Mamba for cheap long-range routing, attention for sharp focus. The most likely future is "attention plus something else," not "attention everywhere."

// SSM · recurrent state update
Inputs x₁ → x₂ → x₃ → x₄ → x₅ → ... update states h₁ ... h₅, which produce outputs y₁ ... y₅.
h_t = A · h_{t-1} + B · x_t
y_t = C · h_t
Fixed-size state h carries everything forward. No n² matrix · linear in sequence length.
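The two update equations translate almost directly into code. This is a minimal sketch of a plain, non-selective SSM with a scalar input per step; the matrices and the function name are made-up illustration values, and Mamba's input-dependent weights are deliberately left out.

```python
def ssm_run(A, B, C, xs):
    """Apply h_t = A h_{t-1} + B x_t and y_t = C h_t over a sequence.

    A is a d x d matrix, B and C are length-d vectors, xs are scalars.
    """
    d = len(B)
    state = [0.0] * d                     # fixed size, whatever len(xs) is
    ys = []
    for x in xs:
        state = [sum(A[i][j] * state[j] for j in range(d)) + B[i] * x
                 for i in range(d)]
        ys.append(sum(c * s for c, s in zip(C, state)))
    return ys, state

A = [[0.5, 0.0],
     [0.0, 0.25]]   # hypothetical decay rates for a 2-dimensional state
B = [1.0, 1.0]
C = [1.0, 2.0]

ys, final_state = ssm_run(A, B, C, [1.0, 0.0, 0.0])
print(ys)            # an impulse at step 1 fades over the following steps
print(final_state)   # still only two numbers
```

However long the input grows, the loop touches each token once and the state never gets larger, which is where the linear cost comes from.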
Part 05 · The other half · Position embeddings

A separate problem:
positions go out of range.

Even if you solve memory, there's a second hurdle. Models train on, say, 8K-token sequences. At inference, you feed in 200K. The position embeddings have never seen those positions. The model breaks: output stays coherent for the first ~8K tokens, then dissolves into gibberish. Three techniques fix this.

01

Position Interpolation

Squish positions back

Trained on positions 0-8K. Want to use 32K. Linearly rescale positions: position 32000 in the new context becomes position 8000 in the old encoding. Position embeddings stay in the trained range.

Cheap, works decently, but loses precision: many distinct positions get squished into the same encoding.

θ_new = θ_train · L_train / L_new
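The rescaling is a single multiplication. The sketch below uses the 8K → 32K example from the text; the helper names, the tiny dimension of 8, and the RoPE base of 10000 are assumptions made for illustration.

```python
def rope_frequencies(dim: int, base: float = 10000.0):
    """One rotation frequency per pair of dimensions."""
    return [base ** (-2 * i / dim) for i in range(dim // 2)]

def interpolate_position(pos: float, train_len: int = 8192,
                         new_len: int = 32768) -> float:
    """Squish a position in the long context back into the trained range."""
    return pos * train_len / new_len

def rope_angles(pos: float, dim: int = 8, base: float = 10000.0):
    """Rotation angle applied to each dimension pair at this position."""
    return [pos * f for f in rope_frequencies(dim, base)]

print(interpolate_position(32000))   # lands inside the trained 0-8K range
print(interpolate_position(100), interpolate_position(101))
print(rope_angles(interpolate_position(32000)))
```

The second print shows the precision cost: neighbouring tokens that used to sit one full position apart are now only a quarter of a position apart.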
02

NTK-aware scaling

Frequency-dependent rescale

RoPE rotates pairs of dimensions at different frequencies. Position interpolation squishes all of them equally. NTK-aware scaling squishes low-frequency dimensions more (they carry long-range info) and high-frequency ones less (they carry local info).

The result: better preservation of local detail than plain interpolation, at no extra compute, since only the rotation frequencies change.

θ_i = base^(−2i/d) · scale^(2i/d)
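To see the frequency-dependent effect, the sketch below evaluates the formula as written here, reading scale as L_train / L_new (a value below 1). That reading, the dimension of 64, and the base of 10000 are assumptions; published NTK-aware variants differ in the exact exponent they use.

```python
def rope_frequencies(dim: int, base: float = 10000.0):
    """Original RoPE frequencies: base^(-2i/d)."""
    return [base ** (-2 * i / dim) for i in range(dim // 2)]

def ntk_frequencies(dim: int, scale: float, base: float = 10000.0):
    """Formula from the text: base^(-2i/d) * scale^(2i/d)."""
    return [base ** (-2 * i / dim) * scale ** (2 * i / dim)
            for i in range(dim // 2)]

dim = 64
scale = 8192 / 32768          # assumed meaning: L_train / L_new = 0.25

plain = rope_frequencies(dim)
scaled = ntk_frequencies(dim, scale)
ratios = [s / p for s, p in zip(scaled, plain)]

print(f"highest-frequency pair kept at {ratios[0]:.3f} of its original rate")
print(f"lowest-frequency pair slowed to {ratios[-1]:.3f} of its original rate")
print(f"plain interpolation would slow every pair to {scale:.3f}")
```

The fastest-rotating pairs, which encode local order, are left almost untouched, while the slowest pairs are squished nearly as much as under plain interpolation.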
03

YaRN

Best of both, fine-tuned

Combines NTK-style frequency-dependent scaling with an attention-temperature adjustment and a small amount of fine-tuning on long sequences. The model learns to handle the rescaled positions with little further degradation. A leading method for extending the context windows of pretrained models.

Used in many open-weight extended-context models. A widely used recipe.

NTK-scale + LoRA fine-tune on ~1B long-context tokens
Part 06 · Long context in the wild

The models that already
read the whole book.

By 2024, every frontier lab was racing on context length. Different approaches, different tradeoffs, different "effective" vs "marketed" context.

Google · 2024

Gemini 1.5 / 2

// the long-context leader
Marketed context: 1M (1.5) → 2M (2)
Architecture: MoE + custom attention
Key trick: Ring attention + Flash + ??
Needle-in-haystack: ~99% at 1M tokens
Use cases: Entire books · hours of video

Set the bar in early 2024. Architecture details mostly secret, but rumors suggest Ring Attention (split context across GPUs, exchange KV via a ring) plus Flash Attention. The needle-in-haystack performance at 1M is genuinely impressive — most models degrade well before their marketed limit.

Anthropic · 2024

Claude 3 / 3.5

// long context + strong recall
Marketed context: 200K
Architecture: Dense transformer + Flash
Effective recall: ~95%+ at 200K
Best for: Long documents · code repos
Famous demo: Reading Pride and Prejudice

Slightly shorter than Gemini but with very strong "needle in haystack" recall: Claude can reliably find a single sentence buried anywhere in 200K tokens. For real-world use, quality of recall matters more than raw window size.

OpenAI · 2024

GPT-4 Turbo / GPT-4o

// late but solid
Marketed context: 128K
Architecture: MoE (rumored) + Flash
Effective recall: Strong, degrades past 64K
Approach: Iterative expansion + YaRN-style
Generally: Conservative on context

OpenAI didn't chase the longest-context race. 128K hits the sweet spot for most practical use cases (whole codebases, long documents) without the recall degradation that plagues 1M+ models. Pragmatic choice.

Mistral · 2024

Mistral / Mixtral

// sliding window pioneer
Marketed context: 32K – 128K
Architecture: Sliding window (w=4096)
Key trick: No full attention at all
Tradeoff: Local context only
Strength: Fast at long context

Mistral 7B was the first major open-weight model to ship with sliding-window attention as the default. Each token can only "see" the previous 4096 — but through stacked layers, information propagates further. Trades raw recall for speed and simplicity.
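How far information can travel through stacked windows is a one-line estimate. The sketch below treats each layer as extending the reach by one window, which is a rough upper bound; the 32-layer figure is an assumption about Mistral 7B, and both helper names are hypothetical.

```python
import math

def theoretical_span(layers: int, window: int = 4096) -> int:
    """Rough upper bound on how far back the top layer can be influenced."""
    return layers * window

def layers_needed(distance: int, window: int = 4096) -> int:
    """Minimum number of stacked layers for a token this far back to matter."""
    return math.ceil(distance / window)

print(theoretical_span(32))      # assumed 32 layers, w = 4096
print(layers_needed(100_000))    # layers a 100K-token-old fact must pass through
```

Reaching that far depends on every intermediate layer passing the information along, which is one way to read the "trades raw recall" caveat above.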

Part 07 · Knowledge check

Five questions on what
you just scaled.

Aim for 4/5. Wrong answers explain themselves.

Course 3 · Module 07 complete

You can now read
a model's context number critically.

You saw the O(n²) wall directly. You understand why Flash Attention was so impactful (IO, not math). You compared 5 sparse patterns and know which models use which. You know what Mamba does differently. You can explain RoPE scaling. When you next read "this model has a 1M context window," you'll know to ask: with which attention pattern, with what effective recall, and at what cost.

Up next · Course 3 · Module 08

Tool Use & Function Calling

Long context is one way to give a model more capability. Tools are another: instead of stuffing the world into the prompt, let the model call out to APIs, search engines, calculators, databases. JSON schemas, parallel calls, agent loops. Interactive: watch a model decide when to call a tool, and route the response back into its context.
