A 1-million-token context window means one trillion attention scores per layer per head. That's enough to crush any GPU on Earth. Yet Gemini does 2M, Claude does 200K, and even local models handle 128K. This module unpacks the bag of tricks (Flash Attention, sparse attention patterns, Mamba, RoPE scaling): the engineering that made long context actually work.
Climb the wall
Pick any sequence length from 1K to 2M tokens. Watch what happens to memory and FLOPs under standard attention vs sliding window vs Mamba-style linear models. The y-axis is logarithmic, and the gaps are still enormous.
Memory needed for the attention matrix at BF16 (2 bytes per score), per layer per attention head. A typical model has 32 heads × 32 layers = 1024 of these matrices. The numbers shown are per matrix. Multiply by ~1000 to get full-model memory if all matrices were materialized in HBM at once. H100 GPU has 80 GB HBM. You'll see where each approach hits the wall.
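To make the arithmetic concrete, here is a small sketch that reproduces the per-matrix and full-model numbers under the same assumptions as the text (2 bytes per score, 32 heads × 32 layers, 80 GB of HBM). The function names are made up for illustration.

```python
def attention_matrix_bytes(n_tokens, bytes_per_score=2):
    """Bytes for one n x n attention matrix (one layer, one head) at BF16."""
    return n_tokens * n_tokens * bytes_per_score


def full_model_bytes(n_tokens, heads=32, layers=32, bytes_per_score=2):
    """Bytes if every head in every layer materialized its matrix at once."""
    return attention_matrix_bytes(n_tokens, bytes_per_score) * heads * layers


H100_HBM_BYTES = 80 * 10**9  # 80 GB, as stated in the text

for n in (1_000, 8_000, 128_000, 1_000_000, 2_000_000):
    one = attention_matrix_bytes(n)
    fits = "fits" if one <= H100_HBM_BYTES else "does not fit"
    print(
        f"{n:>9,} tokens: {one / 1e9:12.3f} GB per matrix ({fits} in one H100), "
        f"{full_model_bytes(n) / 1e12:12.3f} TB for all 1024 matrices"
    )
```

Doubling the sequence length quadruples every number in the output, which is the whole problem in one line.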
Standard attention computes QKᵀ, writes the n×n matrix to GPU memory (HBM), reads it back, applies softmax, writes again, multiplies by V, writes the output. The slow part isn't the arithmetic. It's the constant trips to HBM.
Flash Attention (Dao et al. 2022) noticed: the GPU has tiny but extremely fast on-chip SRAM. Tile Q, K, V into blocks small enough to fit there, compute attention block-by-block, accumulate the result. The full n² matrix is never materialized.
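The output is mathematically identical to naïve attention (up to floating-point rounding), but it typically runs 2-4× faster than a standard implementation, and its memory grows linearly with sequence length instead of quadratically: roughly 10× less at a few thousand tokens, and more beyond that. No accuracy tradeoff. Just smarter use of the hardware.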
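The block-by-block idea can be sketched in NumPy. This is only an illustration of the tiling and running-softmax bookkeeping, not the real fused GPU kernel: there is no SRAM here, the block size of 16 is arbitrary, and attention is non-causal for simplicity. The point is that the tiled version never builds the n×n matrix and still matches the naïve result.

```python
import numpy as np


def naive_attention(q, k, v):
    """Reference version: materializes the full n x n score matrix."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


def tiled_attention(q, k, v, block=16):
    """Same result, but only ever holds one block x block tile of scores."""
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((n, v.shape[1]))
    for i in range(0, n, block):
        qi = q[i:i + block]
        row_max = np.full(qi.shape[0], -np.inf)   # running max per query row
        denom = np.zeros(qi.shape[0])             # running softmax denominator
        acc = np.zeros((qi.shape[0], v.shape[1])) # running weighted sum of V
        for j in range(0, k.shape[0], block):
            s = (qi @ k[j:j + block].T) * scale   # one small tile of scores
            new_max = np.maximum(row_max, s.max(axis=1))
            rescale = np.exp(row_max - new_max)   # fix up earlier partial sums
            p = np.exp(s - new_max[:, None])
            denom = denom * rescale + p.sum(axis=1)
            acc = acc * rescale[:, None] + p @ v[j:j + block]
            row_max = new_max
        out[i:i + block] = acc / denom[:, None]
    return out


rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((50, 8)) for _ in range(3))
print(np.allclose(naive_attention(q, k, v), tiled_attention(q, k, v)))
```

The `rescale` factor is what lets each tile be processed independently: whenever a later tile raises the row maximum, the earlier partial sums are shrunk to match.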
Nearly every production transformer since 2023 uses Flash Attention (now in version 3). It's why 128K context is even possible without exotic tricks. Same equation, much better engineering.
Flash Attention helps but doesn't change the O(n²) asymptotics. At extreme lengths, you have to compute fewer scores. Below: 5 sparse patterns used in real models. Each renders as an N×N grid where bright cells = "this token attends to that token." Click any pattern to see the tradeoffs.
Each grid is a 24×24 visualization (real models use n in the thousands or millions). Row i, column j highlighted = "token i attends to token j." More cells = more compute. The active-cell count vs full attention (576) tells you the speedup.
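The grids can be reproduced as boolean masks. The five patterns aren't named in this text, so the two below (a causal sliding window, and the same window plus a few "global" tokens everyone can see) are illustrative choices, and the window size of 4 and the 2 global tokens are arbitrary.

```python
import numpy as np


def full_mask(n):
    """Every token attends to every token: n * n active cells."""
    return np.ones((n, n), dtype=bool)


def sliding_window_mask(n, window):
    """Token i attends to itself and the previous window - 1 tokens."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (i - j < window)


def global_plus_window_mask(n, window, n_global):
    """Sliding window, plus the first n_global tokens visible to all later tokens."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return sliding_window_mask(n, window) | ((j < n_global) & (j <= i))


n = 24
for name, mask in [
    ("full", full_mask(n)),
    ("sliding window (4)", sliding_window_mask(n, 4)),
    ("window (4) + 2 global", global_plus_window_mask(n, 4, 2)),
]:
    active = int(mask.sum())
    print(f"{name:>22}: {active:3d} active cells, {full_mask(n).sum() / active:.1f}x fewer than full")
```

At 24 tokens the savings look modest. The window's cell count grows like n × window rather than n², so the ratio keeps improving as n grows.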
Mamba (2023) and its predecessor S4 don't compute QKᵀ at all. They keep a fixed-size hidden state that gets updated as tokens stream through. Processing n tokens takes only O(n·d) compute, which is linear in sequence length.
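The clever part: Mamba's SSM is selective. Unlike in S4, the recurrence parameters depend on the input itself, so the state can choose which tokens to "remember" from far back. This makes Mamba competitive with attention on many tasks while running several times faster at long contexts (the Mamba paper reports about 5× higher inference throughput than comparable Transformers).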
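A toy version of "fixed-size state, input-dependent update" fits in a few lines. This is a deliberately simplified gated recurrence, not Mamba's actual parameterization: the elementwise gate, the weight names `w_gate` and `w_in`, and the sigmoid are all assumptions made for illustration.

```python
import numpy as np


def toy_selective_recurrence(x, w_gate, w_in):
    """x has shape (n, d). The state h always has shape (d,), however long x is."""
    n, d = x.shape
    h = np.zeros(d)
    outputs = np.empty((n, d))
    for t in range(n):
        # The gate depends on the current input: near 1 keeps the old state,
        # near 0 overwrites it with the new token.
        keep = 1.0 / (1.0 + np.exp(-(x[t] * w_gate)))
        h = keep * h + (1.0 - keep) * (x[t] * w_in)
        outputs[t] = h
    return outputs


rng = np.random.default_rng(0)
x = rng.standard_normal((1000, 4))
y = toy_selective_recurrence(x, w_gate=np.ones(4), w_in=np.ones(4))
print(y.shape)  # one output per token, computed with O(d) work each
```

There is one pass over the tokens and no n×n object anywhere. Memory for the state stays constant whether the input is 1K or 1M tokens long.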
The catch: as written, the recurrence is sequential. Each step depends on the previous one, so it can't be trained in parallel across the sequence the naive way. Mamba works around this with a parallel "scan" algorithm tuned for GPUs, but it's still trickier to implement than vanilla attention.
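Why a scan works at all: a step of the form h_t = a_t · h_{t-1} + b_t can be composed with the next step into another step of the same form, and that composition is associative. The sketch below uses a simple Hillis-Steele scan in NumPy, which is an assumption for illustration and not Mamba's fused kernel. It finishes in about log2(n) vectorized rounds instead of n sequential steps.

```python
import numpy as np


def recurrence_sequential(a, b):
    """h_t = a_t * h_{t-1} + b_t with h_{-1} = 0, one step at a time."""
    h = np.zeros_like(b[0])
    out = np.empty_like(b)
    for t in range(len(a)):
        h = a[t] * h + b[t]
        out[t] = h
    return out


def recurrence_parallel_scan(a, b):
    """Same result via an inclusive scan over (a, b) pairs.

    Each pair (A, B) stands for the map h -> A * h + B over a segment.
    Earlier (A1, B1) followed by later (A2, B2) composes to
    (A2 * A1, A2 * B1 + B2), which is associative.
    """
    a, b = a.copy(), b.copy()
    shift = 1
    while shift < len(a):
        new_a, new_b = a.copy(), b.copy()
        # Every position t >= shift absorbs the segment ending at t - shift.
        new_b[shift:] = a[shift:] * b[:-shift] + b[shift:]
        new_a[shift:] = a[shift:] * a[:-shift]
        a, b = new_a, new_b
        shift *= 2
    return b


rng = np.random.default_rng(0)
a = rng.uniform(0.0, 1.0, size=(37, 3))
b = rng.standard_normal((37, 3))
print(np.allclose(recurrence_sequential(a, b), recurrence_parallel_scan(a, b)))
```

Each round of the `while` loop updates all positions at once, which is the part a GPU can run in parallel.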
Hybrid models such as Jamba interleave Mamba layers with attention layers: Mamba for cheap long-range routing, attention for sharp focus. The most likely future is "attention plus something else," not "attention everywhere."
Even if you solve memory, there's a second hurdle. Models train on, say, 8K-token sequences. At inference, you feed in 200K. The position embeddings have never seen those positions. The model breaks: it stays coherent for ~8K tokens, then dissolves into gibberish. Three techniques fix this.
Position interpolation: the model was trained on positions 0-8K, and you want to use 32K. Linearly rescale positions: position 32000 in the new context becomes position 8000 in the old encoding. Position embeddings stay in the trained range.
Cheap, works decently, but loses precision: neighboring positions get squished closer together in the encoding, so they become harder to tell apart.
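The rescaling is a single multiplication. A minimal sketch, assuming standard RoPE with the common default base of 10000 (the base and the helper names are assumptions, not taken from this module):

```python
import numpy as np


def rope_angles(positions, dim, base=10000.0):
    """Rotation angle for each (position, dimension pair) under standard RoPE."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    return np.outer(np.asarray(positions, dtype=float), inv_freq)


def interpolated_positions(positions, trained_len, target_len):
    """Squeeze positions from [0, target_len) back into [0, trained_len)."""
    return np.asarray(positions, dtype=float) * (trained_len / target_len)


new_pos = interpolated_positions([0, 1, 2, 32000], trained_len=8000, target_len=32000)
print(new_pos)  # positions 1 and 2 now sit only 0.25 apart
print(np.allclose(rope_angles(new_pos[-1:], 64), rope_angles([8000], 64)))
```

The precision loss is visible in the output: adjacent tokens that used to be one position apart are now a quarter of a position apart.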
RoPE rotates pairs of dimensions at different frequencies. Position interpolation squishes all of them equally. NTK-aware scaling squishes low-frequency dimensions more (they carry long-range info) and high-frequency ones less (they carry local info).
The result: better preservation of local detail than plain interpolation, at no extra compute, since only the RoPE frequencies change.
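One commonly cited way to get this uneven squish is to enlarge the RoPE base instead of rescaling positions. The sketch below assumes the widely used formula base' = base · s^(d/(d-2)) for scale factor s and head dimension d. The function names and the default base of 10000 are assumptions for illustration.

```python
import numpy as np


def rope_inv_freq(dim, base=10000.0):
    """Per-pair rotation frequencies for standard RoPE, fastest first."""
    return base ** (-np.arange(0, dim, 2) / dim)


def ntk_scaled_inv_freq(dim, scale, base=10000.0):
    """Frequencies after enlarging the base by scale ** (dim / (dim - 2))."""
    new_base = base * scale ** (dim / (dim - 2))
    return new_base ** (-np.arange(0, dim, 2) / dim)


dim, scale = 64, 4.0
ratio = ntk_scaled_inv_freq(dim, scale) / rope_inv_freq(dim)
print(ratio[0])    # fastest (local) dimension pair: left essentially untouched
print(ratio[-1])   # slowest (long-range) pair: slowed by about 1 / scale
```

Plain position interpolation would multiply every frequency by 1/scale. Here the ratio slides smoothly from about 1 down to about 1/scale across the dimensions.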
Combines NTK-aware scaling with a small amount of fine-tuning on long sequences. The model learns to handle the rescaled positions without further degradation. State of the art for extending context windows of pretrained models.
Used in many open-weight extended-context models. The current standard recipe.
By 2024, every frontier lab was racing on context length. Different approaches, different tradeoffs, different "effective" vs "marketed" context.
Gemini 1.5 set the bar in early 2024 with a 1M-token window. Architecture details are mostly secret, but rumors suggest Ring Attention (split context across GPUs, exchange KV via a ring) plus Flash Attention. The needle-in-a-haystack performance at 1M is genuinely impressive, given that most models degrade well before their marketed limit.
Shorter than Gemini's window, but with very strong "needle in a haystack" recall: Claude can reliably find a single sentence buried anywhere in 200K tokens. For real-world use, quality of recall matters more than raw window size.
OpenAI didn't join the race for the longest context. 128K hits the sweet spot for most practical use cases (whole codebases, long documents) without the recall degradation that plagues 1M+ models. Pragmatic choice.
Mistral 7B was the first major open-weight model to ship with sliding-window attention as the default. Each token can only "see" the previous 4096 — but through stacked layers, information propagates further. Trades raw recall for speed and simplicity.
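"Information propagates further" can be checked on a toy example by composing the window mask once per layer and asking how far back the last token can be influenced. This is a reachability count only, a sketch that says nothing about how well that information survives. The sequence length and window here are tiny, arbitrary values.

```python
import numpy as np


def sliding_window_mask(n, window):
    """Token i sees itself and the previous window - 1 tokens."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (i - j < window)


def reach_after_layers(n, window, layers):
    """How many tokens back the last position can be influenced after `layers` layers."""
    mask = sliding_window_mask(n, window).astype(int)
    reach = np.eye(n, dtype=int)  # before any layer, each token depends only on itself
    for _ in range(layers):
        reach = ((mask @ reach) > 0).astype(int)
    earliest = np.flatnonzero(reach[-1]).min()
    return int((n - 1) - earliest)


for layers in (1, 2, 3, 8):
    print(layers, "layers ->", reach_after_layers(n=64, window=4, layers=layers), "tokens back")
```

Reach grows linearly with depth, by window - 1 tokens per layer in this convention. Assuming a 32-layer stack, a 4096-token window gives a theoretical reach on the order of 131K tokens.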
You saw the O(n²) wall directly. You understand why Flash Attention was so impactful (IO, not math). You compared 5 sparse patterns and know which models use which. You know what Mamba does differently. You can explain RoPE scaling. When you next read "this model has a 1M context window," you'll know to ask: with which attention pattern, with what effective recall, and at what cost.