Course 3 · Module 11 · 85 minutes

Training is half.
Serving is the other.

Pretraining a frontier model can cost on the order of $100M, and it happens once. The model then runs trillions of times. Every API call, every chat turn, every line of generated code is an inference. The economics of every LLM product collapse without inference optimization. This module covers the four tricks that make 200-token-per-second serving possible: KV cache, quantization, speculative decoding, paged attention.

You'll size · The KV cache

You'll watch · Speculative decoding

You'll grasp · Why vLLM exists


Part 01 · The two phases · very different problems

One uses the GPU's
math. The other uses its memory.

Inference has two distinct phases, and they bottleneck on different things. Optimizations that help one often don't help the other — sometimes they conflict. Understanding the split is the first step to thinking clearly about serving.

Prefill · once per request

Process the prompt

// compute-bound

For a 1000-token prompt, all 1000 tokens go through the model at once. One enormous matrix multiplication. The GPU's compute units (Tensor Cores) saturate — they're built for exactly this.

Prefill is fast per-token but expensive total: roughly O(N²) in attention, O(N) in FFN. A 32K context prefill can take several seconds even on an H100.

Bottleneck: Compute · FLOPs

GPU utilization: 80-95%

What helps: Flash Attention
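A back-of-envelope sketch of why prefill cost grows faster than linearly. It assumes two common approximations that are not from this module: about 2 FLOPs per parameter per token for the weight matmuls, and about 4 · N² · d_model FLOPs per layer for attention (scores plus weighted sum, ignoring the savings from causal masking). The 7B-class shape is illustrative only.

```python
def prefill_flops(n_tokens, n_params, n_layers, d_model):
    """Approximate forward-pass FLOPs to prefill a prompt of n_tokens."""
    # Weight matmuls (FFN and projections): ~2 FLOPs per parameter per token, linear in N.
    linear = 2 * n_params * n_tokens
    # Attention scores (QK^T) plus weighted sum (AV): ~4 * N^2 * d_model per layer.
    attention = 4 * n_layers * n_tokens**2 * d_model
    return linear, attention


# Assumed shape, for illustration only: 7B parameters, 32 layers, d_model = 4096.
for n in (1_000, 32_000):
    linear, attention = prefill_flops(n, 7_000_000_000, 32, 4096)
    print(f"N={n:>6}: linear {linear:.2e} FLOPs, attention {attention:.2e} FLOPs, "
          f"attention/linear = {attention / linear:.2f}")
```

Under these assumptions the attention term is a rounding error at 1K tokens and overtakes the linear term at 32K, which is where kernels like Flash Attention earn their keep.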

Decode · per output token

Generate the next token

// memory-bound

For each output token, the model does one forward pass, but that pass has to read the model weights and the entire KV cache from HBM. Almost no compute, almost all memory traffic. The Tensor Cores sit mostly idle while data shuffles.

For a 500-token reply, that's 500 forward passes back-to-back. Each one is memory-bound, so the GPU's headline TFLOPS number is mostly irrelevant. What matters is HBM bandwidth.

Bottleneck: Memory bandwidth

GPU utilization: 15-30%

What helps: Quantization, batching
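To see why quantization and batching are the levers for decode, here is a bandwidth-only bound: each step streams the weights once plus every sequence's KV cache, and nothing else is modeled. The numbers are assumptions for illustration: roughly 3.35 TB/s of HBM bandwidth (the published H100 SXM figure), a 7B-parameter model, and 1 GB of KV cache per sequence.

```python
def decode_tokens_per_sec_bound(weight_bytes, kv_bytes_per_seq, batch_size, hbm_bytes_per_sec):
    """Bandwidth-only upper bound on decode throughput, summed over the batch."""
    # Each decode step reads the weights once (shared by the whole batch)
    # plus the KV cache of every sequence in the batch.
    bytes_per_step = weight_bytes + batch_size * kv_bytes_per_seq
    steps_per_sec = hbm_bytes_per_sec / bytes_per_step
    return batch_size * steps_per_sec


HBM_BANDWIDTH = 3.35e12   # assumption: ~3.35 TB/s
WEIGHTS_FP16 = 7e9 * 2    # assumption: 7B params at 2 bytes each
WEIGHTS_INT8 = 7e9 * 1    # same model at 1 byte per param
KV_PER_SEQ = 1e9          # assumption: 1 GB of KV cache per sequence

for label, weights, batch in [("FP16, batch 1", WEIGHTS_FP16, 1),
                              ("INT8, batch 1", WEIGHTS_INT8, 1),
                              ("FP16, batch 8", WEIGHTS_FP16, 8)]:
    bound = decode_tokens_per_sec_bound(weights, KV_PER_SEQ, batch, HBM_BANDWIDTH)
    print(f"{label}: at most {bound:.0f} tokens/sec")
```

Real throughput lands below this bound, but the shape holds: fewer bytes per weight or more sequences per weight read both raise the ceiling, while extra TFLOPS do not.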

Part 02 · Hands on · The KV cache calculator

How much memory
does "remembering" cost?

For each token you generate, the model has to keep the key and value vectors from every attention layer. This is the KV cache — and it grows linearly with sequence length. At a few thousand tokens it's manageable. At 200K it dominates GPU memory. Quantization is the single biggest lever you have.

The formula.

KV cache size (per sequence) = 2 · num_layers · num_heads · head_dim · seq_len · bytes_per_value. The "2" is for K and V (both are cached). In models with grouped-query attention, num_heads means the number of key/value heads, not query heads. Everything except the last two factors is fixed by the model architecture, which is why quantization (changing bytes per value) and shorter context are the only ways to shrink it without changing the model.
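The formula is easy to turn into a small calculator. This is a minimal sketch: the function and table names are made up for illustration, and the example shape (32 layers, 32 KV heads, head_dim 128) is an assumed 7B-class configuration without grouped-query attention.

```python
BYTES_PER_VALUE = {"fp16": 2.0, "fp8": 1.0, "int8": 1.0, "int4": 0.5}


def kv_cache_bytes(num_layers, num_heads, head_dim, seq_len, precision="fp16"):
    """KV cache size for one sequence; num_heads is the number of key/value heads."""
    # The leading 2 accounts for storing both K and V.
    return 2 * num_layers * num_heads * head_dim * seq_len * BYTES_PER_VALUE[precision]


# Assumed model shape, for illustration only.
LAYERS, KV_HEADS, HEAD_DIM = 32, 32, 128

for seq_len in (4_096, 32_768, 200_000):
    for precision in ("fp16", "fp8", "int4"):
        gib = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, seq_len, precision) / 2**30
        print(f"seq_len={seq_len:>7}  {precision:>4}: {gib:8.2f} GiB")
```

For this shape the cache costs 0.5 MiB per token at FP16, so 4,096 tokens is 2 GiB for a single sequence. Multiply by the number of concurrent sequences to get what a serving batch needs.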


Part 03 · Quantization · the precision spectrum

Sixteen bits or four:
the tradeoff matters.

Going from FP16 to INT4 cuts memory 4× and roughly doubles inference speed. But each step down trades some quality. The current sweet spots: FP8 for serving (matches FP16 quality on most workloads), NF4 for fine-tuning (works because pretrained weights are roughly normally distributed around zero), INT4/AWQ for edge deployment (when memory is the limiting factor).

FP16 / BF16

// the baseline

The default for training and serving in 2020-2022. 2 bytes per param. BF16 has wider exponent range (better for training stability), FP16 has more mantissa precision.

Tradeoff Best quality. Largest memory. Used as the "ground truth" against which all quantized variants are compared.

INT8

// 2× smaller, near-lossless

With 8 bits you get 256 distinct levels per scale block. 1 byte per param. Per-channel or per-token quantization keeps accuracy high. Standard for serving since 2022.

Tradeoff ~2% quality drop on most tasks. 2× smaller, often 1.5-2× faster. The reliable production choice.
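A minimal sketch of what per-channel INT8 quantization does, assuming the simplest scheme (symmetric absmax scaling, one scale per row). Production stacks use more careful calibration; the function names and weights here are made up for illustration.

```python
def quantize_int8(row):
    """Symmetric absmax quantization of one channel to integers in [-127, 127]."""
    # One scale per channel: the largest magnitude maps to 127. Fall back to 1.0 for all-zero rows.
    scale = max(abs(x) for x in row) / 127 or 1.0
    q = [max(-127, min(127, round(x / scale))) for x in row]
    return q, scale


def dequantize_int8(q, scale):
    """Map the stored integers back to approximate float weights."""
    return [v * scale for v in q]


# Two channels with very different ranges; a single shared scale would waste levels on the first.
weights = [[0.02, -0.5, 0.31, 0.127], [4.0, -1.0, 0.25, 2.5]]

for row in weights:
    q, scale = quantize_int8(row)
    restored = dequantize_int8(q, scale)
    max_err = max(abs(a - b) for a, b in zip(row, restored))
    print(f"scale={scale:.5f}  codes={q}  max abs error={max_err:.5f}")
```

The rounding error per weight is at most half a scale step, which is why giving each channel its own scale keeps small-magnitude channels accurate.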

FP8

// hardware-accelerated

Floating-point at 8 bits (with 4-bit exponent and 3-bit mantissa, typically). 1 byte per param. H100s have FP8 Tensor Cores — actual hardware acceleration. Quality essentially indistinguishable from FP16.

Tradeoff The new standard on H100s/H200s. ~2× throughput vs FP16. The current sweet spot for serving.
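To make the 4-bit exponent and 3-bit mantissa concrete, here is a sketch that decodes one FP8 byte. It assumes the E4M3 layout from the OCP 8-bit floating point specification (exponent bias 7, no infinities, one NaN pattern per sign); other FP8 variants such as E5M2 use a different split.

```python
import math


def decode_e4m3(byte):
    """Decode one byte as FP8 E4M3: 1 sign bit, 4 exponent bits, 3 mantissa bits."""
    sign = -1.0 if byte & 0x80 else 1.0
    exponent = (byte >> 3) & 0xF
    mantissa = byte & 0x7
    if exponent == 0xF and mantissa == 0x7:
        return float("nan")                               # the only NaN pattern; no infinities
    if exponent == 0:
        return sign * (mantissa / 8) * 2.0 ** -6          # subnormal: no implicit leading 1
    return sign * (1 + mantissa / 8) * 2.0 ** (exponent - 7)  # normal value, bias 7


decoded = [decode_e4m3(b) for b in range(256)]
values = sorted({v for v in decoded if not math.isnan(v)})
positive = [v for v in values if v > 0]

print("largest value:", max(values))
print("smallest positive value:", positive[0])
print("step just above 1.0:", decode_e4m3(0x39) - decode_e4m3(0x38))
print("step just below the max:", decode_e4m3(0x7E) - decode_e4m3(0x7D))
```

Unlike INT8's evenly spaced levels, the representable values are dense near zero and sparse at large magnitudes, which suits weight and activation distributions that cluster around zero.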

NF4 / INT4 / AWQ

// 4× smaller, mostly intact

Only 16 distinct values per block. 0.5 bytes per param. NF4 (used by QLoRA) and AWQ (used by serving stacks) both target the actual weight distribution. Quality drop ~3-5%.

Tradeoff 4× memory savings. Crucial for fitting big models on single GPUs. The choice when memory dominates everything.
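Where does 0.5 bytes per param come from? Two 4-bit codes share one byte. The sketch below uses plain uniform INT4 with one absmax scale per block, which is an assumption made for simplicity: it is neither NF4 nor AWQ, both of which place their 16 levels more cleverly. All names are illustrative.

```python
def quantize_block_int4(block):
    """Uniform symmetric 4-bit quantization; codes 0..15 stand for integers -8..7."""
    scale = max(abs(x) for x in block) / 7 or 1.0   # fall back to 1.0 for an all-zero block
    codes = [max(-8, min(7, round(x / scale))) + 8 for x in block]
    return codes, scale


def pack_nibbles(codes):
    """Pack two 4-bit codes into each byte."""
    assert len(codes) % 2 == 0, "pad the block to an even length first"
    return bytes((codes[i] << 4) | codes[i + 1] for i in range(0, len(codes), 2))


def unpack_nibbles(packed):
    """Recover the 4-bit codes from packed bytes."""
    codes = []
    for b in packed:
        codes.extend((b >> 4, b & 0xF))
    return codes


def dequantize_block_int4(codes, scale):
    return [(c - 8) * scale for c in codes]


block = [0.12, -0.7, 0.05, 0.33, -0.21, 0.0, 0.6, -0.44]
codes, scale = quantize_block_int4(block)
packed = pack_nibbles(codes)
restored = dequantize_block_int4(unpack_nibbles(packed), scale)

print(f"{len(block)} weights stored in {len(packed)} bytes (plus one scale per block)")
print("max abs error:", max(abs(a - b) for a, b in zip(block, restored)))
```

The per-block scale adds a small overhead on top of the 0.5 bytes per weight, which is why block size is itself a tuning knob.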

Part 04 · Hands on · Speculative decoding

Two models. One verdict.
Mostly faster.

The key insight: verifying tokens is cheaper than generating them. Run a small "draft" model to propose 4-8 tokens, then run the big model once to verify all of them in parallel. Accept the prefix that the big model agrees with, reject the rest. Net effect: 2-3× more output tokens per big-model forward pass.

The protocol.

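Step 1. The small draft model proposes K tokens (say K=4).

Step 2. The big model runs ONE forward pass over those K positions, comparing each draft token to what it would have generated.

Step 3. Accept the longest matching prefix. If the draft was right for the first 3 tokens but diverged on the 4th, accept tokens 1-3, replace token 4 with the big model's choice, and discard anything after it. If all K tokens match, the same pass also yields the big model's next token.

Step 4. Repeat.

With greedy decoding this is exactly equivalent to plain autoregressive decoding: same output, fewer big-model passes. With sampling, the exact-match test is replaced by a rejection-sampling acceptance rule that preserves the big model's output distribution.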
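The protocol can be sketched with two toy "models". Everything here is hypothetical: `target_next` and `draft_next` are deterministic stand-ins for greedy next-token prediction, and the single parallel verification pass is simulated by evaluating the target on each prefix while counting it as one pass. The sketch covers the greedy case only.

```python
def target_next(ctx):
    """Hypothetical big model: greedy next token for a context (toy deterministic rule)."""
    return (sum(ctx) * 31 + len(ctx)) % 50


def draft_next(ctx):
    """Hypothetical draft model: agrees with the target except at every 4th position."""
    tok = target_next(ctx)
    return (tok + 1) % 50 if len(ctx) % 4 == 3 else tok


def greedy_decode(prompt, n_new):
    """Baseline: one big-model pass per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target_next(out))
    return out[len(prompt):], n_new


def speculative_decode(prompt, n_new, k=4):
    out, passes = list(prompt), 0
    while len(out) - len(prompt) < n_new:
        # Step 1: the draft proposes k tokens autoregressively.
        draft = []
        for _ in range(k):
            draft.append(draft_next(out + draft))
        # Step 2: one big-model pass gives the target's token at each of the k positions,
        # plus the position after them. Simulated here by evaluating each prefix.
        passes += 1
        verified = [target_next(out + draft[:i]) for i in range(k + 1)]
        # Step 3: accept the longest matching prefix, then append the target's own token.
        n_ok = 0
        while n_ok < k and draft[n_ok] == verified[n_ok]:
            n_ok += 1
        out.extend(draft[:n_ok])
        out.append(verified[n_ok])
    return out[len(prompt):][:n_new], passes


baseline, baseline_passes = greedy_decode([1, 2, 3], 20)
speculated, spec_passes = speculative_decode([1, 2, 3], 20, k=4)
print("same tokens:", baseline == speculated)
print("big-model passes:", baseline_passes, "vs", spec_passes)
```

The output is identical to the baseline by construction, since every emitted token is one the target would have chosen. The speedup depends entirely on how often the draft agrees.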


Part 05 · The throughput unlock · vLLM and PagedAttention

Memory fragmentation
was the real enemy.

Allocate KV cache like an OS pages memory.

Until 2023, serving stacks pre-allocated KV cache as one big contiguous block per request, sized for the maximum possible sequence length. Result: massive waste. A request that only generated 50 tokens still reserved memory for 4096.

vLLM's PagedAttention (Kwon et al., 2023) treats KV cache like virtual memory. Break it into small fixed-size pages (typically 16 tokens each). Each request gets a logical sequence of page pointers; actual physical pages can be anywhere in HBM.
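A toy version of the page-table idea, to make the indirection concrete. This is a sketch under simplifying assumptions, not vLLM's implementation: the class and method names are invented, pages hold no actual tensors, and there is no sharing or eviction.

```python
class PagedKVAllocator:
    """Toy block table: maps each request's logical pages to physical page ids."""

    def __init__(self, num_physical_pages, page_size=16):
        self.page_size = page_size
        self.free_pages = list(range(num_physical_pages))
        self.block_tables = {}   # request_id -> list of physical page ids
        self.lengths = {}        # request_id -> number of tokens stored so far

    def append_token(self, request_id):
        """Reserve room for one more token, allocating a page only when the last one is full."""
        table = self.block_tables.setdefault(request_id, [])
        length = self.lengths.get(request_id, 0)
        if length == len(table) * self.page_size:
            if not self.free_pages:
                raise MemoryError("no free KV pages")
            table.append(self.free_pages.pop())
        self.lengths[request_id] = length + 1

    def locate(self, request_id, token_index):
        """Translate a logical token position into (physical page id, offset in page)."""
        page = self.block_tables[request_id][token_index // self.page_size]
        return page, token_index % self.page_size

    def free(self, request_id):
        """Return all of a finished request's pages to the pool."""
        self.free_pages.extend(self.block_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)


allocator = PagedKVAllocator(num_physical_pages=8, page_size=16)
for step in range(50):                 # two requests growing at different rates
    allocator.append_token("req-a")
    if step < 17:
        allocator.append_token("req-b")

print("req-a pages:", allocator.block_tables["req-a"])
print("req-b pages:", allocator.block_tables["req-b"])
print("token 17 of req-a lives at:", allocator.locate("req-a", 17))
```

A 50-token request holds 4 pages rather than a 4096-token reservation, and the pages it holds need not be adjacent. The waste per request is bounded by one partially filled page.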

The result is staggering: 2-24× higher throughput than naive batching. Most production serving stacks today (vLLM, TGI, TensorRT-LLM) use page-based KV cache.

Continuous batching is the other half. Instead of waiting for every request in a batch to finish before starting new ones, you swap requests in/out of the batch every decode step. The batch size stays full, even with wildly different sequence lengths. Combined with PagedAttention, it's how a single H100 can serve 100+ concurrent users.
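The difference between the two scheduling policies shows up in a tiny simulation. This sketch counts decode steps only, assuming every request is already prefilled and that one step produces one token for each running request; the output lengths are made up.

```python
def static_batching_steps(output_lengths, batch_size):
    """Static batching: each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(output_lengths), batch_size):
        steps += max(output_lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(output_lengths, batch_size):
    """Continuous batching: finished requests leave and waiting ones join every step."""
    waiting = list(output_lengths)
    running = []          # remaining tokens for each request currently in the batch
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.pop(0))           # fill any free slot
        steps += 1                                    # one decode step for the whole batch
        running = [r - 1 for r in running if r > 1]   # drop requests that just finished
    return steps


lengths = [10, 200, 15, 20, 180, 12, 25, 30]   # tokens each request will generate
print("static:    ", static_batching_steps(lengths, batch_size=4), "decode steps")
print("continuous:", continuous_batching_steps(lengths, batch_size=4), "decode steps")
```

With static batching, the three short requests in the first batch leave their slots idle while the 200-token request finishes. Continuous batching hands those slots to waiting requests immediately.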


Part 06 · What people actually use

Four stacks that
most of inference runs through.

If you're serving an LLM at scale in 2024, you're likely on one of these. Each picks different points on the throughput-latency-flexibility curve.

UC Berkeley · 2023

vLLM

// the open-source default

Distinctive: PagedAttention + continuous batching

Throughput vs naive: 2-24× higher

Supports: Llama, Mistral, Qwen, 50+ models

Multi-LoRA serving: Yes · hot-swap adapters

License: Apache 2.0

The most-used open-source LLM serving stack. Drop-in OpenAI-compatible HTTP server in a single command. Used by individual researchers, startups, and some major labs. The PagedAttention paper made this whole space rethink memory management.

NVIDIA · 2023

TensorRT-LLM

// peak performance on NVIDIA hardware

Distinctive: Hand-tuned NVIDIA kernels

Performance: ~1.5-2× vs vLLM

Supports: FP8 first-class · on H100s

Tradeoff: Compile per-model · slower iteration

Use case: Production at scale

NVIDIA's official serving stack. Each model gets compiled into an optimized engine, with hand-tuned kernels for specific architectures. Fastest if you're serving one or two models at huge scale. Painful for rapid experimentation or many model variants.

UC Berkeley · 2024

SGLang

// programmable inference

Distinctive: RadixAttention · prefix caching

Strength: Structured outputs · agents

Speedup: ~3-5× on shared-prefix workloads

Niche: Tool-use heavy, multi-turn chat

License: Apache 2.0

Designed for complex prompt structures: branching agent loops, parallel tool calls, structured JSON outputs. Automatically caches KV entries for shared prefixes (system prompts, RAG context) so they are computed once and reused. The choice when you're serving agents rather than chat completions.
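The idea behind prefix caching fits in a few lines: index cached token sequences in a tree and, for each new request, find how many leading tokens are already covered. This is a plain trie with invented names, assumed for illustration. It is not SGLang's RadixAttention, which uses a compressed radix tree and also handles eviction and the actual KV tensors.

```python
class PrefixCache:
    """Toy token trie recording which prefixes already have KV entries."""

    def __init__(self):
        self.root = {}

    def match_and_insert(self, tokens):
        """Return how many leading tokens were already cached, then cache the rest."""
        node, hits = self.root, 0
        for tok in tokens:
            if tok in node:
                hits += 1
            # After the first miss, node is a fresh empty dict,
            # so every later token is a miss as well.
            node = node.setdefault(tok, {})
        return hits


cache = PrefixCache()
system_prompt = [101, 102, 103, 104, 105]      # made-up token ids shared by every request

for user_turn in ([7, 8], [9], [7, 8, 10]):
    request = system_prompt + user_turn
    reused = cache.match_and_insert(request)
    print(f"{len(request)} tokens, {reused} reused from cache, {len(request) - reused} to prefill")
```

Every reused token is prefill work that does not have to be repeated, which is where the gain on shared-prefix workloads comes from.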

Hugging Face · 2023

TGI · Text Generation Inference

// HF's serving stack

Distinctive: Tight HF Hub integration

Strength: Easy deployment of HF models

Performance: Comparable to vLLM

Best for: Standard HF model serving

License: Apache 2.0

Hugging Face's own serving stack, integrated with HF Hub. One-command deployment of any HF model with continuous batching, quantization (GPTQ, AWQ, BnB), and OpenAI-compatible API. Often the default for teams already deep in the HF ecosystem.

Course 3 · Module 11 complete

You can now read
a serving spec honestly.

You distinguished prefill from decode. You sized real KV caches. You watched speculative decoding accept and reject. You know why vLLM exists. When someone benchmarks "200 tokens/sec on a single H100," you can now ask the right next questions: at what batch size? at what context length? at what quantization? Those numbers don't exist in isolation.

Up next · Course 3 · Module 12

Vector Databases at Scale

Most production AI systems are RAG-based, not pure LLM calls. The retrieval layer matters as much as the generator. Vector databases are how you store and query billions of embeddings in milliseconds. HNSW, IVF, product quantization, hybrid search. Interactive: watch HNSW build its graph layer by layer and search through it.

