Pretraining a frontier model can cost on the order of $100M, and it happens once. The trained model then runs trillions of times: every API call, every chat turn, every line of generated code is an inference. The economics of every LLM product collapse without inference optimization. This module covers the four techniques that make 200-token-per-second serving possible: the KV cache, quantization, speculative decoding, and paged attention.
Inference has two distinct phases, prefill and decode, and they bottleneck on different things. Optimizations that help one often don't help the other, and sometimes they conflict. Understanding the split is the first step to thinking clearly about serving.
Prefill processes the prompt. For a 1000-token prompt, all 1000 tokens go through the model at once, as one large batch of matrix multiplications. The GPU's compute units (Tensor Cores) saturate, which is exactly what they're built for.
Prefill is fast per-token but expensive total: roughly O(N²) in attention, O(N) in FFN. A 32K context prefill can take several seconds even on an H100.
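To make the O(N²) versus O(N) split concrete, here is a rough FLOP count for one transformer layer during prefill. It is a sketch under stated assumptions, not a profiler: the function name is hypothetical, the layer is assumed to hold about 12·d² weights (4·d² in the attention projections, 8·d² in the FFN), and causal masking and softmax costs are ignored.

```python
def prefill_flops_per_layer(n_tokens, d_model):
    """Return (attention_flops, weight_matmul_flops) for one layer's prefill."""
    # QK^T and scores @ V each cost about 2 * N^2 * d FLOPs: quadratic in N.
    attention = 4 * n_tokens**2 * d_model
    # Assumed layer shape: 4*d^2 attention projection weights + 8*d^2 FFN weights.
    params = 12 * d_model**2
    # Every weight takes part in one multiply-add per token: linear in N.
    weights = 2 * n_tokens * params
    return attention, weights


for n in (1_000, 8_000, 32_000):
    attn, lin = prefill_flops_per_layer(n, d_model=4096)
    print(f"N={n:>6}: attention / weight-matmul FLOP ratio = {attn / lin:.2f}")
```

Under these assumptions the ratio works out to N / (6·d), so the quadratic attention term only overtakes the linear term once the prompt is several times longer than the model width.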
Decode generates the reply one token at a time. For each output token, the model does one forward pass, but to do it the GPU has to read the model weights and the entire KV cache from HBM. Almost no compute, almost all memory traffic. The Tensor Cores sit idle while data shuffles.
For a 500-token reply, that's 500 forward passes back-to-back. Each one is memory-bound, so the GPU's headline TFLOPS number is mostly irrelevant. What matters is HBM bandwidth.
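A back-of-envelope bound shows why bandwidth is the number to watch. The sketch below assumes each decode step must stream the weights and the KV cache from HBM once, and ignores compute, batching, and kernel overheads. The function name and all figures (a 7B-parameter FP16 model, a 1 GB KV cache, 3 TB/s of HBM bandwidth) are illustrative assumptions, not measurements.

```python
def decode_tokens_per_sec(weight_bytes, kv_cache_bytes, hbm_bytes_per_sec):
    """Upper bound on single-sequence decode speed when memory-bound."""
    # Each decode step reads the weights plus the KV cache once.
    bytes_per_step = weight_bytes + kv_cache_bytes
    return hbm_bytes_per_sec / bytes_per_step


# Assumed: 7e9 params at 2 bytes each, 1e9 bytes of KV cache, 3e12 bytes/sec HBM.
bound = decode_tokens_per_sec(7e9 * 2, 1e9, 3e12)
print(f"at most ~{bound:.0f} tokens/sec for one sequence")
```

Halving the bytes per weight roughly doubles this bound, which is the link between decode speed and quantization.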
The model keeps the key and value vectors from every attention layer for every token it has processed, prompt and output alike, so it never has to recompute them. This is the KV cache, and it grows linearly with sequence length. At a few thousand tokens it's manageable. At 200K it dominates GPU memory. Quantization is the single biggest lever you have.
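KV cache size (per sequence) = 2 · num_layers · num_heads · head_dim · seq_len · bytes_per_value. The "2" is for K and V (both are cached). For models that use grouped-query or multi-query attention, num_heads here means the number of key/value heads, which is smaller than the number of query heads. Everything is fixed by the model architecture except the last two terms, which is why quantization (changing bytes per value) and shorter context are the only ways to shrink the cache without changing the model. Multiply by batch size to get the total across concurrent requests.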
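The formula is easy to turn into a calculator. The model shape below (32 layers, 32 key/value heads, head dimension 128) is a hypothetical example chosen for round numbers, and `num_kv_heads` is the number of key/value heads, which equals the total head count unless the model uses grouped-query attention.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_value, batch_size=1):
    """KV cache size in bytes; the leading 2 covers keys and values."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_value * batch_size)


GIB = 1024**3
# Hypothetical model shape; bytes_per_value: 2 = FP16, 1 = INT8 or FP8.
for seq_len in (4_096, 200_000):
    for bytes_per_value in (2, 1):
        size = kv_cache_bytes(32, 32, 128, seq_len, bytes_per_value)
        print(f"seq_len={seq_len:>7}, {bytes_per_value} B/value: {size / GIB:.1f} GiB")
```

For this shape the cache costs 0.5 MiB per token in FP16, so a 200K-token context alone outgrows the memory of a single 80 GB GPU.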
Going from FP16 to INT4 cuts memory 4× and can roughly double decode speed, because decode is limited by how many bytes it has to read. But each step down trades some quality. The current sweet spots: FP8 for serving (matches FP16 quality on most workloads), NF4 for fine-tuning (its 16 levels are spaced to fit the roughly normal, zero-centered distribution of pretrained weights), INT4/AWQ for edge deployment (when memory is the limiting factor).
FP16 / BF16. The default for training and serving in 2020-2022. 2 bytes per param. BF16 has a wider exponent range (better for training stability); FP16 has more mantissa precision.
INT8. 256 integer levels per scale block. 1 byte per param. Per-channel or per-token quantization keeps accuracy high. Standard for serving since 2022.
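A minimal sketch of symmetric 8-bit quantization with one scale per block, in plain Python. The function names are hypothetical, and production kernels differ in many details (calibration, outlier handling, packed storage); this only shows the scale, round, and rescale round trip.

```python
def quantize_int8(values):
    """Symmetric absmax quantization of one block to integers in [-127, 127]."""
    # One scale per block: the largest magnitude maps to +/-127.
    scale = max(abs(v) for v in values) / 127 or 1.0  # avoid a zero scale
    quantized = [round(v / scale) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map the stored integers back to approximate real values."""
    return [q * scale for q in quantized]


block = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_int8(block)
restored = dequantize(q, scale)
worst = max(abs(a - b) for a, b in zip(block, restored))
print(q, f"scale={scale:.5f}", f"max error={worst:.5f}")
```

The rounding error per value is at most half a scale step, which is why smaller blocks (per channel or per token) keep accuracy high: each block's scale only has to cover its own range.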
FP8. An 8-bit floating-point format, typically with a 4-bit exponent and a 3-bit mantissa (E4M3). 1 byte per param. H100s have FP8 Tensor Cores, so this is real hardware acceleration rather than emulation. Quality is essentially indistinguishable from FP16 on most workloads.
INT4 / NF4. Only 16 distinct values per block. 0.5 bytes per param. NF4 (used by QLoRA) and AWQ (used by serving stacks) both target the actual weight distribution. Quality drop is roughly 3-5%.
The key insight: verifying tokens is cheaper than generating them. Run a small "draft" model to propose 4-8 tokens, then run the big model once to verify all of them in parallel. Accept the prefix that the big model agrees with, reject the rest. Net effect: 2-3× more output tokens per big-model forward pass.
Step 1. A small draft model proposes K tokens (say K=4).
Step 2. The big model runs ONE forward pass over those K positions, comparing each draft token to what it would have generated.
Step 3. Accept the longest matching prefix. If the draft was right for the first 3 tokens but diverged on the 4th, accept tokens 1-3, replace token 4 with the big model's choice, and discard the rest.
Step 4. Repeat.
With greedy decoding, this produces exactly the tokens the big model would have produced alone, with fewer big-model passes. With sampling, exact matching is replaced by a probabilistic accept/reject rule that keeps the output distribution the same.
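The steps above can be sketched with toy "models" that map a token list to the next token. Everything here is a hypothetical stand-in: real systems verify all K positions in a single batched forward pass, whereas this toy calls the target once per prefix for clarity, and it covers only the greedy (exact-match) variant.

```python
def speculative_step(target, draft, context, k=4):
    """Return the tokens accepted in one draft-then-verify round (greedy)."""
    # Step 1: the draft model proposes k tokens autoregressively.
    proposal = []
    for _ in range(k):
        proposal.append(draft(context + proposal))
    # Step 2: the target checks every position (one batched pass in practice).
    accepted = []
    for i in range(k):
        expected = target(context + proposal[:i])
        if proposal[i] != expected:
            # Step 3: first mismatch. Keep the target's token, drop the rest.
            accepted.append(expected)
            return accepted
        accepted.append(proposal[i])
    return accepted


def generate(target, draft, context, n_new, k=4):
    """Step 4: repeat rounds until n_new tokens have been produced."""
    out = list(context)
    while len(out) - len(context) < n_new:
        out += speculative_step(target, draft, out, k)
    return out[:len(context) + n_new]


# Toy models: the target counts upward mod 10; the draft is wrong after a 3.
def target(tokens):
    return (tokens[-1] + 1) % 10


def draft(tokens):
    return 0 if tokens[-1] == 3 else (tokens[-1] + 1) % 10


print(speculative_step(target, draft, [0]))  # draft diverges on the 4th token
print(generate(target, draft, [0], n_new=12))
```

Because every accepted token is either confirmed or supplied by the target, the result matches plain greedy decoding with the target alone; the draft only changes how many target passes it takes.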
Until 2023, serving stacks pre-allocated KV cache as one big contiguous block per request, sized for the maximum possible sequence length. Result: massive waste. A request that only generated 50 tokens still reserved memory for 4096.
vLLM's PagedAttention (Kwon et al., 2023) treats KV cache like virtual memory. Break it into small fixed-size pages (typically 16 tokens each). Each request gets a logical sequence of page pointers; actual physical pages can be anywhere in HBM.
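A minimal sketch of the bookkeeping behind a paged KV cache. This is not vLLM's actual code; the class and method names are hypothetical, and it tracks only which physical page holds each token's slot, not the tensors themselves.

```python
class PagedKVAllocator:
    """Toy page table: maps each request's tokens to fixed-size physical pages."""

    def __init__(self, num_pages, page_size=16):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))  # physical page ids
        self.page_tables = {}  # request id -> list of physical page ids
        self.lengths = {}      # request id -> number of tokens stored

    def append_token(self, req_id):
        """Reserve a slot for one more token; return (physical_page, offset)."""
        n = self.lengths.get(req_id, 0)
        table = self.page_tables.setdefault(req_id, [])
        if n % self.page_size == 0:  # no page yet, or the last page is full
            if not self.free_pages:
                raise MemoryError("no free KV pages")
            table.append(self.free_pages.pop())
        self.lengths[req_id] = n + 1
        return table[n // self.page_size], n % self.page_size

    def release(self, req_id):
        """Return a finished request's pages to the free pool."""
        self.free_pages.extend(self.page_tables.pop(req_id, []))
        self.lengths.pop(req_id, None)


alloc = PagedKVAllocator(num_pages=8, page_size=16)
for _ in range(50):
    alloc.append_token("req-a")
print(len(alloc.page_tables["req-a"]), "pages used,", len(alloc.free_pages), "free")
```

A 50-token request holds 4 pages (64 slots) instead of a 4096-slot reservation, and waste is bounded by one partly filled page per request.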
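The throughput gains are large: the vLLM authors reported 2-4× over earlier serving systems at the same latency, and up to 24× over naive batching with Hugging Face Transformers. Most production serving stacks today (vLLM, TGI, TensorRT-LLM) use a page-based KV cache.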
Continuous batching is the other half. Instead of waiting for every request in a batch to finish before starting new ones, you swap requests in and out of the batch at every decode step. The batch stays full, even with wildly different sequence lengths. Combined with PagedAttention, it's how a single H100 can serve 100+ concurrent users.
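A toy simulation makes the difference visible. It counts decode steps only, treats every request as just a number of output tokens, and ignores prefill and memory limits; the function names and the workload are hypothetical.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Finished requests leave and waiting ones join at every decode step."""
    waiting = deque(lengths)
    running = []  # remaining output tokens for each active request
    steps = 0
    while waiting or running:
        # Fill any free slots before the next decode step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        # Each active request emits one token; drop those that just finished.
        running = [r - 1 for r in running if r > 1]
    return steps


# Two long replies mixed with six short ones, four slots in the batch.
lengths = [500, 20, 20, 20, 500, 20, 20, 20]
print("static:    ", static_batching_steps(lengths, batch_size=4))
print("continuous:", continuous_batching_steps(lengths, batch_size=4))
```

In the static case the short requests finish early and their slots sit empty until the 500-token reply completes; in the continuous case the second long request starts as soon as a slot frees up.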
If you're serving an LLM at scale in 2024, you're likely on one of these. Each picks different points on the throughput-latency-flexibility curve.
vLLM. The most-used open-source LLM serving stack. Drop-in OpenAI-compatible HTTP server in a single command. Used by individual researchers, startups, and some major labs. The PagedAttention paper made this whole space rethink memory management.
TensorRT-LLM. NVIDIA's official serving stack. Each model gets compiled into an optimized engine, with hand-tuned kernels for specific architectures. Fastest if you're serving one or two models at huge scale. Painful for rapid experimentation or many model variants.
SGLang. Designed for complex prompt structures: branching agent loops, parallel tool calls, structured JSON outputs. Automatically caches the KV entries for shared prefixes (system prompts, RAG context). The choice when you're serving agents rather than chat completions.
TGI (Text Generation Inference). Hugging Face's own serving stack, integrated with HF Hub. One-command deployment of any HF model with continuous batching, quantization (GPTQ, AWQ, BnB), and an OpenAI-compatible API. Often the default for teams already deep in the HF ecosystem.
You distinguished prefill from decode. You sized real KV caches. You watched speculative decoding accept and reject. You know why vLLM exists. When someone benchmarks "200 tokens/sec on a single H100," you can now ask the right next questions: at what batch size? at what context length? at what quantization? Those numbers don't exist in isolation.