AI Skill Course · Course 3 · Expert
Module 11 of 14 · 85 minutes

Training is half. Serving is the other.

Pretraining a frontier model costs on the order of $100M and happens once. The model then runs trillions of times: every API call, every chat turn, every line of generated code is an inference. Without inference optimization, the economics of an LLM product collapse. This module covers the four techniques that make 200-token-per-second serving possible: the KV cache, quantization, speculative decoding, and paged attention.

You'll size: the KV cache
You'll watch: speculative decoding
You'll grasp: why vLLM exists
Figure: Inference, the hot path. A prompt ("Tell me a...") goes through prefill once (compute-bound, parallel), then decode repeats for every output token (memory-bound, one by one). The KV cache grows with each token and is the central bottleneck. Most of the work is in the loop: a 500-token reply = 500 forward passes.
Part 01 · The two phases · very different problems

One uses the GPU's math. The other uses its memory.

Inference has two distinct phases, and they bottleneck on different things. Optimizations that help one often don't help the other — sometimes they conflict. Understanding the split is the first step to thinking clearly about serving.

Prefill · once per request

Process the prompt

// compute-bound
N input tokens, all processed in parallel as one big matrix multiply. The GPU loves this.

For a 1000-token prompt, all 1000 tokens go through the model at once. One enormous matrix multiplication. The GPU's compute units (Tensor Cores) saturate — they're built for exactly this.

Prefill is fast per-token but expensive total: roughly O(N²) in attention, O(N) in FFN. A 32K context prefill can take several seconds even on an H100.

Bottleneck: Compute · FLOPs
GPU utilization: 80-95%
What helps: Flash Attention
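
As a rough sanity check on the O(N²) versus O(N) split, the sketch below estimates forward-pass FLOPs for a prompt. It assumes two common approximations: about 2 FLOPs per parameter per token for the dense layers, and about 4 · N² · d_model FLOPs per layer for attention. The 7B-parameter, 32-layer, 4096-wide model is a hypothetical configuration chosen for illustration, not one taken from this module.

```python
def prefill_flops(n_tokens, n_params, n_layers, d_model):
    """Rough forward-pass FLOPs for a prompt of n_tokens.

    dense: ~2 FLOPs per parameter per token (all weight matmuls, FFN included).
    attn:  QK^T plus scores @ V, each ~2 * N^2 * d_model FLOPs per layer
           (no discount for causal masking).
    """
    dense = 2 * n_params * n_tokens
    attn = 4 * n_layers * n_tokens**2 * d_model
    return dense, attn


# Hypothetical model: 7B parameters, 32 layers, hidden size 4096.
for n in (1_000, 8_000, 32_000):
    dense, attn = prefill_flops(n, 7e9, 32, 4096)
    share = attn / (dense + attn)
    print(f"N={n:>6}: dense {dense:.2e}, attention {attn:.2e}, "
          f"attention share {share:.0%}")
```

Doubling the prompt doubles the dense term but quadruples the attention term, which is why long-context prefill is where attention kernels matter most.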
Decode · per output token

Generate the next token

// memory-bound
One token at a time, each attending to ALL previous keys and values in the KV cache, which already sits in HBM. Tiny compute, giant memory reads: the GPU starves.

For each output token, the model does one forward pass, but that pass has to read the model weights and the entire KV cache from HBM. Almost no compute, almost all memory traffic. The Tensor Cores sit idle while data shuffles.

For a 500-token reply, that's 500 forward passes back-to-back. Each one is memory-bound, so the GPU's headline TFLOPS number is mostly irrelevant. What matters is HBM bandwidth.

Bottleneck: Memory bandwidth
GPU utilization: 15-30%
What helps: Quantization, batching
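
To see why bandwidth rather than TFLOPS sets the decode rate, here is a minimal back-of-the-envelope sketch. It assumes each decode step reads the weights once (shared across the batch) plus one KV cache per sequence, and that nothing else limits speed. The sizes and the ~3 TB/s bandwidth figure are illustrative assumptions, not measurements.

```python
def decode_tokens_per_sec(weight_bytes, kv_bytes, hbm_bytes_per_sec, batch_size=1):
    """Upper bound on decode throughput when every step is limited by HBM reads.

    Each step reads the weights once (shared by the whole batch) plus
    one KV cache per sequence in the batch.
    """
    bytes_per_step = weight_bytes + batch_size * kv_bytes
    steps_per_sec = hbm_bytes_per_sec / bytes_per_step
    return steps_per_sec * batch_size  # one token per sequence per step


GB = 1e9
weights = 14 * GB       # assumed: 7B parameters at 2 bytes each
kv = 1 * GB             # assumed: KV cache per sequence
bandwidth = 3000 * GB   # assumed: roughly 3 TB/s of HBM bandwidth

for batch in (1, 8, 32):
    rate = decode_tokens_per_sec(weights, kv, bandwidth, batch)
    print(f"batch {batch:>2}: at most {rate:,.0f} tokens/sec in total")
```

Both levers from the table show up directly: quantization shrinks the bytes read per step, and batching amortizes the weight read across more sequences.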
Part 02 · Hands on · The KV cache calculator

How much memory does "remembering" cost?

For each token you generate, the model has to keep the key and value vectors from every attention layer. This is the KV cache — and it grows linearly with sequence length. At a few thousand tokens it's manageable. At 200K it dominates GPU memory. Quantization is the single biggest lever you have.

The formula.

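KV cache size = 2 · num_layers · num_kv_heads · head_dim · seq_len · bytes_per_value. The "2" covers K and V, since both are cached. num_kv_heads is the number of key/value heads: it equals the number of attention heads in standard multi-head attention and is smaller in models that use grouped-query attention. The result is per sequence, so multiply by batch size for a full serving workload. Everything except the last two factors is fixed by the model architecture, which is why quantization (fewer bytes per value) and shorter context are the only ways to shrink the cache without changing the model.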
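
The formula translates directly into a small calculator. This is a sketch under stated assumptions: the 32-layer, 32-head, 128-dim configuration is a hypothetical model with full multi-head attention (so the KV head count equals the attention head count), and the function name and parameters are illustrative.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_value, batch_size=1):
    """KV cache size: 2 (K and V) x layers x KV heads x head_dim x tokens x bytes."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_value * batch_size)


# Assumed configuration, similar in shape to a 7B model with multi-head attention.
cfg = dict(num_layers=32, num_kv_heads=32, head_dim=128)

GIB = 2**30
for name, nbytes in (("FP16", 2), ("FP8/INT8", 1), ("INT4", 0.5)):
    size = kv_cache_bytes(seq_len=8192, bytes_per_value=nbytes, **cfg)
    print(f"{name:>8}: {size / GIB:.1f} GiB for one 8K-token sequence")
```

Because every factor multiplies, halving the bytes per value halves the cache, and so does halving the sequence length.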

Interactive: KV cache calculator. Pick a model, a sequence length (default 8K), and a precision to see the memory math for the current selection and to compare memory by precision for the same model and sequence length.
Part 03 · Quantization · the precision spectrum

Sixteen bits or four: the tradeoff matters.

Going from FP16 to INT4 cuts memory 4× and roughly doubles inference speed. But each step down trades some quality. The current sweet spots: FP8 for serving (matches FP16 quality on most workloads), NF4 for fine-tuning (works because trained weights are roughly normally distributed around zero, which is the distribution NF4's levels are spaced for), INT4/AWQ for edge deployment (when memory is the limiting factor).
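
The memory side of the tradeoff is simple arithmetic. The sketch below computes weight memory only, using the bytes-per-parameter figures from this section; the 70B parameter count is an assumed example, and real quantized checkpoints carry a small extra overhead for scales that is ignored here.

```python
BYTES_PER_PARAM = {"FP16/BF16": 2.0, "INT8": 1.0, "FP8": 1.0, "NF4/INT4": 0.5}


def weight_memory_gb(n_params, precision):
    """Memory for the weights alone, in GB (10^9 bytes); ignores scales and KV cache."""
    return n_params * BYTES_PER_PARAM[precision] / 1e9


# Assumed example: a 70B-parameter model.
for precision in BYTES_PER_PARAM:
    print(f"70B params at {precision:>9}: {weight_memory_gb(70e9, precision):6.1f} GB")
```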

FP16 / BF16 · 16-bit

// the baseline

The default for training and serving in 2020-2022. 2 bytes per param. BF16 has wider exponent range (better for training stability), FP16 has more mantissa precision.

Tradeoff: Best quality. Largest memory. Used as the "ground truth" against which all quantized variants are compared.

INT8 · 8-bit

// 2× smaller, near-lossless

256 distinct integer levels per scale block. 1 byte per param. Per-channel or per-token quantization keeps accuracy high. Standard for serving since 2022.

Tradeoff: ~2% quality drop on most tasks. 2× smaller, often 1.5-2× faster. The reliable production choice.

FP8 · 8-bit

// hardware-accelerated

Floating-point at 8 bits, typically in the E4M3 layout: 1 sign bit, 4 exponent bits, 3 mantissa bits. 1 byte per param. H100s have FP8 Tensor Cores, so the format is accelerated in hardware. Quality is essentially indistinguishable from FP16 on most workloads.

Tradeoff: The new standard on H100s/H200s. ~2× throughput vs FP16. The current sweet spot for serving.

NF4 / INT4 / AWQ · 4-bit

// 4× smaller, mostly intact

Only 16 distinct values per block. 0.5 bytes per param. NF4 (used by QLoRA) spaces its levels to match the roughly normal distribution of the weights; AWQ (used by serving stacks) uses activation statistics to protect the weights that matter most. Quality drop ~3-5%.

Tradeoff: 4× memory savings. Crucial for fitting big models on single GPUs. The choice when memory dominates everything.
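
To make "fewer levels, more error" concrete, here is a minimal sketch of symmetric absmax quantization on one block of weights. It is the simplest integer scheme, not NF4 or AWQ, and the weight values are made up for illustration. Note that the symmetric form uses 255 of the 256 INT8 codes and 15 of the 16 INT4 codes.

```python
def quantize_absmax(values, bits):
    """Symmetric absmax quantization of one block to signed `bits`-bit integers."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for INT8, 7 for INT4
    scale = max(abs(v) for v in values) / qmax
    if scale == 0:
        scale = 1.0                                 # all-zero block: any scale works
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point values."""
    return [c * scale for c in codes]


block = [0.02, -0.11, 0.07, 0.45, -0.30, 0.01, -0.04, 0.19]  # made-up weights

for bits in (8, 4):
    codes, scale = quantize_absmax(block, bits)
    restored = dequantize(codes, scale)
    worst = max(abs(a - b) for a, b in zip(block, restored))
    print(f"INT{bits}: codes={codes} max abs error={worst:.4f}")
```

The worst-case rounding error is half a step, and the step is 127/7 (about 18×) larger at 4 bits than at 8 bits for the same block, which is why 4-bit schemes invest in smarter level placement.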
Part 04 · Hands on · Speculative decoding

Two models. One verdict. Mostly faster.

The key insight: verifying tokens is cheaper than generating them. Run a small "draft" model to propose 4-8 tokens, then run the big model once to verify all of them in parallel. Accept the prefix that the big model agrees with, reject the rest. Net effect: 2-3× more output tokens per big-model forward pass.

The protocol.

Step 1. The small draft model proposes K tokens (say K=4).
Step 2. The big model runs ONE forward pass over those K positions, comparing each draft token to what it would have generated.
Step 3. Accept the longest matching prefix. If the draft was right for the first 3 tokens but diverged on the 4th, accept tokens 1-3 and replace token 4 with the big model's choice. Any draft tokens after the first mismatch are discarded.
Step 4. Repeat.

With greedy decoding, this produces exactly the tokens that plain autoregressive decoding would, using fewer big-model passes. With sampling, the exact-match test is replaced by a rejection-sampling acceptance rule that preserves the big model's output distribution.
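
The protocol above can be sketched with toy "models" that are plain functions from a token list to the next token. This is a greedy-decoding illustration only: the two toy functions are invented for the example, the per-position calls stand in for what a real model does in one batched forward pass, and the extra "bonus" token taken when every draft token matches is a common refinement that the steps above do not spell out.

```python
def greedy_decode(target_next, prompt, max_new_tokens):
    """Plain autoregressive decoding: one target pass per output token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target_next(tokens))
    return tokens


def speculative_decode(target_next, draft_next, prompt, max_new_tokens, k=4):
    """Greedy speculative decoding. Returns (tokens, number_of_target_passes)."""
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    passes = 0
    while len(tokens) < goal:
        # Step 1: the draft model proposes k tokens autoregressively.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))

        # Step 2: one target pass scores all k positions plus one bonus position.
        # Emulated here with one toy call per position.
        passes += 1
        target = [target_next(tokens + draft[:i]) for i in range(k + 1)]

        # Step 3: accept the longest matching prefix, then take the target's
        # token at the first mismatch (or the bonus token if all k matched).
        n_ok = 0
        while n_ok < k and draft[n_ok] == target[n_ok]:
            n_ok += 1
        tokens.extend(draft[:n_ok])
        tokens.append(target[n_ok])
    return tokens[:goal], passes


def target_next(ctx):
    """Toy target model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def draft_next(ctx):
    """Toy draft: agrees with the target unless the context length is a multiple of 3."""
    return -1 if len(ctx) % 3 == 0 else (ctx[-1] + 1) % 10


out, passes = speculative_decode(target_next, draft_next, [0], 20, k=4)
print("same output as greedy:", out == greedy_decode(target_next, [0], 20))
print("target passes:", passes, "vs 20 for plain decoding")
```

Every emitted token is either a draft token the target agreed with or the target's own choice, so the output cannot differ from plain greedy decoding; only the number of target passes changes.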

Interactive: speculative decoding on the prompt "The capital of France is". Press play to watch the decoding loop. Counters: tokens generated (total output tokens), big-model passes (vanilla decoding needs one pass per token), and speedup (tokens per big-model pass).
Part 05 · The throughput unlock · vLLM and PagedAttention

Memory fragmentation was the real enemy.

Allocate KV cache like an OS pages memory.

Until 2023, serving stacks pre-allocated KV cache as one big contiguous block per request, sized for the maximum possible sequence length. Result: massive waste. A request that only generated 50 tokens still reserved memory for 4096.

vLLM's PagedAttention (Kwon et al., 2023) treats KV cache like virtual memory. Break it into small fixed-size pages (typically 16 tokens each). Each request gets a logical sequence of page pointers; actual physical pages can be anywhere in HBM.
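
The page-table idea can be sketched in a few lines. This is a toy allocator that only tracks which physical pages belong to which request; it is not vLLM's implementation, and the class and method names are hypothetical.

```python
class PagedKVAllocator:
    """Toy block table: maps each request's logical pages to physical page ids."""

    def __init__(self, num_pages, page_size=16):
        self.page_size = page_size
        self.free = list(range(num_pages))  # physical pages not in use
        self.tables = {}                    # request id -> list of physical page ids
        self.lengths = {}                   # request id -> tokens stored so far

    def append_token(self, req):
        """Reserve room for one more token, taking a new page only when needed."""
        n = self.lengths.get(req, 0)
        if n % self.page_size == 0:         # first token, or current page is full
            if not self.free:
                raise MemoryError("no free KV pages")
            self.tables.setdefault(req, []).append(self.free.pop())
        self.lengths[req] = n + 1

    def release(self, req):
        """Return all of a finished request's pages to the free pool."""
        self.free.extend(self.tables.pop(req, []))
        self.lengths.pop(req, None)


alloc = PagedKVAllocator(num_pages=8)
for _ in range(50):                         # a request that generates 50 tokens
    alloc.append_token("req-a")
print("pages used:", len(alloc.tables["req-a"]), "of 8")
```

A 50-token request holds 4 pages of 16 tokens, where a contiguous reservation for 4096 tokens would have tied up the equivalent of 256 pages.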

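The reported gains are large. The vLLM authors measured 2-4× higher throughput than earlier optimized serving systems at similar latency, and up to 24× over a plain Hugging Face Transformers baseline, which is where the 2-24× range comes from. Most production serving stacks today (vLLM, TGI, TensorRT-LLM) use a page-based KV cache.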

Continuous batching is the other half. Instead of waiting for every request in a batch to finish before starting new ones, you swap requests in/out of the batch every decode step. The batch size stays full, even with wildly different sequence lengths. Combined with PagedAttention, it's how a single H100 can serve 100+ concurrent users.
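
A small simulation shows where the gain comes from. It counts decode steps for the same set of requests under both policies, assuming a fixed number of batch slots and one token per slot per step. The output lengths are made up, and prefill cost is ignored.

```python
from collections import deque


def static_batching_steps(lengths, slots):
    """Each batch runs until its longest request finishes."""
    return sum(max(lengths[i:i + slots]) for i in range(0, len(lengths), slots))


def continuous_batching_steps(lengths, slots):
    """Refill free slots from the queue before every decode step."""
    queue = deque(lengths)
    running = []                  # tokens still to generate per active request
    steps = 0
    while queue or running:
        while queue and len(running) < slots:
            running.append(queue.popleft())
        running = [r - 1 for r in running]       # one decode step for every slot
        running = [r for r in running if r > 0]  # finished requests leave the batch
        steps += 1
    return steps


lengths = [50, 500, 40, 60, 450, 30, 70, 20]     # made-up output lengths
slots = 4
total_tokens = sum(lengths)

for name, fn in (("static", static_batching_steps),
                 ("continuous", continuous_batching_steps)):
    steps = fn(lengths, slots)
    print(f"{name:>10}: {steps} steps, "
          f"slot utilization {total_tokens / (steps * slots):.0%}")
```

In the static case the short requests finish early and their slots sit empty until the longest request in the batch completes; in the continuous case those slots are reused immediately.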

// static vs continuous batching
Figure: Static batching (old): all requests wait for the slowest in the batch, leaving wasted GPU slots, at ≈ 40-60% GPU utilization. Continuous batching (vLLM): new requests slot in as old ones finish, at ≈ 85-95% GPU utilization. The throughput unlock: 2-24× higher tokens/sec with the same GPU, which is why every modern serving stack does this.
Part 06 · What people actually use

Four stacks that most of inference runs through.

If you're serving an LLM at scale in 2024, you're likely on one of these. Each picks different points on the throughput-latency-flexibility curve.

UC Berkeley · 2023

vLLM

// the open-source default
Distinctive: PagedAttention + continuous batching
Throughput vs naive: 2-24× higher
Supports: Llama, Mistral, Qwen, 50+ models
Multi-LoRA serving: Yes · hot-swap adapters
License: Apache 2.0

The most-used open-source LLM serving stack. Drop-in OpenAI-compatible HTTP server in a single command. Used by individual researchers, startups, and some major labs. The PagedAttention paper made this whole space rethink memory management.
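
Because the server speaks the OpenAI HTTP API, a client needs nothing beyond the standard library. The sketch below assumes a server is already running locally; the base URL, port, and model name are placeholders for your own setup, and the helper names are hypothetical. Nothing is sent until `complete` is called.

```python
import json
import urllib.request


def build_completion_request(prompt, model, base_url="http://localhost:8000",
                             max_tokens=64, temperature=0.0):
    """Build a POST request for an OpenAI-style /v1/completions endpoint."""
    payload = {"model": model, "prompt": prompt,
               "max_tokens": max_tokens, "temperature": temperature}
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, model, **kwargs):
    """Send the request and return the first completion's text."""
    req = build_completion_request(prompt, model, **kwargs)
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]


# Usage, once a server is up (model name is a placeholder):
# print(complete("The capital of France is", model="my-model"))
```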

NVIDIA · 2023

TensorRT-LLM

// peak performance on NVIDIA hardware
Distinctive: Hand-tuned NVIDIA kernels
Performance: ~1.5-2× vs vLLM
Supports: FP8 first-class · on H100s
Tradeoff: Compile per-model · slower iteration
Use case: Production at scale

NVIDIA's official serving stack. Each model gets compiled into an optimized engine, with hand-tuned kernels for specific architectures. Fastest if you're serving one or two models at huge scale. Painful for rapid experimentation or many model variants.

UC Berkeley · 2024

SGLang

// programmable inference
Distinctive: RadixAttention · prefix caching
Strength: Structured outputs · agents
Speedup: ~3-5× on shared-prefix workloads
Niche: Tool-use heavy, multi-turn chat
License: Apache 2.0

Designed for complex prompt structures: branching agent loops, parallel tool calls, structured JSON outputs. Automatically reuses the KV cache for shared prefixes (system prompts, RAG context) across requests. The choice when you're serving agents rather than chat completions.
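
The benefit of prefix reuse can be illustrated with a toy token trie that counts how many leading tokens of each new prompt were already seen. This is a simplified stand-in, not SGLang's RadixAttention: it stores no actual key/value tensors and never evicts, and the class name is hypothetical.

```python
class PrefixCache:
    """Toy token trie: counts how many prompt tokens hit an already-cached prefix."""

    def __init__(self):
        self.root = {}

    def match_and_insert(self, tokens):
        """Return the number of leading tokens already cached, then cache the rest."""
        node, hits = self.root, 0
        for tok in tokens:
            if tok in node:
                hits += 1          # still walking an existing prefix
            else:
                node[tok] = {}     # first miss: everything after this is new
            node = node[tok]
        return hits


cache = PrefixCache()
system = ["You", "are", "a", "helpful", "assistant", "."]   # shared system prompt

for question in (["What", "is", "vLLM", "?"], ["What", "is", "SGLang", "?"]):
    prompt = system + question
    reused = cache.match_and_insert(prompt)
    print(f"{reused} of {len(prompt)} prompt tokens skip prefill")
```

Every reused token is a token whose prefill work does not have to be repeated, which is why workloads with long shared system prompts or RAG context benefit most.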

Hugging Face · 2023

TGI · Text Generation Inference

// HF's serving stack
Distinctive: Tight HF Hub integration
Strength: Easy deployment of HF models
Performance: Comparable to vLLM
Best for: Standard HF model serving
License: Apache 2.0

Hugging Face's own serving stack, integrated with HF Hub. One-command deployment of any HF model with continuous batching, quantization (GPTQ, AWQ, BnB), and OpenAI-compatible API. Often the default for teams already deep in the HF ecosystem.

Part 07 · Knowledge check

Five questions on what you just accelerated.

Aim for 4/5. Wrong answers explain themselves.

Course 3 · Module 11 complete

You can now read a serving spec honestly.

You distinguished prefill from decode. You sized real KV caches. You watched speculative decoding accept and reject. You know why vLLM exists. When someone benchmarks "200 tokens/sec on a single H100," you can now ask the right next questions: at what batch size? at what context length? at what quantization? Those numbers don't exist in isolation.

Up next · Course 3 · Module 12

Vector Databases at Scale

Most production AI systems are RAG-based, not pure LLM calls. The retrieval layer matters as much as the generator. Vector databases are how you store and query billions of embeddings in milliseconds. HNSW, IVF, product quantization, hybrid search. Interactive: watch HNSW build its graph layer by layer and search through it.
