Pre-training a frontier model costs $100M+. Until 2021, fine-tuning your own version cost $100k+ — full passes over billions of parameters. LoRA flipped this. Train tiny adapter matrices instead. Match full fine-tuning quality. QLoRA pushed further — Llama-70B fits on a single GPU. This module covers the math, the memory, the recipes.
Tune it downFine-tuning is irreversible engineering. Every variant of the model you fine-tune needs its own training run, its own evaluation, its own deployment slot. Cheaper approaches usually win — and most of the time, you don't need to fine-tune at all.
The escalation order: prompting → few-shot examples → tool use → RAG → LoRA → full fine-tuning. Each step adds cost and complexity. Stop at the first one that meets your quality bar.
When does fine-tuning genuinely win? Style and tone the model can't pick up from examples. Specialized vocabulary (medical, legal, internal jargon). Output format consistency for structured outputs at scale. Latency-critical inference where you can't afford long prompts. And — increasingly rarely — domain knowledge a frontier model genuinely lacks.
What fine-tuning doesn't fix: hallucinations, factual recall, reasoning errors, knowledge cutoffs. If your problem is "the model doesn't know X," the answer is usually retrieval, not fine-tuning.
LoRA's bet (Hu et al., 2021): when you fine-tune a model on a new task, the change to each weight matrix has low intrinsic rank — meaning it can be expressed as the product of two much smaller matrices. So instead of training the full W, train two small ones (A and B) whose product approximates the update. Same expressive change, 100-1000× fewer trainable parameters.
Two knobs: d (model dim) and r (LoRA rank). The visualization shows W as a frozen blue square, and the trainable LoRA pair B (d×r) and A (r×d). Slide r down — see how the orange rectangles get thin. Slide d up — see what happens at real model sizes (4096 for Llama-7B, 8192 for 70B). Watch the trainable-parameter count fall by orders of magnitude.
The reason fine-tuning got accessible isn't just fewer trainable params — it's fewer optimizer states. AdamW keeps 2× extra memory per trainable param (momentum + variance). Full fine-tuning a 70B model needs 1.4 TB. LoRA needs 160 GB. QLoRA needs 48 GB — fits on one GPU.
Four categories: model weights (always loaded), LoRA params (only with LoRA), optimizer state (2× trainable params at BF16), activations (depends on batch + seq length). For full fine-tuning, optimizer state dominates. For LoRA, model weights dominate. For QLoRA, even the weights shrink 4× via 4-bit quantization. The ratios matter more than the raw numbers.
If you're not training W (LoRA freezes it), why keep it in BF16? QLoRA's insight (Dettmers et al., 2023): quantize the frozen base model to 4-bit precision — saving 4× memory — and only keep the small LoRA adapters at full precision.
The technical trick is NF4 — "NormalFloat 4-bit" — a number format optimized for the actual distribution of neural network weights (which is roughly normally distributed). Standard 4-bit integer quantization loses too much information. NF4 keeps almost all of it.
During training, the frozen W stays 4-bit in memory. When you need it for a forward/backward pass, it's dequantized on-the-fly in small blocks (256 elements at a time) — fast enough that compute isn't the bottleneck. Gradients only flow through A and B, which stay BF16.
The result is staggering: Llama-70B fine-tuning, which used to need a multi-node cluster, now fits on a single A6000 (48 GB) or H100 (80 GB) — and matches the quality of full 16-bit LoRA. The QLoRA paper showed this works for models up to 65B without measurable quality loss.
LoRA looks like it has dozens of hyperparameters. In practice, four of them dominate. Set these well, and the rest barely matters.
rThe size of the adapter matrices. r=8 is the de facto default — works well across most tasks. Use r=16 or 32 if your task involves significant style/format shifts (long-form writing, structured generation). Bigger r usually hurts more than helps — it overfits faster.
αThe "loudness" of the LoRA update. The actual contribution to W is (α/r) · BA. Standard recipe: set α = 2r (so the ratio is constant at 2). This decouples r from learning rate sensitivity — you can change r without re-tuning the LR.
Which weight matrices get LoRA adapters attached. Just attaching to attention's Q and V works surprisingly well. Attaching to all linear layers (Q, K, V, O, gate, up, down) gives a bit more quality but 4-7× the trainable params. The original LoRA paper showed Q+V is enough for most cases.
q_proj, v_projq, k, v, o, gate, up, downLoRA can take much higher learning rates than full fine-tuning — typically 10-100× higher. Why: the small number of trainable params makes the loss surface smoother. 1e-4 to 3e-4 is a good default; some recipes go up to 1e-3 for short trainings.
LoRA went from research paper to industry default in about 18 months. By 2024, "ship a custom model" in practice means "train a LoRA adapter."
The first wave of mass LoRA adoption. Train an adapter on 20-100 images of a style or character, share the tiny adapter file, and anyone can apply it to their Stable Diffusion. Made customization a community phenomenon, not a corporate workflow.
The de facto Python library for LoRA training. model = get_peft_model(model, lora_config) and you're done. The companion library bitsandbytes handles QLoRA's 4-bit quantization. If you've fine-tuned any open LLM since 2023, you've used PEFT.
The economic killer feature: train one LoRA per customer, host thousands on the same base model. Switch adapters per request. Each tenant gets a "custom model" — the provider hosts ten thousand adapters on a handful of GPUs. This is how AI platforms offer "custom models" affordably.
The reason the open-weight ecosystem exists. Anyone with a GPU can produce a custom version of Llama, Mistral, or Qwen for $50-500, in hours. Combined with model weights being free, this is what makes Hugging Face's model zoo possible.
Aim for 4/5. Wrong answers explain themselves.
You saw why low-rank updates work. You computed actual GPU memory for every Llama size. You understand why QLoRA fits a 70B fine-tune on a single GPU. When someone says "we trained our own model," in 2024 they almost certainly mean "we trained a LoRA adapter." Once you've internalized this, the modern customization landscape becomes obvious.
Continue to Module 11