AI Skill Course · Course 3 (Expert) · Module 10 of 14 · 90 minutes

0.1%
of the params.
95% of the gain.

Pre-training a frontier model costs $100M+. Until 2021, fine-tuning your own version cost $100k+, because every training step updated billions of parameters. LoRA flipped this: freeze the pretrained weights, train tiny adapter matrices instead, and get close to full fine-tuning quality. QLoRA pushed further, fitting a Llama-70B fine-tune on a single GPU. This module covers the math, the memory, the recipes.

You'll see: The rank trick
You'll compute: GPU memory needs
You'll learn: When to fine-tune
[Figure: Forward pass with LoRA. The input x passes through the frozen pretrained W (4096×4096) and, in parallel, through the trainable pair A (r×d) and B (d×r); the two outputs are summed to give y.]
y = W·x + (α/r) · B·A·x (frozen pretrained + tiny trainable update)
If r=8: only 65,536 trainable params per layer vs 16.8M for full W. That is 256× fewer, so much less memory and much faster training.
Part 01 · When to fine-tune (and when not to)

Most "we need a custom model"
problems aren't.

Try cheaper things first.

Fine-tuning is a long-term engineering commitment. Every variant of the model you fine-tune needs its own training run, its own evaluation, its own deployment slot. Cheaper approaches usually win, and most of the time you don't need to fine-tune at all.

The escalation order: prompting → few-shot examples → tool use → RAG → LoRA → full fine-tuning. Each step adds cost and complexity. Stop at the first one that meets your quality bar.

When does fine-tuning genuinely win? Style and tone the model can't pick up from examples. Specialized vocabulary (medical, legal, internal jargon). Output format consistency for structured outputs at scale. Latency-critical inference where you can't afford long prompts. And — increasingly rarely — domain knowledge a frontier model genuinely lacks.

What fine-tuning doesn't fix: hallucinations, factual recall, reasoning errors, knowledge cutoffs. If your problem is "the model doesn't know X," the answer is usually retrieval, not fine-tuning.

// the escalation order
1. Better prompting: clearer instructions, structured output, CoT. Still wrong?
2. Few-shot examples: 3-10 input→output pairs in the prompt. Still wrong?
3. Add tools / agent: model can search, compute, query DBs. Need domain data?
4. RAG: retrieve relevant docs, stuff into context. Need style / format change?
5. LoRA fine-tuning: small adapter weights, days of training. Need bigger change?
6. Full fine-tuning: all weights · weeks of training · $$$

Most teams stop at step 4 forever. Step 5 (LoRA) is the sweet spot when prompting + RAG isn't quite enough. Step 6 is rare outside frontier labs.
Part 02 · Hands on · The rank trick

Most weight updates
are low rank.

LoRA's bet (Hu et al., 2021): when you fine-tune a model on a new task, the change to each weight matrix has low intrinsic rank — meaning it can be expressed as the product of two much smaller matrices. So instead of training the full W, train two small ones (A and B) whose product approximates the update. Same expressive change, 100-1000× fewer trainable parameters.
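To make the bet concrete, here is a minimal sketch of a LoRA forward pass in plain Python, using the module's formula y = W·x + (α/r)·B·A·x. The tiny sizes (d=6, r=2), the random values, and the helper names `matvec` and `lora_forward` are illustrative assumptions. B starts at zero, as in the LoRA paper, so the adapter is a no-op before training.

```python
import random

def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha):
    """y = W·x + (alpha / r) · B·(A·x). W is frozen; only A and B would be trained."""
    r = len(A)                         # A is r×d, B is d×r
    base = matvec(W, x)                # frozen pretrained path
    update = matvec(B, matvec(A, x))   # low-rank path: d -> r -> d
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

random.seed(0)
d, r, alpha = 6, 2, 4                  # toy sizes, chosen only for illustration
W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]
A = [[random.gauss(0, 0.01) for _ in range(d)] for _ in range(r)]
B = [[0.0] * r for _ in range(d)]      # zero init: the update starts at exactly zero
x = [random.gauss(0, 1) for _ in range(d)]

y = lora_forward(W, A, B, x, alpha)
print(y == matvec(W, x))               # before any training, output equals the base model's
```

Note that the low-rank path never builds the d×d product B·A. It multiplies by A first, then by B, which is where the compute saving comes from.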

What you're tuning below.

Two knobs: d (model dim) and r (LoRA rank). The visualization shows W as a frozen blue square, and the trainable LoRA pair B (d×r) and A (r×d). Slide r down — see how the orange rectangles get thin. Slide d up — see what happens at real model sizes (4096 for Llama-7B, 8192 for 70B). Watch the trainable-parameter count fall by orders of magnitude.

d · model dim: 4096
r · LoRA rank: 8
// matrices, drawn to relative scale
// W · frozen pretrained: 16.78M params per layer (d² = 4096²)
// B + A · LoRA trainable: 65.5K params per layer (2·d·r = 2·4096·8)
// the saving: 256× fewer trainable params per layer
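The readout above is simple arithmetic, so it can be reproduced in a few lines. This sketch assumes a square d×d weight matrix, as the visualization does; the function names are made up for illustration.

```python
def full_params(d):
    """Parameters in one full d×d weight matrix."""
    return d * d

def lora_params(d, r):
    """Parameters in the LoRA pair: B (d×r) plus A (r×d)."""
    return 2 * d * r

for d, r in [(4096, 8), (8192, 8), (4096, 64)]:
    full, lora = full_params(d), lora_params(d, r)
    print(f"d={d} r={r}: full={full:,} lora={lora:,} saving={full // lora}x")
```

The saving works out to d / (2r), so it grows with model width and shrinks as you raise the rank.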
Part 03 · Hands on · GPU memory calculator

Pick a model size.
Pick a method. See what fits.

The reason fine-tuning got accessible isn't just fewer trainable params; it's fewer optimizer states. AdamW stores two extra values per trainable param (momentum + variance), so that memory scales with what you train, not with what you load. Full fine-tuning a 70B model needs roughly 1.4 TB. LoRA needs about 160 GB. QLoRA needs about 48 GB, which fits on one GPU.

How memory breaks down for training.

Four categories: model weights (always loaded), LoRA params (only with LoRA), optimizer state (2× trainable params at BF16), activations (depends on batch + seq length). For full fine-tuning, optimizer state dominates. For LoRA, model weights dominate. For QLoRA, even the weights shrink 4× via 4-bit quantization. The ratios matter more than the raw numbers.
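The breakdown above can be sketched as a back-of-the-envelope calculator. Everything numeric here is an assumption stated for illustration: BF16 weights at 2 bytes per parameter, NF4 at 0.5 bytes, a 0.1% trainable fraction for LoRA, BF16 gradients, optimizer states at 2 bytes each (many real setups keep them in FP32, which is 4), and activations left as an input. The totals it prints will differ from the figures quoted above, which depend on precision and activation choices the text doesn't spell out. The ordering between methods is the point.

```python
GB = 1e9

def training_memory_gb(n_params, method, trainable_fraction=0.001,
                       optimizer_bytes=2, activations_gb=0.0):
    """Rough GPU memory breakdown (in GB) for 'full', 'lora', or 'qlora'."""
    if method not in ("full", "lora", "qlora"):
        raise ValueError(f"unknown method: {method}")
    weight_bytes = 0.5 if method == "qlora" else 2   # NF4 vs BF16 base weights
    trainable = n_params if method == "full" else n_params * trainable_fraction
    parts = {
        "weights": n_params * weight_bytes / GB,
        "adapters": 0.0 if method == "full" else trainable * 2 / GB,
        "gradients": trainable * 2 / GB,                    # one BF16 gradient per trainable param
        "optimizer": trainable * 2 * optimizer_bytes / GB,  # AdamW: momentum + variance
        "activations": activations_gb,
    }
    parts["total"] = sum(parts.values())
    return parts

for method in ("full", "lora", "qlora"):
    breakdown = training_memory_gb(70e9, method)
    print(method, {k: round(v, 2) for k, v in breakdown.items()})
```

Passing `optimizer_bytes=4` or a nonzero `activations_gb` shows how quickly the full fine-tuning total climbs while the LoRA and QLoRA totals barely move.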

[Interactive calculator: choose a model size and a fine-tuning method to see the memory breakdown, the total GPU memory in GB, and which GPUs it fits on.]
Part 04 · QLoRA · the 4-bit unlock

Quantize the frozen weights.
Train the adapter in full precision.

// QLoRA · how the precisions mix
[Figure: QLoRA precision mix during training.]
W (frozen): stored as NF4 (4-bit), 35 GB (was 140 GB); no optimizer state
A, B (LoRA): BF16 (16-bit), ~1 GB
Forward / backward pass: W dequantized on the fly, NF4 → BF16, per block; y = Wx + BAx
Gradient: A, B updated; W untouched and stays 4-bit
The unlock: a Llama-70B fine-tune fits on a single 48 GB GPU; previously required ~1.4 TB (multi-node cluster)

4-bit precision for the dead weight you're not changing.

If you're not training W (LoRA freezes it), why keep it in BF16? QLoRA's insight (Dettmers et al., 2023): quantize the frozen base model to 4-bit precision, cutting weight memory 4×, and keep only the small LoRA adapters in 16-bit precision (BF16).

The technical trick is NF4 — "NormalFloat 4-bit" — a number format optimized for the actual distribution of neural network weights (which is roughly normally distributed). Standard 4-bit integer quantization loses too much information. NF4 keeps almost all of it.

During training, the frozen W stays 4-bit in memory. When it's needed for a forward/backward pass, it's dequantized on the fly in small blocks (64 weights per block in the QLoRA paper, each block with its own scale), fast enough that compute isn't the bottleneck. Gradients are computed only for A and B, which stay BF16.
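The blockwise idea can be sketched with the standard library alone. This is a simplified stand-in, not the real NF4 format: the 16 levels here are evenly spaced normal quantiles rescaled to [-1, 1], whereas the actual NF4 table is constructed differently (it includes an exact zero, for instance). The block size is a parameter; the 64 used in the demo and all function names are assumptions for illustration.

```python
import random
from statistics import NormalDist

def quantile_codebook(bits=4):
    """2**bits levels at evenly spaced normal quantiles, rescaled to [-1, 1]."""
    n = 2 ** bits
    levels = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    top = max(abs(v) for v in levels)
    return [v / top for v in levels]

def quantize(weights, codebook, block_size=64):
    """Return (codes, scales): one small integer code per weight, one absmax scale per block."""
    codes, scales = [], []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        scale = max(abs(w) for w in block) or 1.0   # guard against an all-zero block
        scales.append(scale)
        for w in block:
            target = w / scale                      # normalized into [-1, 1]
            codes.append(min(range(len(codebook)),
                             key=lambda i: abs(codebook[i] - target)))
    return codes, scales

def dequantize(codes, scales, codebook, block_size=64):
    """Rebuild approximate weights from codes and per-block scales."""
    return [codebook[c] * scales[i // block_size] for i, c in enumerate(codes)]

random.seed(0)
weights = [random.gauss(0, 0.02) for _ in range(256)]
codebook = quantile_codebook()
codes, scales = quantize(weights, codebook)
restored = dequantize(codes, scales, codebook)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"{len(scales)} blocks, max abs error {max_err:.5f}")
```

Placing levels at normal quantiles puts more of them near zero, where most weights live, which is the intuition behind a format shaped to the weight distribution.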

The result is staggering: Llama-70B fine-tuning, which used to need a multi-node cluster, now fits on a single A6000 (48 GB) or H100 (80 GB) and matches the quality of 16-bit LoRA. The QLoRA paper showed this for models up to 65B without measurable quality loss.

Part 05 · The training recipe

Four knobs that actually matter.

LoRA looks like it has dozens of hyperparameters. In practice, four of them dominate. Set these well, and the rest barely matters.

// 01

Rank r

The size of the adapter matrices. r=8 is the de facto default — works well across most tasks. Use r=16 or 32 if your task involves significant style/format shifts (long-form writing, structured generation). Bigger r usually hurts more than helps — it overfits faster.

Typical values
• Classification: r=4 or 8
• Instruction tuning: r=8 or 16
• Domain-specific style: r=16 or 32
• Almost never need r > 64
// 02

Alpha α

The "loudness" of the LoRA update. The actual contribution to W is (α/r) · BA. Standard recipe: set α = 2r (so the ratio is constant at 2). This decouples r from learning rate sensitivity — you can change r without re-tuning the LR.

Common choice
• α = 16 with r = 8 → ratio of 2
• α = 32 with r = 16 → ratio of 2
• Some recipes use α = r (ratio = 1) instead
// 03

Target modules

Which weight matrices get LoRA adapters attached. Just attaching to attention's Q and V works surprisingly well. Attaching to all linear layers (Q, K, V, O, gate, up, down) gives a bit more quality but 4-7× the trainable params. The original LoRA paper showed Q+V is enough for most cases.

The two recipes
• Minimal: q_proj, v_proj
• Maximal: q, k, v, o, gate, up, down
• Newer recipes prefer "all linear"
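To see where the "4-7×" figure comes from, here is a rough count of trainable parameters for the two recipes. The layer shapes are assumptions that are not given in the text: a Llama-7B-like model with hidden size 4096, MLP size 11008, and 32 layers, using the Hugging Face projection names. Each adapted matrix with shape (in, out) contributes r·(in + out) parameters.

```python
# Assumed Llama-7B-like shapes (illustrative, not taken from the module text).
HIDDEN, MLP, LAYERS = 4096, 11008, 32

SHAPES = {  # (in_features, out_features) for each projection
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

def lora_param_count(targets, r):
    """Total trainable LoRA params across all layers for the given target modules."""
    per_layer = sum(r * (SHAPES[t][0] + SHAPES[t][1]) for t in targets)
    return per_layer * LAYERS

minimal = lora_param_count(["q_proj", "v_proj"], r=8)
maximal = lora_param_count(list(SHAPES), r=8)
print(f"minimal: {minimal:,} params, about {minimal * 2 / 1e6:.1f} MB in BF16")
print(f"maximal: {maximal:,} params, ratio {maximal / minimal:.2f}x")
```

Under these assumptions the all-linear recipe lands a little under 5× the minimal one, inside the quoted range.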
// 04

Learning rate

LoRA can take much higher learning rates than full fine-tuning — typically 10-100× higher. Why: the small number of trainable params makes the loss surface smoother. 1e-4 to 3e-4 is a good default; some recipes go up to 1e-3 for short trainings.

Defaults that usually work
• LoRA LR: 1e-4 to 3e-4
• QLoRA LR: 2e-4 to 5e-4
• Schedule: cosine with 3-5% warmup
• Full FT LR (for comparison): 1e-5 to 5e-5
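The "cosine with 3-5% warmup" schedule can be sketched as a single function. The linear warmup shape, the decay to zero, and the name `lr_at` are assumptions; training libraries ship their own variants with slightly different conventions.

```python
import math

def lr_at(step, total_steps, peak_lr=2e-4, warmup_frac=0.03, min_lr=0.0):
    """Learning rate at a given step: linear warmup, then cosine decay to min_lr."""
    warmup_steps = max(1, round(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps       # ramp up to the peak
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

total = 1000
for step in (0, 29, 30, 500, 999):
    print(step, f"{lr_at(step, total):.2e}")
```

With 1,000 steps and 3% warmup, the rate climbs for the first 30 steps, peaks at `peak_lr`, then follows half a cosine wave down toward zero.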
Part 06 · LoRA in the wild

Where this is
actually shipping.

LoRA went from research paper to industry default in about 18 months. By 2024, "ship a custom model" in practice means "train a LoRA adapter."

Stable Diffusion · 2023

Image model LoRAs

// the consumer breakout moment
Use case: Style transfer, character likeness
Adapter size: 10-200 MB (vs 4 GB model)
Training time: 10-60 min on a single GPU
Distribution: Civitai · thousands of adapters
Why it took off: Small files, easy mixing

The first wave of mass LoRA adoption. Train an adapter on 20-100 images of a style or character, share the tiny adapter file, and anyone can apply it to their Stable Diffusion. Made customization a community phenomenon, not a corporate workflow.

Hugging Face · ongoing

PEFT library

// the standard training stack
What: Parameter-Efficient Fine-Tuning
Methods: LoRA, QLoRA, IA³, prefix tuning, ...
Integration: 3 lines on top of transformers
Combined with: bitsandbytes (4/8-bit), TRL
Stars on GH: ~15k

The de facto Python library for LoRA training. model = get_peft_model(model, lora_config) and you're done. The companion library bitsandbytes handles QLoRA's 4-bit quantization. If you've fine-tuned any open LLM since 2023, you've used PEFT.
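A hedged sketch of what that looks like end to end, using the rank, alpha, and target modules from the recipe section. The `lora_dropout` value, the wrapper name `build_lora_model`, and the choice to import the heavy libraries inside the function (so the settings can be inspected without them installed) are assumptions of this example. Actually running it needs `torch`, `transformers`, `peft`, and, for the 4-bit path, `bitsandbytes` plus a supported GPU.

```python
# Settings from the recipe section; lora_dropout is an assumed value.
LORA_KWARGS = dict(
    r=8,
    lora_alpha=16,                        # alpha = 2r
    target_modules=["q_proj", "v_proj"],  # the minimal recipe
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

def build_lora_model(model_name, quantize_4bit=False):
    """Load a causal LM and wrap it with LoRA adapters (QLoRA-style if quantize_4bit)."""
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

    quant_config = None
    if quantize_4bit:
        quant_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",               # NF4 for the frozen base weights
            bnb_4bit_compute_dtype=torch.bfloat16,   # dequantize to BF16 for compute
        )
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=quant_config,
        torch_dtype=torch.bfloat16,
    )
    if quantize_4bit:
        model = prepare_model_for_kbit_training(model)
    model = get_peft_model(model, LoraConfig(**LORA_KWARGS))
    model.print_trainable_parameters()               # reports trainable vs total params
    return model
```

Calling `build_lora_model(name, quantize_4bit=True)` with the Hub id of a model you have access to would return a model ready to hand to a trainer; the id itself is left to you.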

Multi-tenant

LoRA serving at scale

// thousands of adapters, one base model
Pattern: Hot-swap adapters at inference
Per-customer LoRA: ~10 MB each
Base model: Loaded once, shared
Frameworks: vLLM, S-LoRA, LoRAX
Per-customer cost: Cents · not dollars

The economic killer feature: train one LoRA per customer, host thousands on the same base model. Switch adapters per request. Each tenant gets a "custom model" — the provider hosts ten thousand adapters on a handful of GPUs. This is how AI platforms offer "custom models" affordably.
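The hot-swap pattern can be shown with a toy, single-layer sketch: one shared base matrix, a dictionary of per-tenant adapter pairs, and a lookup per request. The class name `AdapterServer`, the tenant names, and the 2×2 sizes are made up for illustration. Real serving frameworks batch requests across adapters and are far more involved.

```python
def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

class AdapterServer:
    """One shared base weight, many small per-tenant LoRA adapters."""

    def __init__(self, W, alpha):
        self.W = W              # loaded once, shared by every tenant
        self.alpha = alpha
        self.adapters = {}      # tenant name -> (A, B)

    def register(self, tenant, A, B):
        self.adapters[tenant] = (A, B)

    def forward(self, tenant, x):
        y = matvec(self.W, x)                   # base model output
        if tenant in self.adapters:             # pick the adapter for this request
            A, B = self.adapters[tenant]
            scale = self.alpha / len(A)
            delta = matvec(B, matvec(A, x))
            y = [yi + scale * di for yi, di in zip(y, delta)]
        return y

server = AdapterServer(W=[[1.0, 0.0], [0.0, 1.0]], alpha=1)
server.register("acme", A=[[1.0, 0.0]], B=[[0.5], [0.0]])
server.register("globex", A=[[0.0, 1.0]], B=[[0.0], [1.0]])

x = [1.0, 2.0]
print(server.forward("acme", x))     # base output plus acme's update
print(server.forward("globex", x))   # same base weights, different adapter
print(server.forward("unknown", x))  # no adapter registered: base model only
```

The base weights are touched by every request but never copied; only the small (A, B) pair changes from tenant to tenant.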

Open-source models

Llama / Mistral / Qwen ecosystems

// the customization layer
Llama variants on HF: ~100,000+ (most are LoRAs)
Use cases: Domain, language, style, safety
Training cost: Often < $100 on rented GPUs
Quality gap vs full FT: ~1-3% on most benchmarks
Time to train: Hours · not weeks

The reason the open-weight ecosystem exists. Anyone with a GPU can produce a custom version of Llama, Mistral, or Qwen for $50-500, in hours. Combined with model weights being free, this is what makes Hugging Face's model zoo possible.

Part 07 · Knowledge check

Five questions on what
you just tuned.

Aim for 4/5. Wrong answers explain themselves.

Course 3 · Module 10 complete

Custom models are no longer
a frontier-lab luxury.

You saw why low-rank updates work. You computed actual GPU memory for every Llama size. You understand why QLoRA fits a 70B fine-tune on a single GPU. When someone says "we trained our own model," in 2024 they almost certainly mean "we trained a LoRA adapter." Once you've internalized this, the modern customization landscape becomes obvious.

Up next · Course 3 · Module 11

Inference Optimization

Training is one half. Serving the trained model at scale is the other. KV cache, quantization (INT8, FP8, NF4), speculative decoding, batching strategies, prefill vs decode, paged attention. The tricks behind how Anthropic, OpenAI, and Google actually serve billions of tokens per day. Interactive: watch a speculative-decoding draft-and-verify loop run live.
