Course 3 · Module 10 · Fine-tuning

Part 01 · When to fine-tune (and when not to)

Most "we need a custom model"
problems aren't.

Try cheaper things first.

Fine-tuning is expensive engineering that is hard to walk back. Every variant of the model you fine-tune needs its own training run, its own evaluation, and its own deployment slot. Cheaper approaches usually win, and most of the time you don't need to fine-tune at all.

The escalation order: prompting → few-shot examples → tool use → RAG → LoRA → full fine-tuning. Each step adds cost and complexity. Stop at the first one that meets your quality bar.

When does fine-tuning genuinely win? Style and tone the model can't pick up from examples. Specialized vocabulary (medical, legal, internal jargon). Output format consistency for structured outputs at scale. Latency-critical inference where you can't afford long prompts. And — increasingly rarely — domain knowledge a frontier model genuinely lacks.

What fine-tuning doesn't fix: hallucinations, factual recall, reasoning errors, knowledge cutoffs. If your problem is "the model doesn't know X," the answer is usually retrieval, not fine-tuning.

// the escalation order

Part 02 · Hands on · The rank trick

Most weight updates
are low rank.

LoRA's bet (Hu et al., 2021): when you fine-tune a model on a new task, the change to each weight matrix has low intrinsic rank, meaning it can be approximated well by the product of two much smaller matrices. So instead of training the full W, freeze it and train two small matrices (B and A) whose product BA stands in for the update. Nearly the same expressive change, with 100-1000× fewer trainable parameters.
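A minimal NumPy sketch of the low-rank idea above, using toy sizes (d = 64, r = 4) chosen here purely for illustration. It checks that the product BA has the same shape as W but rank at most r, and counts what each option would train. The random values are placeholders, not trained weights.

```python
import numpy as np

rng = np.random.default_rng(0)

d, r = 64, 4  # toy sizes (assumption); real models use d in the thousands

# The two small trainable matrices: B is d x r, A is r x d.
B = rng.normal(size=(d, r))
A = rng.normal(size=(r, d))

# Their product has the same shape as W (d x d) but rank at most r.
delta_W = B @ A
rank = int(np.linalg.matrix_rank(delta_W))

full_params = d * d            # training the full d x d update directly
lora_params = B.size + A.size  # training only B and A

print(f"update shape: {delta_W.shape}, rank: {rank}")
print(f"full: {full_params} params, LoRA: {lora_params} params")
```

The update can only move W within an r-dimensional subspace, which is the restriction LoRA bets is acceptable.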

What you're tuning below.

Two knobs: d (model dim) and r (LoRA rank). The visualization shows W as a frozen blue square, and the trainable LoRA pair B (d×r) and A (r×d). Slide r down — see how the orange rectangles get thin. Slide d up — see what happens at real model sizes (4096 for Llama-7B, 8192 for 70B). Watch the trainable-parameter count fall by orders of magnitude.

d · model dim: 4096
r · LoRA rank: 8

// matrices, drawn to relative scale

W · frozen pretrained: 16.78M params per layer (d² = 4096²)
B + A · LoRA trainable: 65.5K params per layer (2·d·r = 2·4096·8)
The saving: 256× fewer trainable params per layer
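The arithmetic behind those three numbers can be reproduced in a few lines. This is a sketch for a single square d × d matrix, as in the visualization; `lora_param_counts` is a hypothetical helper name, not part of any library.

```python
def lora_param_counts(d: int, r: int) -> tuple[int, int, float]:
    """Return (frozen params in W, trainable params in B and A, saving ratio)."""
    frozen = d * d         # W is d x d
    trainable = 2 * d * r  # B is d x r, A is r x d
    return frozen, trainable, frozen / trainable


frozen, trainable, ratio = lora_param_counts(d=4096, r=8)
print(frozen, trainable, ratio)  # 16777216 65536 256.0
```

The ratio simplifies to d / (2r), so doubling the model dimension at a fixed rank doubles the saving.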

Part 03 · Hands on · GPU memory calculator

Pick a model size.
Pick a method. See what fits.

The reason fine-tuning got accessible isn't just fewer trainable params. It's fewer optimizer states. AdamW stores two extra values per trainable param (momentum and variance). Full fine-tuning a 70B model needs roughly 1.4 TB. LoRA needs roughly 160 GB. QLoRA needs roughly 48 GB, which fits on one GPU.

How memory breaks down for training.

Four categories: model weights (always loaded), LoRA params (only with LoRA), optimizer state (2× trainable params at BF16), activations (depends on batch + seq length). For full fine-tuning, optimizer state dominates. For LoRA, model weights dominate. For QLoRA, even the weights shrink 4× via 4-bit quantization. The ratios matter more than the raw numbers.
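A rough sketch of that four-category accounting, under loudly stated assumptions: BF16 weights cost 2 bytes each, 4-bit weights cost 0.5 bytes, optimizer state is two BF16 values per trainable parameter, and activations are passed in by the caller because they depend on batch size and sequence length. `estimate_memory` and the 200M adapter size are hypothetical. This simplified model ignores gradients and FP32 master weights, so it will not reproduce the totals quoted above, which presumably come from a fuller accounting.

```python
from dataclasses import dataclass

GB = 1e9  # decimal gigabytes


@dataclass
class MemoryEstimate:
    weights_gb: float
    adapter_gb: float
    optimizer_gb: float
    activations_gb: float

    @property
    def total_gb(self) -> float:
        return (self.weights_gb + self.adapter_gb
                + self.optimizer_gb + self.activations_gb)


def estimate_memory(n_params: float, method: str,
                    adapter_params: float = 0.0,
                    activations_gb: float = 0.0) -> MemoryEstimate:
    """Simplified estimate; method is 'full', 'lora', or 'qlora'."""
    if method not in {"full", "lora", "qlora"}:
        raise ValueError(f"unknown method: {method}")
    weight_bytes = 0.5 if method == "qlora" else 2.0  # 4-bit vs BF16
    trainable = n_params if method == "full" else adapter_params
    return MemoryEstimate(
        weights_gb=n_params * weight_bytes / GB,
        adapter_gb=0.0 if method == "full" else adapter_params * 2.0 / GB,
        optimizer_gb=2 * trainable * 2.0 / GB,  # two BF16 states per param
        activations_gb=activations_gb,
    )


for method in ("full", "lora", "qlora"):
    est = estimate_memory(70e9, method, adapter_params=200e6)
    print(f"{method:>5}: {est.total_gb:8.1f} GB before activations")
```

Even in this stripped-down form the pattern from the text shows up: optimizer state dominates for full fine-tuning, and frozen weights dominate for LoRA and QLoRA.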

Part 04 · QLoRA · the 4-bit unlock

Quantize the frozen weights.
Train the adapter in full precision.

// QLoRA · how the precisions mix

4-bit precision for the dead weight you're not changing.

If you're not training W (LoRA freezes it), why keep it in BF16? QLoRA's insight (Dettmers et al., 2023): quantize the frozen base model to 4-bit precision, cutting weight memory roughly 4× relative to BF16, and keep only the small LoRA adapters in 16-bit precision.

The technical trick is NF4 — "NormalFloat 4-bit" — a number format optimized for the actual distribution of neural network weights (which is roughly normally distributed). Standard 4-bit integer quantization loses too much information. NF4 keeps almost all of it.

During training, the frozen W stays 4-bit in memory. When it is needed for a forward or backward pass, it is dequantized on the fly, block by block (the QLoRA paper quantizes weights in blocks of 64, each with its own scale), fast enough that compute isn't the bottleneck. Gradients only flow through A and B, which stay in BF16.
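To make block-wise quantization concrete, here is a NumPy sketch of the general idea. It is not the real NF4 implementation: the 16-level codebook is built from evenly spaced normal quantiles as a stand-in for the exact NF4 table, and the indices are stored one per byte rather than packed two per byte. `block_size` is a parameter; the value 64 is an assumption taken from the QLoRA paper's weight quantization setting. All function names are hypothetical.

```python
from statistics import NormalDist

import numpy as np


def normal_codebook(bits: int = 4) -> np.ndarray:
    """NF4-like levels: normal quantiles rescaled to [-1, 1] (not the exact NF4 table)."""
    n = 2 ** bits
    probs = (np.arange(n) + 0.5) / n
    levels = np.array([NormalDist().inv_cdf(float(p)) for p in probs])
    return levels / np.abs(levels).max()


def quantize_blockwise(w, codebook, block_size=64):
    """Return (codebook indices, per-block absmax scales). w.size must divide by block_size."""
    blocks = w.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True)
    scales = np.where(scales == 0, 1.0, scales)  # avoid dividing by zero
    normalized = blocks / scales                 # each block now lies in [-1, 1]
    # Pick the nearest codebook level for every weight.
    idx = np.abs(normalized[..., None] - codebook).argmin(axis=-1)
    return idx.astype(np.uint8), scales


def dequantize_blockwise(idx, scales, codebook, shape):
    """Look up each level, rescale by its block's scale, restore the shape."""
    return (codebook[idx] * scales).reshape(shape)


rng = np.random.default_rng(0)
w = rng.normal(size=(128, 64))  # toy stand-in for a frozen weight matrix
codebook = normal_codebook()
idx, scales = quantize_blockwise(w, codebook)
w_hat = dequantize_blockwise(idx, scales, codebook, w.shape)
print("mean abs error:", float(np.abs(w - w_hat).mean()))
```

Placing more levels near zero, where most normally distributed weights sit, is what lets a 16-entry codebook lose less than evenly spaced integer levels would.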

The result is striking: Llama-70B fine-tuning, which used to need a multi-node cluster, now fits on a single A6000 (48 GB) or H100 (80 GB), with quality close to 16-bit LoRA. The QLoRA paper reported results for models up to 65B with no measurable quality loss on its benchmarks.

Part 05 · The training recipe

Four knobs that actually matter.

LoRA looks like it has dozens of hyperparameters. In practice, four of them dominate. Set these well, and the rest barely matters.

// 01

Rank `r`

The size of the adapter matrices. r=8 is the de facto default — works well across most tasks. Use r=16 or 32 if your task involves significant style/format shifts (long-form writing, structured generation). Bigger r usually hurts more than helps — it overfits faster.

Typical values
• Classification: r=4 or 8
• Instruction tuning: r=8 or 16
• Domain-specific style: r=16 or 32
• Almost never need r > 64

// 02

Alpha `α`

The "loudness" of the LoRA update. The actual contribution to W is (α/r) · BA. Standard recipe: set α = 2r (so the ratio is constant at 2). This decouples r from learning rate sensitivity — you can change r without re-tuning the LR.

Common choice
• α = 16 with r = 8 → ratio of 2
• α = 32 with r = 16 → ratio of 2
• Some recipes use α = r (ratio = 1) instead
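The scaled update (α/r) · BA can be sketched as a small NumPy layer. `LoRALinear` is a hypothetical class written for this example, not a library API. The initialization (small random A, zero B) follows the LoRA paper, so the adapter contributes nothing before training starts.

```python
import numpy as np


class LoRALinear:
    """Frozen weight W plus a scaled low-rank update (alpha / r) * B @ A."""

    def __init__(self, W, r, alpha, rng):
        d_out, d_in = W.shape
        self.W = W                                       # frozen, never updated
        self.A = rng.normal(scale=0.01, size=(r, d_in))  # small random init
        self.B = np.zeros((d_out, r))                    # zero init: update starts at 0
        self.scale = alpha / r

    def forward(self, x):
        # x has shape (batch, d_in): base path plus scaled adapter path.
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

    def merged_weight(self):
        # Fold the adapter into W for deployment with no extra latency.
        return self.W + self.scale * (self.B @ self.A)


rng = np.random.default_rng(0)
W = rng.normal(size=(16, 16))
layer = LoRALinear(W, r=8, alpha=16, rng=rng)
x = rng.normal(size=(2, 16))

print(layer.scale)                               # 2.0
print(np.allclose(layer.forward(x), x @ W.T))    # True: B is still zero
```

Because the adapter path is linear, the merged weight gives the same outputs as the two-path forward pass once B has been trained.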

// 03

Target modules

Which weight matrices get LoRA adapters attached. Just attaching to attention's Q and V works surprisingly well. Attaching to all linear layers (Q, K, V, O, gate, up, down) gives a bit more quality but 4-7× the trainable params. The original LoRA paper showed Q+V is enough for most cases.

The two recipes
• Minimal: q_proj, v_proj
• Maximal: q, k, v, o, gate, up, down
• Newer recipes prefer "all linear"
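A quick way to see the cost difference between the two recipes is to count adapter parameters per target set. The shapes below are assumptions matching a Llama-7B-style block (hidden size 4096, MLP intermediate size 11008, 32 layers); `lora_params` is a hypothetical helper.

```python
# Llama-7B-like shapes (assumption): hidden 4096, MLP intermediate 11008, 32 layers.
HIDDEN, INTERMEDIATE, LAYERS = 4096, 11008, 32

# (d_in, d_out) for each linear projection in one transformer block.
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(targets, r):
    """Total adapter params: each target gets A (r x d_in) and B (d_out x r)."""
    per_layer = sum(r * (SHAPES[t][0] + SHAPES[t][1]) for t in targets)
    return per_layer * LAYERS


minimal = lora_params(["q_proj", "v_proj"], r=8)
maximal = lora_params(list(SHAPES), r=8)
print(minimal, maximal, round(maximal / minimal, 2))  # 4194304 19988480 4.77
```

Under these assumed shapes the all-linear recipe trains about 4.8× as many parameters, at the low end of the 4-7× range quoted above.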

// 04

Learning rate

LoRA can take much higher learning rates than full fine-tuning, typically 10-100× higher. A common intuition is that the small number of trainable params makes the loss surface better behaved, though this is a rule of thumb rather than a proven result. 1e-4 to 3e-4 is a good default; some recipes go up to 1e-3 for short training runs.

Defaults that usually work
• LoRA LR: 1e-4 to 3e-4
• QLoRA LR: 2e-4 to 5e-4
• Schedule: cosine with 3-5% warmup
• Full FT LR (for comparison): 1e-5 to 5e-5
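The "cosine with 3-5% warmup" schedule can be written out in plain Python. This is one common formulation (linear warmup from zero, then cosine decay to zero), given here as a sketch; real training libraries may differ in details such as the final learning rate. `lr_at` and its defaults are hypothetical.

```python
import math


def lr_at(step: int, total_steps: int,
          peak_lr: float = 2e-4, warmup_frac: float = 0.03) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero."""
    warmup_steps = max(1, round(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


# With 1000 steps and 3% warmup, the peak is reached at step 30.
for step in (0, 15, 30, 515, 1000):
    print(step, f"{lr_at(step, total_steps=1000):.2e}")
```

The short warmup keeps the earliest updates small while the optimizer's moment estimates settle, and the cosine tail lowers the rate smoothly toward the end of training.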

Part 06 · LoRA in the wild

Where this is
actually shipping.

LoRA went from research paper to industry default in about 18 months. By 2024, "ship a custom model" in practice means "train a LoRA adapter."

Stable Diffusion · 2023

Image model LoRAs

// the consumer breakout moment

Use case: Style transfer, character likeness
Adapter size: 10-200 MB (vs 4 GB model)
Training time: 10-60 min on a single GPU
Distribution: Civitai · thousands of adapters
Why it took off: Small files, easy mixing

The first wave of mass LoRA adoption. Train an adapter on 20-100 images of a style or character, share the tiny adapter file, and anyone can apply it to their Stable Diffusion. Made customization a community phenomenon, not a corporate workflow.

Hugging Face · ongoing

PEFT library

// the standard training stack

What: Parameter-Efficient Fine-Tuning
Methods: LoRA, QLoRA, IA³, prefix tuning, ...
Integration: 3 lines on top of transformers
Combined with: bitsandbytes (4/8-bit), TRL
Stars on GitHub: ~15k

The de facto Python library for LoRA training. model = get_peft_model(model, lora_config) and you're done. The companion library bitsandbytes handles QLoRA's 4-bit quantization. If you've fine-tuned any open LLM since 2023, you've used PEFT.

Multi-tenant

LoRA serving at scale

// thousands of adapters, one base model

Pattern: Hot-swap adapters at inference
Per-customer LoRA: ~10 MB each
Base model: Loaded once, shared
Frameworks: vLLM, S-LoRA, LoRAX
Per-customer cost: Cents · not dollars

The economic killer feature: train one LoRA per customer, host thousands on the same base model. Switch adapters per request. Each tenant gets a "custom model" — the provider hosts ten thousand adapters on a handful of GPUs. This is how AI platforms offer "custom models" affordably.
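The hot-swap pattern can be illustrated with a toy NumPy sketch: one shared base weight and a dictionary of per-tenant adapter pairs selected per request. Everything here (the sizes, the tenant IDs, the `serve` function) is hypothetical, and production systems such as the frameworks listed above batch and schedule this far more carefully.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 32, 4, 8  # toy sizes (assumption)

W = rng.normal(size=(d, d))  # shared base weight, loaded once


def make_adapter():
    # Random stand-ins for a tenant's trained (B, A) pair.
    B = rng.normal(scale=0.1, size=(d, r))
    A = rng.normal(scale=0.1, size=(r, d))
    return B, A


adapters = {"tenant_a": make_adapter(), "tenant_b": make_adapter()}


def serve(x, tenant_id):
    """Run the shared base layer, then add the tenant's adapter if one exists."""
    out = x @ W.T
    adapter = adapters.get(tenant_id)
    if adapter is not None:
        B, A = adapter
        out = out + (alpha / r) * (x @ A.T) @ B.T
    return out


x = rng.normal(size=(1, d))
print(np.allclose(serve(x, "tenant_a"), serve(x, "tenant_b")))  # False
print(np.allclose(serve(x, "unknown"), x @ W.T))                # True
```

Each extra tenant adds only 2·d·r parameters per adapted matrix, while the d × d base weight is stored once, which is why the per-customer cost stays small.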

Open-source models

Llama / Mistral / Qwen ecosystems

// the customization layer

Llama variants on HF: ~100,000+ (most are LoRAs)
Use cases: Domain, language, style, safety
Training cost: Often < $100 on rented GPUs
Quality gap vs full FT: ~1-3% on most benchmarks
Time to train: Hours · not weeks

The reason the open-weight ecosystem exists. Anyone with a GPU can produce a custom version of Llama, Mistral, or Qwen for $50-500, in hours. Combined with model weights being free, this is what makes Hugging Face's model zoo possible.

Course 3 · Module 10 complete

Custom models are no longer
a frontier-lab luxury.

You saw why low-rank updates work. You computed actual GPU memory for every Llama size. You understand why QLoRA fits a 70B fine-tune on a single GPU. When someone says "we trained our own model," in 2024 they almost certainly mean "we trained a LoRA adapter." Once you've internalized this, the modern customization landscape becomes obvious.

Up next · Course 3 · Module 11

Inference Optimization

Training is one half. Serving the trained model at scale is the other. KV cache, quantization (INT8, FP8, NF4), speculative decoding, batching strategies, prefill vs decode, paged attention. The tricks behind how Anthropic, OpenAI, and Google actually serve billions of tokens per day. Interactive: watch a speculative-decoding draft-and-verify loop run live.

Continue to Module 11

0.1%
of the params.
95% of the gain.
