AI Skill Course Course 3 · Expert
Module 03 of 14
Course 3 · Module 03 · 85 minutes

The model they trained is not what you talk to.

A raw pre-trained model is shockingly weird. It doesn't follow instructions. It doesn't refuse harmful requests. It doesn't even know it's an "assistant." Three additional training steps (SFT, reward modeling, and RL fine-tuning, the last two together known as RLHF) turn that strange base into Claude, GPT-4, Gemini. This module unpacks what each stage actually does, and you'll feel the difference yourself.

You'll compare: 3 stages side by side
You'll annotate: 6 preference pairs
You'll see: your own reward signal
Trace the pipeline
The post-training pipeline
01 · pre-train · trillions of tokens · base model "completes text" · approximate cost $100M+
02 · SFT · ~10k instructions · instruct model "follows orders" · approximate cost $1M
03 · RLHF · preferences · chat model "is helpful" · approximate cost $500k
Part 01 · The post-training pipeline

Three stages.
Each fixes a specific problem.

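Each stage takes the model from the previous stage and adds a new capability. Compute cost falls sharply after pre-training: by the approximate figures used here, about 100× from pre-training to SFT, then about 2× more from SFT to RLHF. Meanwhile, the impact on final behavior grows. The most expensive stage is the first; the most important to user experience is the last.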

// Stage 01

Pre-training

Next-token prediction
data: ~15T tokens (web, books, code)
cost: ~$100M · months on GPUs

The model learns the patterns of language by predicting the next token, trillions of times, on a large share of the publicly available digitized text. Result: a giant pattern-completer. It can finish your sentence, but it can't answer your question.

// Stage 02

SFT

Supervised Fine-Tuning
data: ~10k–100k examples
cost: ~$1M · days on GPUs

Fine-tune on instruction → ideal response pairs written by human experts. The model learns "when someone asks me a question, I answer it." Result: it now responds, but it may not respond well.

// Stage 03

RLHF / DPO

Preference alignment
data: ~100k preference pairs
cost: ~$500k · days on GPUs

Humans pick which of two responses is better. The model learns to produce responses humans prefer. Result: helpful, harmless, honest — and the personality you actually interact with.

Part 02 · Hands on · Same prompt, three models

Watch the same input
produce three different responses.

Pick a prompt. See exactly what a raw base model, an SFT-tuned model, and an RLHF-aligned model would each produce. The differences are the entire reason post-training exists.

How to read it.

These responses are representative of real model outputs at each training stage — based on published examples and well-documented behaviors. Notice how the same prompt elicits dramatically different responses depending on which training stage we stop at. Base tries to complete text patterns. SFT follows the instruction. RLHF produces the response humans actually prefer.

// Pick a prompt
User prompt
Base · Pre-trained only
SFT · + Instruction tuning
RLHF · + Preference alignment
Part 03 · Pre-training

Predict the next token.
A trillion times.

The objective is shockingly simple.

For every text in the training data, scan through token by token and ask the model: given everything you've seen so far, what comes next? Compare the model's prediction to the actual next token, compute the loss, update the weights, move on.

That's it. The entire pre-training objective is one equation: maximize the probability of the next token given all previous tokens. No labels, no curated answers, no human feedback — just text in, predict, learn.
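The objective above can be sketched in a few lines. This is a minimal illustration, not how any particular lab implements it: the function name `next_token_loss` and the toy probabilities are made up for the example, and a real trainer computes the same quantity over batches with an autodiff framework.

```python
import math

def next_token_loss(probs_per_step, target_ids):
    """Average negative log-likelihood of the true next tokens.

    probs_per_step[t] is the model's predicted distribution over the
    vocabulary at position t; target_ids[t] is the token that actually
    came next in the training text.
    """
    nll = [-math.log(p[t]) for p, t in zip(probs_per_step, target_ids)]
    return sum(nll) / len(nll)

# Toy vocabulary of 3 tokens, two prediction steps (illustrative numbers).
probs = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
targets = [0, 2]
loss = next_token_loss(probs, targets)
print(round(loss, 4))  # lower is better; 0 would mean perfect prediction
```

Minimizing this loss is the same as maximizing the probability assigned to the actual next token, which is why no labels beyond the text itself are needed.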

The training data is ~everything: web crawls, Wikipedia, code from GitHub, books, scientific papers, forum posts. For frontier models, ~15 trillion tokens. That's roughly the equivalent of every word a human would read in 30,000 lifetimes.

From this, the model picks up syntax, grammar, factual knowledge, reasoning patterns, style — but not the goal of being helpful. It learned to mimic text. It didn't learn what text it should produce.

// Anatomy of pre-training
Objective: next-token prediction
Training data: ~15T tokens
Equivalent text: All of the public web
Duration: ~3-6 months
Hardware: ~10k+ GPUs (H100s)
Cost: $50M – $1B+
Result: "Base" / "foundation" model
Public examples: Llama-3-base, Mistral-base
Part 04 · Instruction tuning (SFT)

Teach it to respond.

// SFT training data · examples
Instruction: Explain photosynthesis in one sentence.
Response: Photosynthesis is the process by which plants convert sunlight, water, and CO₂ into glucose and oxygen using chlorophyll.

Instruction: Write a Python function to reverse a string.
Response: def reverse_string(s): return s[::-1]

Instruction: What are the main causes of WWI?
Response: The main causes were militarism, alliances, imperialism, and nationalism (often summarized as MAIN). The assassination of Archduke Franz Ferdinand in 1914 was the immediate trigger.

Instruction: Translate "good morning" to French.
Response: "Bonjour" (literally "good day") is the standard greeting. "Bonne matinée" exists but is rarely used as a greeting.

From completer to responder.

Show the model thousands of examples of the form (instruction, ideal response). Human contractors — typically with domain expertise — write the ideal responses by hand. The model is fine-tuned on these pairs using the same next-token objective as pre-training.
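A rough sketch of what "the same next-token objective" looks like on an (instruction, ideal response) pair. One assumption not stated in this module: many SFT setups compute the loss only on the response tokens and mask out the instruction, so that is what is shown here. The name `sft_loss` and the numbers are illustrative.

```python
import math

def sft_loss(token_probs, is_response):
    """Mean negative log-likelihood over response tokens only.

    token_probs[i]: probability the model assigned to the correct token i.
    is_response[i]: True if token i belongs to the ideal response,
                    False if it belongs to the instruction (masked out).
    """
    nll = [-math.log(p) for p, keep in zip(token_probs, is_response) if keep]
    return sum(nll) / len(nll)

# Two instruction tokens followed by two response tokens (toy values).
token_probs = [0.9, 0.5, 0.25, 0.5]
is_response = [False, False, True, True]
loss = sft_loss(token_probs, is_response)
print(round(loss, 4))
```

With the mask, the model is graded on how well it writes the answer, not on how well it predicts the user's question.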

After SFT, the model has learned the format of being an assistant: when given a user instruction, produce a response that addresses it. The dataset is small relative to pre-training (10k–100k examples) but laser-targeted.

What SFT cannot teach: nuance. The model now responds, but it might respond unhelpfully, verbosely, or harmfully. It does what the data shows, no more, no less. Fixing that needs Stage 3.

Part 05 · Hands on · You as the labeler

Pick the better response.
Six times. Then look at what you taught us.

RLHF starts with humans rating pairs of model responses. Below: you become the labeler. After 6 pairs, we'll show you your implicit "reward signal" — a profile of what you preferred — and explain how this exact data shapes Claude's personality.

How it works.

You'll see six prompts, each with two candidate responses. Pick the one you'd rather receive. No "right" answers — your preferences matter. At the end, we'll show you the patterns. This is exactly what RLHF labelers do, except at scale (hundreds of thousands of pairs).


Your reward signal

Based on your 6 picks, here's what you implicitly told the model. This is the data RLHF turns into behavior.

What just happened

You labeled 6 preference pairs. A real RLHF dataset can have on the order of 100,000 such pairs, annotated by hundreds of humans. From this data, a reward model is trained to score any new (prompt, response) pair the way humans would. Then the language model is fine-tuned with reinforcement learning to maximize this reward.

The result: a model that produces responses your reward profile would approve of. Different labeler pools → different model personalities. This is one reason Claude, ChatGPT, and Gemini feel different even when their architectures and pre-training are similar.

Part 06 · How RLHF actually works

Three substeps.
One loop.

RLHF is itself a three-step process. You just completed step 1 (collect preferences). Here's what happens next.

01

Collect preferences

Human labeling at scale

For each prompt, generate 2-4 candidate responses. Show pairs to human labelers. Labelers pick the better one. Build a dataset of (prompt, chosen, rejected) triples.

Output: ~100k preference triples. Annotators are trained on rubrics covering helpfulness, harmlessness, and honesty (often called the HHH or 3H framework).
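As a small illustration of the data this step produces, here is one way a single labeler decision could be turned into a (prompt, chosen, rejected) triple. The class and function names are hypothetical, chosen only for this sketch.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PreferenceTriple:
    prompt: str
    chosen: str
    rejected: str

def to_triple(prompt, response_a, response_b, labeler_pick):
    """Convert one labeler decision ("A" or "B") into a preference triple."""
    if labeler_pick not in ("A", "B"):
        raise ValueError("labeler_pick must be 'A' or 'B'")
    if labeler_pick == "A":
        return PreferenceTriple(prompt, chosen=response_a, rejected=response_b)
    return PreferenceTriple(prompt, chosen=response_b, rejected=response_a)

triple = to_triple(
    "Explain photosynthesis in one sentence.",
    "Plants eat light.",
    "Plants use sunlight, water, and CO2 to make glucose and oxygen.",
    labeler_pick="B",
)
print(triple.chosen)
```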
02

Train reward model

Predict human preferences

Take a copy of the SFT model, replace the LM head with a single scalar output. Train it to predict which of two responses a human would prefer, using the labeled pairs.

Loss: −log σ(r(chosen) − r(rejected)). Minimizing it pushes the chosen response's score above the rejected one's. After training, this RM can score any (prompt, response) pair, automating the human's job.
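The pairwise reward-model loss is commonly written as the negative log-sigmoid of the score margin between the chosen and rejected responses. The sketch below computes it for plain floats; `pairwise_rm_loss` is an illustrative name, and a real reward model would produce the two scores with a neural network and backpropagate through this loss.

```python
import math

def pairwise_rm_loss(r_chosen, r_rejected):
    """-log(sigmoid(r_chosen - r_rejected)), computed in a numerically stable way."""
    margin = r_chosen - r_rejected
    # -log(sigmoid(m)) equals softplus(-m) = max(-m, 0) + log(1 + exp(-|m|))
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

print(pairwise_rm_loss(2.0, 0.0))  # small loss: RM agrees with the labeler
print(pairwise_rm_loss(0.0, 2.0))  # large loss: RM ranks the pair the wrong way
```

When the two scores are equal the loss is log 2, and it shrinks toward zero as the chosen response's score pulls ahead.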
03

RL fine-tuning (PPO)

Optimize the LM against the RM

Use the reward model as an automated judge. The LM generates responses, the RM scores them, and PPO updates the LM to produce higher-scoring responses. A KL penalty prevents the LM from drifting too far from the SFT model.

Objective: max E[r(x, y) − β·KL(π||π_SFT)]. β controls how much the model can stray. Output: your final chat model.
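To make the role of β concrete, here is a sketch of the quantity being maximized for one sampled response. It uses a common single-sample estimate of the KL term, the difference in log-probability of the response under the current policy and under the SFT model; that estimator, the function name, and the numbers are assumptions for illustration, and the PPO update itself is omitted.

```python
def penalized_reward(rm_score, logp_policy, logp_sft, beta=0.1):
    """Single-sample estimate of r(x, y) - beta * KL(pi || pi_SFT).

    logp_policy / logp_sft: total log-probability of the sampled response
    under the current policy and under the frozen SFT model.
    """
    kl_estimate = logp_policy - logp_sft
    return rm_score - beta * kl_estimate

# Same RM score, but the second response is much more likely under the
# policy than under the SFT model, so it pays a larger penalty.
close_to_sft = penalized_reward(1.0, logp_policy=-12.0, logp_sft=-12.5)
far_from_sft = penalized_reward(1.0, logp_policy=-4.0, logp_sft=-12.0)
print(close_to_sft, far_from_sft)
```

A larger β keeps the chat model closer to its SFT starting point; a smaller β lets it chase reward more aggressively.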
Part 07 · The modern recipe book

RLHF is no longer the only
game in town.

Since the original 2022 papers, the alignment recipe has evolved. Two big ideas to know about:

2023 onwards

DPO

// Direct Preference Optimization

The mathematical trick: you don't actually need a separate reward model. The preference data can be used to optimize the language model directly via a clever reformulation of the RLHF objective.

Practical impact: simpler pipeline (no RM, no PPO), often comparable or better results, faster training. Many open-weight models (Llama 3, Mixtral) now use DPO in their post-training, in some cases alongside or instead of PPO-based RLHF.

RLHF (3 steps):
prefs → RM → PPO → final model
DPO (1 step):
prefs → direct LM update → final model
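A sketch of the DPO loss for a single preference pair, following the form given in the DPO paper: the policy is rewarded for raising the chosen response's log-probability relative to a frozen reference model more than it raises the rejected one's. The function and argument names are illustrative, and the inputs here are plain floats rather than model outputs.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """-log(sigmoid(beta * (chosen_gain - rejected_gain))) for one pair."""
    chosen_gain = logp_chosen - ref_logp_chosen        # policy vs. reference
    rejected_gain = logp_rejected - ref_logp_rejected
    margin = beta * (chosen_gain - rejected_gain)
    # Stable -log(sigmoid(margin)) = softplus(-margin)
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Policy identical to the reference: loss is log 2.
start = dpo_loss(-10.0, -11.0, -10.0, -11.0)
# Policy has shifted probability toward the chosen response: loss drops.
improved = dpo_loss(-8.0, -13.0, -10.0, -11.0)
print(start, improved)
```

Note that no reward model appears anywhere: the log-probability ratios against the reference model play that role implicitly.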
Anthropic, 2022

Constitutional AI

// RLAIF — RL from AI Feedback

The labor-saving trick: use AI to do the labeling. Give the AI a constitution — a set of principles describing good behavior. The AI critiques its own responses against the constitution and revises them. Then a model is trained on these AI-generated preferences.

Practical impact: scale labeling without armies of humans. Less expensive, more consistent, lets labs encode explicit values. This is core to how Claude is trained.
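The AI-labeling idea can be sketched as follows. Everything here is hypothetical scaffolding: `judge` stands in for a call to a language model, `toy_judge` is a hard-coded stub so the example runs, and the prompt wording is invented for illustration rather than taken from any published constitution.

```python
def ai_preference(prompt, response_a, response_b, principle, judge):
    """Ask a judge model which response better follows a principle.

    judge: hypothetical callable taking a question string and returning
    "A" or "B". Returns a (prompt, chosen, rejected) record as a dict.
    """
    question = (
        f"Principle: {principle}\n"
        f"Prompt: {prompt}\n"
        f"(A) {response_a}\n"
        f"(B) {response_b}\n"
        "Which response better follows the principle? Answer A or B."
    )
    pick = judge(question).strip().upper()
    if pick not in ("A", "B"):
        raise ValueError(f"unexpected judge answer: {pick!r}")
    if pick == "A":
        return {"prompt": prompt, "chosen": response_a, "rejected": response_b}
    return {"prompt": prompt, "chosen": response_b, "rejected": response_a}

def toy_judge(question):
    # Stand-in for a real model call: always answers "B".
    return "B"

example = ai_preference(
    "How do I reset my password?",
    "Figure it out yourself.",
    "Open Settings, then choose Reset password.",
    "Prefer the response that is more helpful and honest",
    toy_judge,
)
print(example["chosen"])
```

The output has the same shape as a human-labeled preference record, which is what lets AI-generated preferences feed the same downstream training.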

Human labelers say:
"Response A feels better than B"
Constitution says:
"Prefer the response that is more helpful and honest"
Part 08 · Knowledge check

Five questions on what
you just aligned.

Aim for 4/5. Wrong answers explain themselves.

Course 3 · Module 03 complete

The personality you talk to
is a trained one.

You now know what each post-training stage does — and what it doesn't. You felt the difference between a base, SFT, and RLHF model. You generated your own reward signal. Most importantly: you understand that the "AI" you chat with is a layered artifact, built on top of a strange, raw, internet-trained base by carefully selected human (and sometimes AI) preferences.

Up next · Course 3 · Module 04

Mixture of Experts

Why are models like Mixtral (and, reportedly, GPT-4) so much more efficient than their parameter counts suggest? Because they're not single dense networks; they're sparse mixtures. Each token gets routed through only a few "experts" out of many. Interactive: visualize the router deciding which experts to activate per token.
