Course 3 · Module 03 · 85 minutes

The model they trained is not what you talk to.

A raw pre-trained model is shockingly weird. It doesn't follow instructions. It doesn't refuse harmful requests. It doesn't even know it's an "assistant." Post-training turns that strange base into Claude, GPT-4, or Gemini through supervised fine-tuning (SFT) followed by preference alignment (reward modeling plus RLHF). This module unpacks what each stage actually does, and you'll feel the difference yourself.

You'll compare

3 stages side by side

You'll annotate

6 preference pairs

You'll see

Your own reward signal

Trace the pipeline

Part 01 · The post-training pipeline

Three stages.
Each fixes a specific problem.

Each stage takes the model from the previous stage and adds a new capability. Cost falls steeply after pre-training (roughly 100× from Stage 1 to Stage 2, then by about half again for Stage 3), while the impact on final behavior grows. The most expensive stage is the first; the most important to user experience is the last.

// Stage 01

Pre-training

Next-token prediction

data: ~15T tokens (web, books, code)

cost: ~$100M · months on GPUs

The model learns the patterns of language by predicting the next token, trillions of times, on a large share of the text that has ever been digitized and made public. Result: a giant pattern-completer. It can finish your sentence, but it can't answer your question.

→

// Stage 02

SFT

Supervised Fine-Tuning

data: ~10k–100k examples

cost: ~$1M · days on GPUs

Fine-tune on (instruction → ideal response) pairs written by human experts. The model learns "when someone asks me a question, I answer it." Result: it now responds, but it may not respond well.

→

// Stage 03

RLHF / DPO

Preference alignment

data: ~100k preference pairs

cost: ~$500k · days on GPUs

Humans pick which of two responses is better. The model learns to produce responses humans prefer. Result: helpful, harmless, honest — and the personality you actually interact with.
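As a quick sanity check on the stage cards, the ratios between the quoted costs can be computed directly. This is only arithmetic on the rough dollar estimates given above, not measured compute; the dictionary keys and function name are made up for illustration.

```python
# Approximate per-stage costs in USD, copied from the stage cards above.
stage_costs = {
    "pre-training": 100_000_000,
    "sft": 1_000_000,
    "rlhf_dpo": 500_000,
}


def cost_ratio(earlier: str, later: str) -> float:
    """How many times more expensive the earlier stage is than the later one."""
    return stage_costs[earlier] / stage_costs[later]


pretrain_vs_sft = cost_ratio("pre-training", "sft")
sft_vs_alignment = cost_ratio("sft", "rlhf_dpo")

print(pretrain_vs_sft, sft_vs_alignment)  # 100.0 2.0
```

By these figures, almost all of the drop in cost happens between pre-training and SFT; the last stage is cheaper again, but only by about a factor of two.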

Part 02 · Hands on · Same prompt, three models

Watch the same input produce three different responses.

Pick a prompt. See exactly what a raw base model, an SFT-tuned model, and an RLHF-aligned model would each produce. The differences are the entire reason post-training exists.

How to read it.

These responses are representative of real model outputs at each training stage — based on published examples and well-documented behaviors. Notice how the same prompt elicits dramatically different responses depending on which training stage we stop at. Base tries to complete text patterns. SFT follows the instruction. RLHF produces the response humans actually prefer.

// Pick a prompt

User prompt

—

Base Pre-trained only

—

SFT + Instruction tuning

—

RLHF + Preference alignment

—

Part 03 · Pre-training

Predict the next token.
A trillion times.

The objective is shockingly simple.

For every text in the training data, scan through token by token and ask the model: given everything you've seen so far, what comes next? Compare the model's prediction to the actual next token, compute the loss, update the weights, move on.

That's it. The entire pre-training objective is one equation: maximize the probability of the next token given all previous tokens. No labels, no curated answers, no human feedback — just text in, predict, learn.
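To make the objective concrete, here is a minimal sketch of the next-token loss in plain Python. Minimizing the average negative log-probability of each actual next token is the same as maximizing the probability the text describes. The `uniform_model` function is a hypothetical stand-in for a real network, and the 4-token vocabulary is invented for the example.

```python
import math


def next_token_loss(token_ids, predict_probs):
    """Average negative log-likelihood of each token given everything before it."""
    total = 0.0
    for t in range(1, len(token_ids)):
        probs = predict_probs(token_ids[:t])     # distribution over the vocabulary
        total += -math.log(probs[token_ids[t]])  # penalty for the true next token
    return total / (len(token_ids) - 1)


def uniform_model(prefix, vocab_size=4):
    """Hypothetical 'untrained' model: guesses every token with equal probability."""
    return [1.0 / vocab_size] * vocab_size


loss = next_token_loss([0, 2, 1, 3], uniform_model)
print(round(loss, 4))  # 1.3863, i.e. log(4)
```

A model that has learned nothing scores log(vocabulary size); training pushes this number down by putting more probability on the tokens that actually come next.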

The training data is close to everything publicly available: web crawls, Wikipedia, code from GitHub, books, scientific papers, forum posts. For frontier models, that comes to ~15 trillion tokens, roughly the equivalent of every word a human would read in 30,000 lifetimes.
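The "30,000 lifetimes" comparison can be unpacked with a little arithmetic. The sketch below only divides the two figures quoted in the text; the 80-year reading span is an assumption added here to turn the result into a daily number, and tokens are treated loosely as words.

```python
# Figures quoted in the text.
TRAINING_TOKENS = 15e12   # ~15 trillion tokens
LIFETIMES = 30_000

# Reading volume per person implied by the comparison.
tokens_per_lifetime = TRAINING_TOKENS / LIFETIMES

# Assumption (not from the text): 80 years of reading, every single day.
READING_YEARS = 80
tokens_per_day = tokens_per_lifetime / (READING_YEARS * 365)

print(f"{tokens_per_lifetime:,.0f} tokens per lifetime")  # 500,000,000
print(f"{tokens_per_day:,.0f} tokens per day")            # 17,123
```

So the comparison implies a reader who gets through about half a billion tokens in a lifetime, which is a steady diet of long-form reading every day.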

From this, the model picks up syntax, grammar, factual knowledge, reasoning patterns, style — but not the goal of being helpful. It learned to mimic text. It didn't learn what text it should produce.

// Anatomy of pre-training

Objective: next-token prediction

Training data: ~15T tokens

Equivalent text: All of the public web

Duration: ~3-6 months

Hardware: ~10k+ GPUs (H100s)

Cost: $50M – $1B+

Result: "Base" / "foundation" model

Public examples: Llama-3-base, Mistral-base

Part 04 · Instruction tuning (SFT)

Teach it to respond.

// SFT training data · examples

Explain photosynthesis in one sentence.

Photosynthesis is the process by which plants convert sunlight, water, and CO₂ into glucose and oxygen using chlorophyll.

Write a Python function to reverse a string.

def reverse_string(s): return s[::-1]

What are the main causes of WWI?

The main causes were militarism, alliances, imperialism, and nationalism — often summarized as MAIN. The assassination of Archduke Franz Ferdinand in 1914 was the immediate trigger.

Translate "good morning" to French.

"Bonjour" (literally "good day") is the standard greeting. "Bonne matinée" exists but is rarely used as a greeting.

From completer to responder.

Show the model thousands of examples of the form (instruction, ideal response). Human contractors — typically with domain expertise — write the ideal responses by hand. The model is fine-tuned on these pairs using the same next-token objective as pre-training.

After SFT, the model has learned the format of being an assistant: when given a user instruction, produce a response that addresses it. The dataset is small relative to pre-training (10k–100k examples) but laser-targeted.
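Here is a small sketch of how one (instruction, response) pair might be prepared for fine-tuning. The objective is still next-token prediction; the one addition, a mask that scores only the response tokens, is a common practice assumed here rather than something the text above specifies. Whitespace splitting stands in for a real tokenizer, and the function names are illustrative.

```python
def build_sft_example(instruction_tokens, response_tokens):
    """Concatenate the pair and mark which positions count toward the loss."""
    tokens = list(instruction_tokens) + list(response_tokens)
    # Assumption: only response tokens are scored (1 = scored, 0 = ignored).
    loss_mask = [0] * len(instruction_tokens) + [1] * len(response_tokens)
    return tokens, loss_mask


def masked_mean_loss(per_token_losses, loss_mask):
    """Average the per-token losses over the scored positions only."""
    scored = [loss for loss, keep in zip(per_token_losses, loss_mask) if keep]
    return sum(scored) / len(scored)


# Toy tokenization by whitespace; a real pipeline would use the model's tokenizer.
instruction = 'Translate "good morning" to French .'.split()
response = ["Bonjour"]
tokens, loss_mask = build_sft_example(instruction, response)

print(loss_mask)                                            # [0, 0, 0, 0, 0, 0, 1]
print(masked_mean_loss([2.0, 2.0, 0.5, 1.5], [0, 0, 1, 1]))  # 1.0
```

With the mask in place, the model is graded on how well it writes the answer, not on how well it predicts the user's question.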

What SFT cannot teach: nuance. The model now responds, but it might respond unhelpfully, verbosely, or harmfully. It does what the data shows, no more, no less. Fixing that needs Stage 3.

Part 05 · Hands on · You as the labeler

Pick the better response.
Six times. Then look at what you taught us.

RLHF starts with humans rating pairs of model responses. Below: you become the labeler. After 6 pairs, we'll show you your implicit "reward signal" — a profile of what you preferred — and explain how this exact data shapes Claude's personality.

How it works.

You'll see six prompts, each with two candidate responses. Pick the one you'd rather receive. No "right" answers — your preferences matter. At the end, we'll show you the patterns. This is exactly what RLHF labelers do, except at scale (hundreds of thousands of pairs).

// User prompt

—

// Response A

—

// Response B

—

No wrong answers · click the response you'd rather receive

Your reward signal

Based on your 6 picks, here's what you implicitly told the model. This is the data RLHF turns into behavior.

What just happened

You labeled 6 preference pairs. A real RLHF dataset has ~100,000 such pairs, annotated by hundreds of humans. From this data, a reward model is trained to score any new (prompt, response) pair the way humans would. Then the language model is fine-tuned with reinforcement learning to maximize this reward.

The result: a model that produces responses your reward profile would approve of. Different labeler pools and labeling guidelines → different model personalities. This is one reason Claude, ChatGPT, and Gemini feel different even when their architectures and pre-training are similar.

Part 06 · How RLHF actually works

Three substeps.
One loop.

RLHF is itself a three-step process. You just completed step 1 (collect preferences). Here's what happens next.

Collect preferences

Human labeling at scale

For each prompt, generate 2-4 candidate responses. Show pairs to human labelers. Labelers pick the better one. Build a dataset of (prompt, chosen, rejected) triples.

Output: ~100k preference triples. Annotators trained on rubrics covering helpfulness, harmlessness, honesty (often called the 3H framework).
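The data structure from this step can be sketched in a few lines. The `pick_better` callable stands in for a human labeler, and comparing every pair of candidates is an assumption made for the example; real pipelines may sample pairs differently. The "shorter is better" labeler is a toy, not a real annotation rubric.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class PreferenceTriple:
    prompt: str
    chosen: str
    rejected: str


def build_triples(prompt, candidates, pick_better):
    """Turn a handful of candidates into (prompt, chosen, rejected) triples."""
    triples = []
    for a, b in combinations(candidates, 2):
        winner = pick_better(prompt, a, b)  # stands in for a human judgment
        loser = b if winner == a else a
        triples.append(PreferenceTriple(prompt, winner, loser))
    return triples


def toy_labeler(prompt, a, b):
    """Hypothetical labeler that simply prefers the shorter response."""
    return a if len(a) <= len(b) else b


candidates = ["Paris.", "The capital is Paris.", "Well, it depends on what you mean."]
triples = build_triples("What is the capital of France?", candidates, toy_labeler)
print(len(triples))  # 3
```

Three candidates yield three pairs, so a modest number of generations per prompt goes a long way toward a dataset of this size.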

Train reward model

Predict human preferences

Take a copy of the SFT model, replace the LM head with a single scalar output. Train it to predict which of two responses a human would prefer, using the labeled pairs.

Loss: −log σ(r(chosen) − r(rejected)), which shrinks as the chosen response outscores the rejected one. After training, this RM can score any (prompt, response) pair, automating the human's job.
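The pairwise loss is easy to compute by hand. The sketch below writes it as the negative log-sigmoid of the score gap, so that lower is better; the scores passed in are made-up numbers, where a real reward model would produce them from (prompt, response) pairs.

```python
import math


def preference_loss(r_chosen: float, r_rejected: float) -> float:
    """-log(sigmoid(r_chosen - r_rejected)), computed in a numerically stable way."""
    gap = r_chosen - r_rejected
    if gap >= 0:
        return math.log1p(math.exp(-gap))
    return -gap + math.log1p(math.exp(gap))


agree = preference_loss(2.0, 0.0)      # RM ranks the pair the way the human did
undecided = preference_loss(0.0, 0.0)  # RM has no opinion
disagree = preference_loss(0.0, 2.0)   # RM ranks the pair the wrong way round

print(round(agree, 4), round(undecided, 4), round(disagree, 4))  # 0.1269 0.6931 2.1269
```

The loss only depends on the gap between the two scores, which is why a reward model's absolute numbers are not meaningful on their own; only the ranking they induce is.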

RL fine-tuning (PPO)

Optimize the LM against the RM

Use the reward model as an automated judge. The LM generates responses, the RM scores them, and PPO updates the LM to produce higher-scoring responses. A KL penalty keeps the LM from drifting too far from the SFT model.

Objective: max E[r(x, y) − β·KL(π||π_SFT)]. β controls how much the model can stray. Output: your final chat model.
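For a single sampled response, the quantity inside the expectation can be estimated as below. This is only the reward-shaping part of the objective, not PPO itself; the single-sample KL estimate (a sum of per-token log-probability differences) and the β value of 0.1 are illustrative assumptions.

```python
def penalized_reward(rm_score, policy_logprobs, sft_logprobs, beta=0.1):
    """Reward-model score minus beta times a single-sample KL estimate."""
    # Sum over response tokens of log pi(y_t) - log pi_SFT(y_t).
    kl_estimate = sum(p - q for p, q in zip(policy_logprobs, sft_logprobs))
    return rm_score - beta * kl_estimate


# Policy identical to the SFT model: no penalty, the RM score passes through.
unchanged = penalized_reward(1.0, [-1.0, -1.0], [-1.0, -1.0])

# Policy has become more confident than the SFT model on the same tokens.
drifted = penalized_reward(1.0, [-0.5, -0.5], [-1.0, -1.0])

print(unchanged, round(drifted, 2))  # 1.0 0.9
```

Raising β makes drift more expensive, so the model stays closer to its SFT behavior; lowering it lets the reward model's preferences dominate.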

Part 07 · The modern recipe book

RLHF is no longer the only game in town.

Since the 2022 papers that brought RLHF to chat assistants, the alignment recipe has evolved. Two big ideas to know about:

2023 onwards

DPO

// Direct Preference Optimization

The mathematical trick: you don't actually need a separate reward model. The preference data can be used to optimize the language model directly via a clever reformulation of the RLHF objective.

Practical impact: simpler pipeline (no RM, no PPO), often comparable or better results, faster training. Many open-weight models (Llama 3, Mixtral) now use DPO in post-training, in some cases alongside other methods rather than as a full replacement for RLHF.

RLHF (3 steps):

prefs → RM → PPO → final model

DPO (1 step):

prefs → direct LM update → final model
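The "direct LM update" boils down to one loss per preference pair. The sketch below follows the commonly cited form of the DPO loss, where the implicit reward is β times the log-probability ratio between the model being trained and a frozen reference (typically the SFT model). The log-probabilities are made-up numbers and β = 0.1 is an illustrative choice.

```python
import math


def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """-log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    chosen_margin = policy_chosen_lp - ref_chosen_lp        # implicit reward, chosen
    rejected_margin = policy_rejected_lp - ref_rejected_lp  # implicit reward, rejected
    z = beta * (chosen_margin - rejected_margin)
    # Stable evaluation of -log(sigmoid(z)).
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# Before any update the policy equals the reference, so the loss is log(2).
start = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# After raising the chosen response's probability and lowering the rejected one's.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)

print(round(start, 4), improved < start)  # 0.6931 True
```

It has the same shape as the reward-model loss, except that the scores come from the language model's own log-probabilities, which is why no separate reward model is needed.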

Anthropic, 2022

Constitutional AI

// RLAIF — RL from AI Feedback

The labor-saving trick: use AI to do the labeling. Give the AI a constitution, a set of principles describing good behavior. The AI critiques its own responses against the constitution and revises them, and it also judges which of two responses better follows those principles. A model is then trained on the revisions and on these AI-generated preferences.

Practical impact: scale labeling without armies of humans. Less expensive, more consistent, lets labs encode explicit values. This is core to how Claude is trained.

Human labelers say:

"Response A feels better than B"

Constitution says:

"Prefer the response that is more helpful and honest"
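The contrast above can be sketched as a labeling function where the human is replaced by a judge that is handed an explicit principle. Everything here is a simplified illustration: `toy_judge` is a crude placeholder for what would really be a language model call, the majority vote is an assumption, and the single principle is the one quoted above.

```python
from typing import Callable, List, Tuple

CONSTITUTION: List[str] = [
    "Prefer the response that is more helpful and honest",
]

# A judge takes (principle, prompt, response_a, response_b) and returns "A" or "B".
Judge = Callable[[str, str, str, str], str]


def ai_preference(prompt: str, a: str, b: str, judge: Judge) -> Tuple[str, str]:
    """Return (chosen, rejected) by majority vote across the principles."""
    votes_for_a = sum(judge(p, prompt, a, b) == "A" for p in CONSTITUTION)
    a_wins = votes_for_a * 2 >= len(CONSTITUTION)  # ties go to A in this sketch
    return (a, b) if a_wins else (b, a)


def toy_judge(principle: str, prompt: str, a: str, b: str) -> str:
    """Placeholder only: treats the longer response as the more helpful one."""
    return "A" if len(a) >= len(b) else "B"


chosen, rejected = ai_preference(
    "Why is the sky blue?",
    "It just is.",
    "Air scatters blue light more strongly than red light.",
    toy_judge,
)
print(chosen)  # Air scatters blue light more strongly than red light.
```

The output has the same (chosen, rejected) shape as human-labeled data, so it can feed the same downstream training steps.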

Course 3 · Module 03 complete

The personality you talk to is a trained one.

You now know what each post-training stage does — and what it doesn't. You felt the difference between a base, SFT, and RLHF model. You generated your own reward signal. Most importantly: you understand that the "AI" you chat with is a layered artifact, built on top of a strange, raw, internet-trained base by carefully selected human (and sometimes AI) preferences.

Up next · Course 3 · Module 04

Mixture of Experts

Why are Mixtral and, reportedly, GPT-4 so much more efficient than their parameter counts suggest? Because they're not single dense networks; they're sparse mixtures. Each token gets routed through only a few "experts" out of many. Interactive: visualize the router deciding which experts to activate per token.

