A raw pre-trained model is shockingly weird. It doesn't follow instructions. It doesn't refuse harmful requests. It doesn't even know it's an "assistant." Post-training, meaning supervised fine-tuning (SFT) followed by reward modeling and reinforcement learning from human feedback (RLHF), turns that strange base into Claude, GPT-4, or Gemini. This module unpacks what each stage actually does, and lets you feel the difference yourself.
Trace the pipeline
Each stage takes the model from the previous stage and adds a new capability. Compute drops by ~100× at each step; the impact on final behavior grows. The most expensive stage is the first; the most important to user experience is the last.
Stage 1, pre-training: The model learns the patterns of language by predicting the next token, trillions of times, on ~all the text ever digitized. Result: a giant pattern-completer. It can finish your sentence, but it can't answer your question.
Stage 2, supervised fine-tuning (SFT): Fine-tune on instruction → ideal response pairs written by human experts. The model learns "when someone asks me a question, I answer it." Result: it now responds. But it may not respond well.
Stage 3, RLHF: Humans pick which of two responses is better. The model learns to produce responses humans prefer. Result: a helpful, harmless, honest model, and the personality you actually interact with.
Pick a prompt. See exactly what a raw base model, an SFT-tuned model, and an RLHF-aligned model would each produce. The differences are the entire reason post-training exists.
These responses are representative of real model outputs at each training stage — based on published examples and well-documented behaviors. Notice how the same prompt elicits dramatically different responses depending on which training stage we stop at. Base tries to complete text patterns. SFT follows the instruction. RLHF produces the response humans actually prefer.
For every text in the training data, scan through token by token and ask the model: given everything you've seen so far, what comes next? Compare the model's prediction to the actual next token, compute the loss, update the weights, move on.
That's it. The entire pre-training objective is one equation: maximize the probability of the next token given all previous tokens. No labels, no curated answers, no human feedback — just text in, predict, learn.
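To make the objective concrete, here is a minimal sketch of the next-token loss on a toy sequence. The `predict` callable and `uniform_model` are hypothetical stand-ins for a real model, and real training adds batching and gradient updates that are left out here.

```python
import math

def next_token_loss(token_ids, predict):
    """Average negative log-likelihood of each token given its prefix.

    `predict(prefix)` is a hypothetical stand-in for the model: it returns
    a dict mapping candidate next-token ids to probabilities.
    """
    total = 0.0
    for t in range(1, len(token_ids)):
        probs = predict(token_ids[:t])
        p_actual = probs.get(token_ids[t], 1e-12)  # floor avoids log(0)
        total += -math.log(p_actual)
    return total / (len(token_ids) - 1)

# Toy "model" (assumption): ignores the prefix, uniform over 4 tokens.
def uniform_model(prefix):
    return {tok: 0.25 for tok in range(4)}

loss = next_token_loss([0, 1, 2, 3], uniform_model)
print(round(loss, 4))  # 1.3863, which is log(4)
```

Maximizing the probability of the actual next token is the same thing as minimizing this loss. A model that has learned nothing sits at the log of the vocabulary size, and training pushes the number down.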
The training data is ~everything: web crawls, Wikipedia, code from GitHub, books, scientific papers, forum posts. For frontier models, ~15 trillion tokens. That's roughly the equivalent of every word a human would read in 30,000 lifetimes.
From this, the model picks up syntax, grammar, factual knowledge, reasoning patterns, style — but not the goal of being helpful. It learned to mimic text. It didn't learn what text it should produce.
Show the model thousands of examples of the form (instruction, ideal response). Human contractors — typically with domain expertise — write the ideal responses by hand. The model is fine-tuned on these pairs using the same next-token objective as pre-training.
After SFT, the model has learned the format of being an assistant: when given a user instruction, produce a response that addresses it. The dataset is small relative to pre-training (10k–100k examples) but laser-targeted.
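A minimal sketch of how one (instruction, ideal response) pair might become a training sequence. The "User:/Assistant:" template, the `toy_tokenize` helper, and the choice to compute the loss only on response tokens are illustrative assumptions. Masking the prompt is a common practice, but nothing above says any particular lab does it this way.

```python
def build_sft_example(instruction, response, tokenize):
    """Turn one (instruction, response) pair into tokens plus a loss mask.

    The chat template and the prompt masking are assumptions made for
    illustration, not details taken from the text above.
    """
    prompt_ids = tokenize(f"User: {instruction}\nAssistant:")
    response_ids = tokenize(f" {response}")
    input_ids = prompt_ids + response_ids
    # 0 = skip this position in the loss, 1 = train on it
    loss_mask = [0] * len(prompt_ids) + [1] * len(response_ids)
    return input_ids, loss_mask

def toy_tokenize(text):
    # Hypothetical whitespace "tokenizer"; returns strings, not integer ids.
    return text.split()

ids, mask = build_sft_example("Name a prime.", "Seven is prime.", toy_tokenize)
for token, flag in zip(ids, mask):
    print(flag, token)
```

The objective is still next-token prediction. What changes is the data: every sequence now has the shape "someone asks, the assistant answers."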
What SFT cannot teach: nuance. The model now responds, but it might respond unhelpfully, verbosely, or harmfully. It does what the data shows, no more, no less. Fixing that needs Stage 3.
RLHF starts with humans rating pairs of model responses. Below: you become the labeler. After 6 pairs, we'll show you your implicit "reward signal" — a profile of what you preferred — and explain how this exact data shapes Claude's personality.
You'll see six prompts, each with two candidate responses. Pick the one you'd rather receive. No "right" answers — your preferences matter. At the end, we'll show you the patterns. This is exactly what RLHF labelers do, except at scale (hundreds of thousands of pairs).
Based on your 6 picks, here's what you implicitly told the model. This is the data RLHF turns into behavior.
You labeled 6 preference pairs. A real RLHF dataset has ~100,000 such pairs, annotated by hundreds of humans. From this data, a reward model is trained to score any new (prompt, response) pair the way humans would. Then the language model is fine-tuned with reinforcement learning to maximize this reward.
The result: a model that produces responses your reward profile would approve of. Different labeler pools → different model personalities. This is one reason Claude, ChatGPT, and Gemini feel different even when their architectures and pre-training are similar.
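How does a reward model learn from "A is better than B"? One widely used choice, assumed here rather than taken from the text above, is a pairwise (Bradley-Terry style) loss: push the chosen response's score above the rejected one's. A minimal sketch on plain numbers:

```python
import math

def pairwise_preference_loss(score_chosen, score_rejected):
    """-log(sigmoid(score_chosen - score_rejected)), computed stably."""
    margin = score_chosen - score_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Equal scores: the reward model has no opinion yet.
print(round(pairwise_preference_loss(0.0, 0.0), 4))  # 0.6931, which is log(2)

# Ranking the pair correctly costs less than ranking it backwards.
print(pairwise_preference_loss(2.0, 0.0) < pairwise_preference_loss(0.0, 2.0))  # True
```

Only the difference between the two scores matters. The reward model never sees an absolute "7 out of 10" label, just which of two responses a human picked.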
RLHF is itself a three-step process. You just completed step 1 (collect preferences). Here's what happens next.
Step 1, collect preferences: For each prompt, generate 2-4 candidate responses. Show pairs to human labelers. Labelers pick the better one. Build a dataset of (prompt, chosen, rejected) triples.
Step 2, train a reward model (RM): Take a copy of the SFT model and replace the LM head with a single scalar output. Train it to predict which of two responses a human would prefer, using the labeled pairs.
Step 3, optimize with reinforcement learning: Use the reward model as a critic. The LM generates responses, the RM scores them, and PPO (proximal policy optimization) updates the LM to produce higher-scoring responses. A KL penalty prevents drifting too far from the SFT model.
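The KL penalty in the last step can be sketched as a simple adjustment to the reward. This assumes a common formulation in which the penalty is the summed per-token log-probability gap between the model being trained and the frozen SFT model; `beta=0.1` is an arbitrary illustrative value, and the PPO update itself is omitted.

```python
def kl_penalized_reward(rm_score, policy_logprobs, reference_logprobs, beta=0.1):
    """Reward-model score minus beta times a sampled KL estimate.

    `policy_logprobs` and `reference_logprobs` are the per-token log-probs
    of the same sampled response under the model being trained and under
    the frozen SFT model. `beta` is a hypothetical penalty strength.
    """
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, reference_logprobs))
    return rm_score - beta * kl_estimate

# No drift from the SFT model: the reward passes through unchanged.
print(kl_penalized_reward(1.0, [-1.0, -1.0], [-1.0, -1.0]))  # 1.0

# The policy now favors this response far more than the SFT model did.
drifted = kl_penalized_reward(1.0, [-0.5, -0.5], [-1.0, -1.0])
print(drifted < 1.0)  # True
```

Without this term, the LM can chase quirks of the reward model and drift into text that scores well but reads badly. The penalty makes that drift cost something.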
Since the original 2022 papers, the alignment recipe has evolved. Two big ideas to know about:
Direct Preference Optimization (DPO). The mathematical trick: you don't actually need a separate reward model. The preference data can be used to optimize the language model directly via a clever reformulation of the RLHF objective.
Practical impact: simpler pipeline (no RM, no PPO), often comparable or better results, faster training. Many open-weight models (for example Llama 3 and Mixtral) use DPO in their post-training, in place of or alongside PPO-based RLHF.
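Here is a minimal sketch of the DPO loss for a single preference pair, written on plain floats. The inputs are assumed to be summed log-probabilities of each full response under the model being trained and under a frozen reference model (typically the SFT model); `beta=0.1` is an illustrative value, not a recommendation.

```python
import math

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """-log(sigmoid(beta * (chosen_gain - rejected_gain))) for one pair."""
    # How much more (or less) likely each response is than under the reference.
    chosen_gain = policy_chosen_lp - ref_chosen_lp
    rejected_gain = policy_rejected_lp - ref_rejected_lp
    logits = beta * (chosen_gain - rejected_gain)
    # Numerically stable -log(sigmoid(logits))
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))

start = dpo_loss(-10.0, -10.0, -10.0, -10.0)   # policy equals reference
better = dpo_loss(-8.0, -12.0, -10.0, -10.0)   # chosen up, rejected down
print(round(start, 4))  # 0.6931, which is log(2)
print(better < start)   # True
```

It has the same shape as a pairwise reward-model loss, but the "score" is how far the language model itself has moved relative to the reference. That is why no separate reward model or RL loop is needed.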
Constitutional AI. The labor-saving trick: use AI to do the labeling. Give the AI a constitution, a set of principles describing good behavior. The AI critiques its own responses against the constitution and revises them; it also judges which of two responses better follows the constitution. A model is then trained on the revised responses and these AI-generated preferences.
Practical impact: scale labeling without armies of humans. Less expensive, more consistent, lets labs encode explicit values. This is core to how Claude is trained.
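The critique-and-revise loop can be sketched in a few lines. Everything here is hypothetical: `generate` stands in for a call to a language model, the prompt wording is invented for illustration, and `fake_generate` only exists to make the sketch runnable. It shows the control flow, not any lab's actual pipeline.

```python
def critique_and_revise(prompt, principles, generate):
    """Draft a response, then critique and revise it once per principle.

    `generate` is a hypothetical callable: text in, model output out.
    """
    response = generate(f"Respond to: {prompt}")
    for principle in principles:
        critique = generate(
            f"Critique this response against the principle '{principle}':\n"
            f"{response}"
        )
        response = generate(
            "Rewrite the response to address the critique.\n"
            f"Response: {response}\nCritique: {critique}"
        )
    return response

# Stand-in "model" so the sketch runs: it just numbers its outputs.
calls = []
def fake_generate(text):
    calls.append(text)
    return f"output {len(calls)}"

final = critique_and_revise(
    "Explain photosynthesis.", ["be honest", "be concise"], fake_generate
)
print(len(calls), final)  # 5 output 5
```

One draft plus a critique and a revision per principle gives five model calls for two principles. The revised responses are what later training would learn from.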
You now know what each post-training stage does — and what it doesn't. You felt the difference between a base, SFT, and RLHF model. You generated your own reward signal. Most importantly: you understand that the "AI" you chat with is a layered artifact, built on top of a strange, raw, internet-trained base by carefully selected human (and sometimes AI) preferences.