Course 3 · Module 09 · 110 minutes

An agent
is an LLM
with a job.

Tool use (Module 8) gave the model hands. Multi-step reasoning gives it persistence. The patterns in this module — chain-of-thought, ReAct, self-consistency, reflection, planning — are what turn a single tool call into a 50-step autonomous task. They're the difference between Claude answering a question and Claude Code shipping a feature.

You'll compare

4 reasoning patterns

You'll watch

A live ReAct loop

You'll grasp

How Claude Code thinks

Show your work

Part 01 · The simplest trick that started it all

"Let's think
step by step."

Five words that unlock multi-step reasoning.

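In 2022, Wei et al. showed that prompting a model with worked examples that spell out their intermediate reasoning sharply improved accuracy on multi-step problems: math word problems, logic puzzles, reading comprehension. Later that year, Kojima et al. found that even with no examples at all, appending "Let's think step by step" to the prompt was often enough. Adding those five words sometimes more than doubled accuracy.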

The mechanism: without CoT, the model has to compute the final answer in a single forward pass. With CoT, it generates intermediate tokens that themselves become context for the next tokens. Each step of reasoning gets to attend to all the previous reasoning. The model effectively gets more "thinking time."

Modern frontier models do this implicitly: they've been trained to reason step by step even when not explicitly prompted. Reasoning models like o1, o3, and Claude 3.7 with extended thinking take this further, generating thousands of internal reasoning tokens before the user sees any output.

This insight — that intermediate generation helps — is the foundation that all later patterns build on. ReAct, reflection, self-consistency — all of them are CoT with extra structure.

// the canonical CoT example

// Direct prompt

"Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 balls. How many balls does Roger have now?"

→ "27" WRONG

// With "Let's think step by step"

"Roger has 5 tennis balls..."

→ "Roger started with 5 balls. 2 cans × 3 balls = 6 balls. Total: 5 + 6 = 11." CORRECT

"Reasoning steps elicit reasoning." — Wei et al., Chain-of-Thought Prompting Elicits Reasoning, 2022
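As a rough illustration of how little changes between the two prompts, here is a minimal Python sketch. The helper names (`build_prompt`, `extract_final_number`) and the "Q:/A:" layout are assumptions made for this example, and the completion string is copied from the example above rather than produced by a live model.

```python
import re
from typing import Optional

COT_TRIGGER = "Let's think step by step."

def build_prompt(question: str, use_cot: bool = True) -> str:
    """Return a direct prompt, or the same prompt with the CoT trigger appended."""
    prompt = f"Q: {question}\nA:"
    if use_cot:
        prompt += f" {COT_TRIGGER}"
    return prompt

def extract_final_number(completion: str) -> Optional[int]:
    """Crude heuristic: treat the last integer in the completion as the answer."""
    numbers = re.findall(r"-?\d+", completion)
    return int(numbers[-1]) if numbers else None

question = (
    "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 balls. How many balls does Roger have now?"
)
direct_prompt = build_prompt(question, use_cot=False)
cot_prompt = build_prompt(question)

# Completion text taken from the example above, not from a real model call.
completion = "Roger started with 5 balls. 2 cans × 3 balls = 6 balls. Total: 5 + 6 = 11."
answer = extract_final_number(completion)
print(answer)
```

The only difference between the two prompts is the trailing trigger sentence. Everything else (the reasoning, the arithmetic) comes from the tokens the model generates after it.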

Part 02 · Hands on · Four ways to reason

Same problem.
Four structures.

Once you accept that "intermediate generation helps," the next question is what kind of intermediate generation. This part applies four reasoning patterns to the same multi-step math problem.

The pattern decides the cost-vs-quality tradeoff.

Each pattern produces different output, with different latency and different cost. Direct is fastest but unreliable. CoT is the default. Self-Consistency samples several reasoning paths and takes a majority vote, trading 5-10× the cost for the highest accuracy. Reflection catches errors but roughly doubles latency. No pattern is universally best; the right choice depends on the stakes.
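To make the Self-Consistency cost concrete, here is a minimal sketch of the voting step. It assumes a `sample` callable that returns one final answer per call. `fake_sample` below is a hypothetical stand-in with canned outputs, where a real system would call a model at a nonzero temperature.

```python
from collections import Counter
from itertools import cycle
from typing import Callable

def self_consistent_answer(
    sample: Callable[[str], str], prompt: str, n: int = 5
) -> tuple[str, float]:
    """Sample n final answers and return the most common one with its vote share."""
    answers = [sample(prompt) for _ in range(n)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n

# Hypothetical stand-in for a sampled model: three paths agree, two go astray.
_canned = cycle(["5:06 PM", "5:06 PM", "5:32 PM", "5:06 PM", "4:51 PM"])

def fake_sample(prompt: str) -> str:
    return next(_canned)

answer, agreement = self_consistent_answer(fake_sample, "At what time do they meet?", n=5)
print(answer, agreement)
```

Each extra sample is a full model call, which is where the 5-10× cost comes from. The vote share is a rough confidence signal: low agreement suggests the problem deserves a closer look.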

// the shared test problem

"A train leaves Boston at 3:00pm traveling 60mph toward NYC. Another leaves NYC at 4:00pm traveling 80mph toward Boston. Boston-NYC is 215 miles. At what time do they meet?"

// correct answer: 5:06 PM
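For anyone who wants to check the arithmetic, here is one way to work it in Python. The head-start framing and the variable names are illustrative choices, not part of the original problem statement.

```python
from datetime import datetime, timedelta
from fractions import Fraction

DISTANCE_MILES = 215
BOSTON_SPEED, NYC_SPEED = 60, 80  # mph

boston_departs = datetime(2000, 1, 1, 15, 0)  # 3:00 PM (the date is arbitrary)
nyc_departs = datetime(2000, 1, 1, 16, 0)     # 4:00 PM

# Step 1: distance the Boston train covers alone before the NYC train leaves.
head_start_hours = Fraction((nyc_departs - boston_departs) / timedelta(hours=1))
head_start_miles = BOSTON_SPEED * head_start_hours  # 60 miles

# Step 2: the remaining gap closes at the sum of both speeds.
gap_miles = DISTANCE_MILES - head_start_miles        # 155 miles
closing_speed = BOSTON_SPEED + NYC_SPEED             # 140 mph
hours_to_meet = gap_miles / closing_speed            # 31/28 hours, about 66.4 minutes

# Step 3: convert back to a clock time.
meet_time = nyc_departs + timedelta(hours=float(hours_to_meet))
print(meet_time.strftime("%H:%M"))  # 17:06, i.e. 5:06 PM
```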


Part 03 · ReAct — reasoning meets tool use

When the agent
needs to look something up.


Chain-of-thought, but with tool calls mixed in.

ReAct (Yao et al., 2022) was among the first frameworks to combine reasoning traces with tool use. The model interleaves three kinds of tokens: Thought (reasoning), Action (tool call), Observation (tool result).

The genius is that each Thought can react to the previous Observation. The model isn't just emitting tool calls — it's explaining its plan, considering what it just learned, and adjusting before each call. The Thoughts are visible to the model in the next step's context, building a coherent narrative of the agent's progress.

This pattern is now the default for most agent systems. Claude Code, Devin, and later versions of AutoGPT all run a ReAct-style loop, often with extra structure (planning, reflection, sub-agents) layered on top. The core loop is the same.

Why it works: making reasoning explicit forces consistency. Without Thoughts, an agent might call get_weather("Tokyo") then forget what it was trying to compare to. With Thoughts, the agent's "intent" is in its own context window, anchoring every next decision.
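The loop itself is small enough to sketch. The version below is a toy under stated assumptions: `scripted_policy` is a hypothetical stand-in for the LLM, `get_weather` returns canned data rather than calling a real API, and the transcript format is made up for this example.

```python
from typing import Callable

# Hypothetical tool with canned data, standing in for a real weather API.
def get_weather(city: str) -> str:
    return {"Tokyo": "18C", "Paris": "12C"}.get(city, "unknown")

TOOLS: dict[str, Callable[[str], str]] = {"get_weather": get_weather}

def scripted_policy(transcript: list[str]) -> tuple[str, str, str]:
    """Stand-in for the model: returns (thought, action, argument) from the transcript."""
    seen = [line.split(": ", 1)[1] for line in transcript if line.startswith("Observation:")]
    if len(seen) == 0:
        return ("I need both temperatures. Start with Tokyo.", "get_weather", "Tokyo")
    if len(seen) == 1:
        return (f"Tokyo is {seen[0]}. Now fetch Paris to compare.", "get_weather", "Paris")
    return ("I have both readings and can answer.", "finish", f"Tokyo {seen[0]} vs Paris {seen[1]}")

def react(task: str, policy, max_cycles: int = 5) -> tuple[str, list[str]]:
    """Run Thought -> Action -> Observation cycles until the policy finishes."""
    transcript = [f"Task: {task}"]
    for _ in range(max_cycles):
        thought, action, arg = policy(transcript)  # the policy sees everything so far
        transcript.append(f"Thought: {thought}")
        transcript.append(f"Action: {action}({arg!r})")
        if action == "finish":
            return arg, transcript
        transcript.append(f"Observation: {TOOLS[action](arg)}")
    return "stopped: cycle limit reached", transcript

answer, transcript = react("Compare the weather in Tokyo and Paris.", scripted_policy)
print("\n".join(transcript))
```

Note that the second Thought quotes the first Observation. That is the anchoring effect described above: the agent's intent and findings stay in the transcript, so each decision is made with the full history in view. The `max_cycles` cap is a simple guard against loops that never finish.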

Part 04 · Hands on · The agent in action

Pick a task.
Watch it unfold.

Three real multi-step tasks. Step through one cycle at a time, or play through. Each cycle = one Thought + Action + Observation. Watch how the agent's plan evolves as new information comes in.

What you're watching.

This is the actual ReAct loop running. The Thoughts are the agent's internal reasoning (visible because we show them — production agents often render these in real-time too). The Actions are tool calls. The Observations are tool results. Notice how each Thought references what was just learned. That's how multi-step coherence happens.


Part 05 · How agents decompose

Big task.
Smaller pieces.

When the task is small, ReAct alone is enough — think, act, observe, done. When the task is large ("refactor this whole codebase," "research this entire industry"), the agent needs to plan first, then execute. Four common decomposition strategies.

// 01

Linear plan

The agent generates a numbered list of steps before doing anything, then executes them in order. Simple, predictable. Breaks down when steps depend on findings from earlier steps (which is most real tasks).

Example plan:
1. Read the README
2. Run the test suite
3. Identify failing tests
4. Read source for each failure
5. Apply fixes one by one
6. Re-run tests to confirm

// 02

Hierarchical · root + subgoals

The agent generates a tree: root task → high-level subgoals → concrete actions. Each subgoal is itself a mini-task that may spawn its own subgoals. Agents like Claude Code take a similar approach to complex refactors.

Example tree:
Goal: Add user auth
├─ Add database schema
│ ├─ Design User model
│ └─ Write migration
├─ Add login endpoint
└─ Add session middleware
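The tree above maps naturally onto a small recursive data structure. This is a sketch only: the `Goal` class and `leaf_actions` helper are names invented for this example, and in a real agent the subgoals would be generated by the model rather than written out by hand.

```python
from dataclasses import dataclass, field

@dataclass
class Goal:
    name: str
    subgoals: list["Goal"] = field(default_factory=list)

def leaf_actions(goal: Goal) -> list[str]:
    """Depth-first walk: a goal with no subgoals counts as a concrete action."""
    if not goal.subgoals:
        return [goal.name]
    actions: list[str] = []
    for sub in goal.subgoals:
        actions.extend(leaf_actions(sub))
    return actions

plan = Goal("Add user auth", [
    Goal("Add database schema", [
        Goal("Design User model"),
        Goal("Write migration"),
    ]),
    Goal("Add login endpoint"),
    Goal("Add session middleware"),
])

for step in leaf_actions(plan):
    print(step)
```

Walking the tree depth-first yields an execution order in which each subgoal's actions finish before the next subgoal starts, and any leaf can later be expanded into its own subtree if it turns out to be too big.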

// 03

Reactive · no plan

Just keep running ReAct cycles until done. No upfront plan. The agent decides each next step based purely on what it has observed so far. Most flexible — but easy to get lost in long tasks.

Example trajectory:
Thought → Action → Obs
Thought → Action → Obs
... (just keep going) ...
Thought → "I'm done." → Final

// 04

Tree of Thoughts

For problems with multiple viable paths (puzzles, planning, creative tasks): generate several candidate next moves at each step, evaluate them, expand only the most promising. Like beam search for reasoning.

Example expansion:
At step 3:
├─ Option A → est. quality 0.6
├─ Option B → est. quality 0.9 ✓
└─ Option C → est. quality 0.4
[expand B, prune A/C]
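The expand, evaluate, prune cycle can be sketched as a beam search. The toy problem below (pick three numbers from 1, 3, 5 that sum to 11) is an assumption chosen so the code runs on its own. In an actual Tree of Thoughts setup, `expand` and `score` would typically both be model calls, proposing candidate thoughts and rating them.

```python
from typing import Callable, Iterable

State = tuple[int, ...]

def tree_of_thoughts(
    root: State,
    expand: Callable[[State], Iterable[State]],
    score: Callable[[State], float],
    beam_width: int = 2,
    depth: int = 3,
) -> State:
    """At each step, expand every kept state, score the children, keep the best few."""
    frontier = [root]
    for _ in range(depth):
        candidates = [child for state in frontier for child in expand(state)]
        if not candidates:
            break
        candidates.sort(key=score, reverse=True)
        frontier = candidates[:beam_width]  # prune everything outside the beam
    return max(frontier, key=score)

TARGET = 11

def expand(state: State) -> list[State]:
    # Toy "candidate next moves": append one of three numbers.
    return [state + (n,) for n in (1, 3, 5)]

def score(state: State) -> float:
    # Toy evaluator: closer to the target sum is better.
    return -abs(TARGET - sum(state))

best = tree_of_thoughts((), expand, score, beam_width=2, depth=3)
print(best, sum(best))
```

With `beam_width=1` this collapses into a greedy, single-path search. Widening the beam buys robustness against a misleading early score, at the price of more expansions and evaluations per step.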

Part 06 · Agents shipping today

The systems that are
actually working.

Theory aside, here are the agent systems people are actually using and shipping. Each picks different points on the autonomy-vs-control spectrum.

Anthropic · 2025

Claude Code

// agentic coding for terminals

Architecture: ReAct + tools + planning
Tools: read, write, bash, search, edit
Cycles per task: often 50-200+
Autonomy: Asks for confirmation on risky ops
Visible reasoning: Yes (extended thinking)

Runs in your terminal. Given a task ("fix this bug," "implement this feature"), reads the codebase, plans, edits files, runs tests, iterates. The clearest example of ReAct + planning at scale — often runs hundreds of cycles before declaring success.

Cognition · 2024

Devin

// fully autonomous SWE agent

Architecture: ReAct + browser + shell + IDE
Distinctive: Browser-using agent
SWE-bench score: ~14% solo (when released)
Goal: Whole-task autonomy
Mode: Async (works overnight)

Aimed for "give it a Jira ticket, get back a PR." More autonomous than Claude Code — runs longer-horizon tasks without check-ins. Made headlines for being the first credible end-to-end SWE agent, even if its initial benchmarks were modest.

Open-source · 2023

AutoGPT / BabyAGI

// the OG "give it a goal" agents

Architecture: Plan + ReAct + memory
Distinctive: Built-in vector memory
Pattern: "goal-driven" loop
Status: Showed what was possible
Lasting impact: Inspired everything after

Released in 2023 when GPT-4 was new. Both went viral by showing "give a goal, the agent figures it out" worked at all. Often got stuck in loops — but the agentic primitives they popularized (planning, vector memory, tool use, self-evaluation) are now standard.

OpenAI · 2024

o1 / o3 / "reasoning models"

// agentic, internalized

Architecture: CoT trained into the model
Tokens per response: often 1000s of "thinking" tokens
Distinctive: Reasoning happens before output
Best at: Math, code, puzzles
Tradeoff: Higher latency + cost

Different approach: instead of orchestrating reasoning around the model, train the model to do longer reasoning natively. Pre-output CoT, often thousands of tokens of internal monologue. Claude 3.7+ "extended thinking" and Gemini's thinking models follow the same pattern.

Course 3 · Module 09 complete

The frontier of useful AI
is now mostly agentic.

You watched a ReAct loop unfold cycle by cycle. You compared four reasoning patterns side by side. You saw how planning decomposes big tasks. You understand why Claude Code can finish a feature you'd assign to a junior engineer. The mechanics are simple — Think, Act, Observe, Repeat — but applied recursively at scale, they're how AI starts shipping real work.

Up next · Course 3 · Module 10

Fine-tuning Techniques · LoRA & QLoRA

Pre-training a frontier model costs $100M+. Fine-tuning your own custom version used to cost $100k+. LoRA changed that. Train just 0.1% of parameters, get most of the benefit. QLoRA goes further — 4-bit quantization lets you fine-tune Llama-70B on a single GPU. Interactive: see why low-rank adapters work and what they cost to train.
