AI Skill Course Course 3 · Expert
Module 09 of 14
Course 3 · Module 09 · 110 minutes

An agent
is an LLM
with a job.

Tool use (Module 8) gave the model hands. Multi-step reasoning gives it persistence. The patterns in this module — chain-of-thought, ReAct, self-consistency, reflection, planning — are what turn a single tool call into a 50-step autonomous task. They're the difference between Claude answering a question and Claude Code shipping a feature.

You'll compare: 4 reasoning patterns
You'll watch: A live ReAct loop
You'll grasp: How Claude Code thinks
Show your work
[Figure: The ReAct loop. Task → Thought (what to do next) → Action (tool call) → Observation (tool result), repeating until done.]
Think · act · observe · repeat: the entire agentic AI pattern in 4 words.
Part 01 · The simplest trick that started it all

"Let's think
step by step."

Five words that unlock multi-step reasoning.

In 2022, Wei et al. showed that prompting a model with worked, step-by-step examples substantially improved its answers on multi-step problems, and Kojima et al. found that simply appending "Let's think step by step" helped even with no examples at all. Math word problems. Logic puzzles. Reading comprehension. Adding those five words to the prompt sometimes doubled accuracy.

The mechanism: without CoT, the model has to compute the final answer in a single forward pass. With CoT, it generates intermediate tokens that themselves become context for the next tokens. Each step of reasoning gets to attend to all the previous reasoning. The model effectively gets more "thinking time."

Modern frontier models do this implicitly: they've been trained to produce chain-of-thought reasoning even when not explicitly prompted. Reasoning models like o1 and o3, and Claude 3.7 Sonnet's extended thinking mode, take this further: thousands of internal reasoning tokens before the user sees any output.

This insight — that intermediate generation helps — is the foundation that all later patterns build on. ReAct, reflection, self-consistency — all of them are CoT with extra structure.

// the canonical CoT example
// Direct prompt
"Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 balls. How many balls does Roger have now?"
→ "27" WRONG
// With "Let's think step by step"
"Roger has 5 tennis balls..."
→ "Roger started with 5 balls. 2 cans × 3 balls = 6 balls. Total: 5 + 6 = 11." CORRECT
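A minimal sketch of how the two prompts above differ in code. Nothing here calls a model: `build_prompt` and `extract_final_number` are hypothetical helper names, the `Q:`/`A:` framing is an assumption, and the completion string is copied from the example rather than generated.

```python
import re
from typing import Optional

COT_TRIGGER = "Let's think step by step."

def build_prompt(question: str, use_cot: bool) -> str:
    """Return the prompt text; with CoT, the trigger phrase opens the answer."""
    if use_cot:
        return f"Q: {question}\nA: {COT_TRIGGER}"
    return f"Q: {question}\nA:"

def extract_final_number(completion: str) -> Optional[int]:
    """Treat the last integer in the completion as the final answer."""
    numbers = re.findall(r"-?\d+", completion)
    return int(numbers[-1]) if numbers else None

question = ("Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
            "Each can has 3 balls. How many balls does Roger have now?")

# Completion text copied from the example above, not produced by a model here.
completion = "Roger started with 5 balls. 2 cans × 3 balls = 6 balls. Total: 5 + 6 = 11."

print(build_prompt(question, use_cot=True))
print(extract_final_number(completion))  # 11
```

Parsing the last number is a common but fragile convention. It works here because the reasoning trace ends with the total.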
"Reasoning steps elicit reasoning." — Wei et al., Chain-of-Thought Prompting Elicits Reasoning, 2022
Part 02 · Hands on · Four ways to reason

Same problem.
Four structures.

Once you accept that "intermediate generation helps," the next question is what kind of intermediate generation. Below: four reasoning patterns applied to the same multi-step math problem.

The pattern decides the cost-vs-quality tradeoff.

Each pattern produces different output, takes different time, costs different money. Direct is fastest but unreliable. CoT is the default. Self-Consistency samples several CoT traces and keeps the majority answer, trading 5-10× cost for the highest accuracy. Reflection has the model critique and revise its own draft, which catches errors but roughly doubles latency. No pattern is universally best. The right choice depends on the stakes.
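Self-Consistency is the easiest of these to sketch: sample several answers and take a majority vote. The sampler below is a hypothetical stand-in that replays five scripted answers to this section's train problem. A real one would call a model with temperature above zero and parse the final answer from each trace.

```python
from collections import Counter
from typing import Callable, Iterator, Tuple

def self_consistency(sample_answer: Callable[[], str], n_samples: int = 5) -> Tuple[str, float]:
    """Draw n_samples answers and return the majority answer with its vote share."""
    answers = [sample_answer() for _ in range(n_samples)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n_samples

# Hypothetical scripted outputs standing in for five independent CoT runs.
scripted_runs: Iterator[str] = iter(["5:06 PM", "5:09 PM", "5:06 PM", "4:51 PM", "5:06 PM"])

answer, agreement = self_consistency(lambda: next(scripted_runs), n_samples=5)
print(answer, agreement)  # 5:06 PM 0.6
```

The vote share doubles as a rough confidence signal: low agreement suggests the problem deserves more samples or a different pattern.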

// the shared test problem
"A train leaves Boston at 3:00pm traveling 60mph toward NYC. Another leaves NYC at 4:00pm traveling 80mph toward Boston. Boston-NYC is 215 miles. At what time do they meet?"
// correct answer: 5:06 PM
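For reference, the arithmetic behind that answer, written out as a short script. The calendar date is arbitrary and only there so `datetime` can do the clock math.

```python
from datetime import datetime, timedelta

distance_miles = 215
boston_speed, nyc_speed = 60, 80             # mph
boston_departs = datetime(2024, 1, 1, 15, 0)  # 3:00pm, arbitrary date
nyc_departs = datetime(2024, 1, 1, 16, 0)     # 4:00pm

# Distance the Boston train covers alone before the NYC train leaves.
head_start_hours = (nyc_departs - boston_departs).total_seconds() / 3600
gap_at_second_departure = distance_miles - boston_speed * head_start_hours  # 155 miles

# After 4:00pm the trains close the gap at the sum of their speeds.
closing_speed = boston_speed + nyc_speed                      # 140 mph
hours_to_meet = gap_at_second_departure / closing_speed       # about 1.107 hours

meeting_time = nyc_departs + timedelta(hours=hours_to_meet)
print(meeting_time.strftime("%I:%M %p"))  # 05:06 PM
```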
Part 03 · ReAct — reasoning meets tool use

When the agent
needs to look something up.

// ReAct trajectory · 2 cycles
cycle 1
THOUGHT I need to find current weather in Tokyo first.
ACTION get_weather(city="Tokyo")
OBSERVATION {temp: 22, condition: "sunny"}
cycle 2
THOUGHT Good. Now I need Paris's weather to compare.
ACTION get_weather(city="Paris")
OBSERVATION {temp: 14, condition: "cloudy"}
FINAL ANSWER Tokyo: 22°C sunny. Paris: 14°C cloudy.

Chain-of-thought, but with tool calls mixed in.

ReAct (Yao et al., 2022) was one of the first frameworks to combine the two ideas. The transcript interleaves three kinds of text: Thought (reasoning), Action (tool call), and Observation (the tool result, which the runtime inserts rather than the model generating it).

The genius is that each Thought can react to the previous Observation. The model isn't just emitting tool calls — it's explaining its plan, considering what it just learned, and adjusting before each call. The Thoughts are visible to the model in the next step's context, building a coherent narrative of the agent's progress.

This pattern is now the default starting point for most agent systems. Claude Code, Devin, and modern AutoGPT are all built around a ReAct-style loop, often with extra structure (planning, reflection, sub-agents). The core loop is the same.

Why it works: making reasoning explicit forces consistency. Without Thoughts, an agent might call get_weather("Tokyo") then forget what it was trying to compare to. With Thoughts, the agent's "intent" is in its own context window, anchoring every next decision.
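The loop itself fits in a few lines. This is a sketch under stated assumptions, not how any particular product implements it: `scripted_policy` is a hypothetical stand-in for the model, `get_weather` returns canned data copied from the trajectory above, and the task wording is invented for illustration.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Canned data from the trajectory above; a real tool would call a weather API.
WEATHER = {"Tokyo": {"temp": 22, "condition": "sunny"},
           "Paris": {"temp": 14, "condition": "cloudy"}}

def get_weather(city: str) -> dict:
    return WEATHER[city]

TOOLS: Dict[str, Callable[..., dict]] = {"get_weather": get_weather}

# A step is (thought, (tool_name, kwargs)) or, when finished, (final_answer, None).
Step = Tuple[str, Optional[Tuple[str, dict]]]

def scripted_policy(transcript: List[str]) -> Step:
    """Stand-in for the model: chooses the next step from the observations so far."""
    seen = sum(line.startswith("OBSERVATION") for line in transcript)
    if seen == 0:
        return "I need to find current weather in Tokyo first.", ("get_weather", {"city": "Tokyo"})
    if seen == 1:
        return "Good. Now I need Paris's weather to compare.", ("get_weather", {"city": "Paris"})
    return "Tokyo: 22°C sunny. Paris: 14°C cloudy.", None

def react_loop(task: str, policy: Callable[[List[str]], Step],
               max_cycles: int = 10) -> Tuple[str, List[str]]:
    """Think, act, observe, repeat until the policy returns a final answer."""
    transcript = [f"TASK {task}"]
    for _ in range(max_cycles):
        text, action = policy(transcript)        # Thought, plus an optional Action
        if action is None:
            transcript.append(f"FINAL ANSWER {text}")
            return text, transcript
        tool_name, kwargs = action
        transcript.append(f"THOUGHT {text}")
        transcript.append(f"ACTION {tool_name}({kwargs})")
        observation = TOOLS[tool_name](**kwargs)  # the runtime, not the model, runs the tool
        transcript.append(f"OBSERVATION {observation}")
    raise RuntimeError("cycle budget exhausted without a final answer")

answer, transcript = react_loop("Compare the weather in Tokyo and Paris.", scripted_policy)
print("\n".join(transcript))
```

The transcript list plays the role of the context window: every Thought and Observation appended to it is visible to the policy on the next cycle. The `max_cycles` budget is an added safeguard so a confused policy cannot loop forever.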

Part 04 · Hands on · The agent in action

Pick a task.
Watch it unfold.

Three real multi-step tasks. Step through one cycle at a time, or play through. Each cycle = one Thought + Action + Observation. Watch how the agent's plan evolves as new information comes in.

What you're watching.

This is the actual ReAct loop running. The Thoughts are the agent's internal reasoning (visible because we show them — production agents often render these in real-time too). The Actions are tool calls. The Observations are tool results. Notice how each Thought references what was just learned. That's how multi-step coherence happens.

Part 05 · How agents decompose

Big task.
Smaller pieces.

When the task is small, ReAct alone is enough — think, act, observe, done. When the task is large ("refactor this whole codebase," "research this entire industry"), the agent needs to plan first, then execute. Four common decomposition strategies.

// 01

Linear plan

The agent generates a numbered list of steps before doing anything, then executes them in order. Simple, predictable. Breaks down when steps depend on findings from earlier steps (which is most real tasks).

Example plan
1. Read the README
2. Run the test suite
3. Identify failing tests
4. Read source for each failure
5. Apply fixes one by one
6. Re-run tests to confirm
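A linear plan maps directly onto a loop over steps. In this sketch each step is a hypothetical callable that reports success or failure, and step 5 is made to fail on purpose to show the weakness described above: the fixed plan has no way to adapt, so it simply stops.

```python
from typing import Callable, List, Tuple

PlanStep = Tuple[str, Callable[[], bool]]

def run_linear_plan(steps: List[PlanStep]) -> List[str]:
    """Run steps in order and stop at the first failure; the plan is never revised."""
    log = []
    for number, (description, action) in enumerate(steps, start=1):
        ok = action()
        log.append(f"{number}. {description}: {'ok' if ok else 'FAILED'}")
        if not ok:
            break
    return log

# Placeholder actions; real steps would read files, run tests, and so on.
plan: List[PlanStep] = [
    ("Read the README", lambda: True),
    ("Run the test suite", lambda: True),
    ("Identify failing tests", lambda: True),
    ("Read source for each failure", lambda: True),
    ("Apply fixes one by one", lambda: False),  # simulated failure
    ("Re-run tests to confirm", lambda: True),
]

log = run_linear_plan(plan)
print("\n".join(log))
```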
// 02

Hierarchical · root + subgoals

The agent generates a tree: root task → high-level subgoals → concrete actions. Each subgoal is itself a mini-task that may spawn its own subgoals. Coding agents such as Claude Code approach complex refactors in a similar way.

Example tree
Goal: Add user auth
├─ Add database schema
│  ├─ Design User model
│  └─ Write migration
├─ Add login endpoint
└─ Add session middleware
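The example tree can be represented as a small recursive structure. This sketch assumes the simplest execution order, a depth-first walk that treats any goal without subgoals as a concrete action. The `Goal` class and `leaf_actions` function are illustrative names, not part of any agent framework.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Goal:
    name: str
    subgoals: List["Goal"] = field(default_factory=list)

def leaf_actions(goal: Goal) -> List[str]:
    """Depth-first walk: a goal with no subgoals is a concrete action."""
    if not goal.subgoals:
        return [goal.name]
    actions: List[str] = []
    for sub in goal.subgoals:
        actions.extend(leaf_actions(sub))
    return actions

tree = Goal("Add user auth", [
    Goal("Add database schema", [Goal("Design User model"), Goal("Write migration")]),
    Goal("Add login endpoint"),
    Goal("Add session middleware"),
])

print(leaf_actions(tree))
# ['Design User model', 'Write migration', 'Add login endpoint', 'Add session middleware']
```

In a real agent the tree would grow during execution, since a subgoal can spawn new subgoals once the agent has looked at the code.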
// 03

Reactive · no plan

Just keep running ReAct cycles until done. No upfront plan. The agent decides each next step based purely on what it has observed so far. Most flexible — but easy to get lost in long tasks.

Example trajectory Thought → Action → Obs
Thought → Action → Obs
... (just keep going) ...
Thought → "I'm done." → Final
// 04

Tree of Thoughts

For problems with multiple viable paths (puzzles, planning, creative tasks): generate several candidate next moves at each step, evaluate them, expand only the most promising. Like beam search for reasoning.

Example expansion
At step 3:
├─ Option A → est. quality 0.6
├─ Option B → est. quality 0.9 ✓
└─ Option C → est. quality 0.4
   [expand B, prune A/C]
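The expand-evaluate-prune cycle can be sketched as a beam search over a toy problem: pick three numbers from 1 to 3 that sum to 7. Here `propose` and `evaluate` are hypothetical stand-ins; in an actual Tree of Thoughts setup both would typically be model calls, one generating candidate next steps and one rating them.

```python
from typing import List, Tuple

State = Tuple[int, ...]
TARGET = 7          # toy goal: three picks that sum to TARGET
MOVES = (1, 2, 3)
DEPTH = 3

def propose(state: State) -> List[State]:
    """Generate every candidate next move from a partial solution."""
    return [state + (move,) for move in MOVES]

def evaluate(state: State) -> float:
    """Heuristic quality estimate: closer to the target sum is better."""
    return -abs(TARGET - sum(state))

def tree_of_thoughts(beam_width: int = 2) -> State:
    frontier: List[State] = [()]
    for _ in range(DEPTH):
        candidates = [child for state in frontier for child in propose(state)]
        candidates.sort(key=evaluate, reverse=True)
        frontier = candidates[:beam_width]   # expand the best, prune the rest
    return frontier[0]

best = tree_of_thoughts()
print(best, sum(best))
```

The beam width sets the cost-versus-coverage tradeoff: a width of 1 collapses to greedy search, while a wider beam keeps more alternatives alive at each step.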
Part 06 · Agents shipping today

The systems that are
actually working.

Theory aside, here are the agent systems people are actually using and shipping. Each picks different points on the autonomy-vs-control spectrum.

Anthropic · 2025

Claude Code

// agentic coding for terminals
Architecture: ReAct + tools + planning
Tools: read, write, bash, search, edit
Cycles per task: often 50-200+
Autonomy: Asks for confirmation on risky ops
Visible reasoning: Yes (extended thinking)

Runs in your terminal. Given a task ("fix this bug," "implement this feature"), reads the codebase, plans, edits files, runs tests, iterates. The clearest example of ReAct + planning at scale: a single task often runs 50-200+ cycles before declaring success.

Cognition · 2024

Devin

// fully autonomous SWE agent
Architecture: ReAct + browser + shell + IDE
Distinctive: Browser-using agent
SWE-bench score: ~14% solo (when released)
Goal: Whole-task autonomy
Mode: Async (works overnight)

Aimed for "give it a Jira ticket, get back a PR." More autonomous than Claude Code — runs longer-horizon tasks without check-ins. Made headlines for being the first credible end-to-end SWE agent, even if its initial benchmarks were modest.

Open-source · 2023

AutoGPT / BabyAGI

// the OG "give it a goal" agents
Architecture: Plan + ReAct + memory
Distinctive: Built-in vector memory
Pattern: "goal-driven" loop
Status: Showed what was possible
Lasting impact: Inspired everything after

Released in 2023 when GPT-4 was new. Both went viral by showing "give a goal, the agent figures it out" worked at all. Often got stuck in loops — but the agentic primitives they popularized (planning, vector memory, tool use, self-evaluation) are now standard.

OpenAI · 2024

o1 / o3 / "reasoning models"

// agentic, internalized
Architecture: CoT trained into the model
Tokens per response: often 1000s of "thinking" tokens
Distinctive: Reasoning happens before output
Best at: Math, code, puzzles
Tradeoff: Higher latency + cost

Different approach: instead of orchestrating reasoning around the model, train the model to do longer reasoning natively. Pre-output CoT, often thousands of tokens of internal monologue. Claude 3.7+ "extended thinking" and Gemini's thinking models follow the same pattern.

Part 07 · Knowledge check

Five questions on what
you just orchestrated.

Aim for 4/5. Wrong answers explain themselves.

Course 3 · Module 09 complete

The frontier of useful AI
is now mostly agentic.

You watched a ReAct loop unfold cycle by cycle. You compared four reasoning patterns side by side. You saw how planning decomposes big tasks. You understand why Claude Code can finish a feature you'd assign to a junior engineer. The mechanics are simple — Think, Act, Observe, Repeat — but applied recursively at scale, they're how AI starts shipping real work.

Up next · Course 3 · Module 10

Fine-tuning Techniques · LoRA & QLoRA

Pre-training a frontier model costs $100M+. Fine-tuning your own custom version used to cost $100k+. LoRA changed that. Train just 0.1% of parameters, get most of the benefit. QLoRA goes further — 4-bit quantization lets you fine-tune Llama-70B on a single GPU. Interactive: see why low-rank adapters work and what they cost to train.
