Tool use (Module 8) gave the model hands. Multi-step reasoning gives it persistence. The patterns in this module — chain-of-thought, ReAct, self-consistency, reflection, planning — are what turn a single tool call into a 50-step autonomous task. They're the difference between Claude answering a question and Claude Code shipping a feature.
Show your work
In 2022, Wei et al. showed that prompting a model with worked, step-by-step examples substantially improved its performance on multi-step problems: math word problems, logic puzzles, reading comprehension. Later that year, Kojima et al. found that simply appending "Let's think step by step" to the prompt had a similar effect. Adding that one short phrase sometimes doubled accuracy.
The mechanism: without CoT, the model has to compute the final answer in a single forward pass. With CoT, it generates intermediate tokens that themselves become context for the next tokens. Each step of reasoning gets to attend to all the previous reasoning. The model effectively gets more "thinking time."
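One way to picture the mechanism is with a toy, not a real model. In the sketch below, `one_pass` is a hypothetical stand-in for a single forward pass that can apply only one arithmetic operation; the function names and the problem are illustrative assumptions. Answering directly gets one pass, while the chain-of-thought version writes each intermediate value back into its context and reads it on the next pass.

```python
import operator

OPS = {"+": operator.add, "*": operator.mul, "-": operator.sub}


def one_pass(context, pending):
    """Toy stand-in for one forward pass: applies exactly one operation
    to the most recent value in the context."""
    symbol, operand = pending[0]
    latest = context[-1]
    return OPS[symbol](latest, operand), pending[1:]


def answer_directly(start, steps):
    """One pass only: whatever is not finished in that pass is lost."""
    value, _ = one_pass([start], steps)
    return value


def answer_with_cot(start, steps):
    """Repeated passes: each intermediate result is appended to the
    context, so the next pass can build on it."""
    context = [start]
    pending = steps
    while pending:
        value, pending = one_pass(context, pending)
        context.append(value)
    return context[-1], context


# Illustrative problem: ((3 + 4) * 2) - 5
steps = [("+", 4), ("*", 2), ("-", 5)]
direct = answer_directly(3, steps)
final, trace = answer_with_cot(3, steps)
print(direct, final, trace)
```

Here the direct call stops at 7 after a single operation, while the step-by-step loop reaches 9 with the context `[3, 7, 14, 9]`. The analogy is loose, but it captures why writing intermediate results down helps.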
Modern frontier models do this implicitly: they've been trained to reason step by step even when not explicitly prompted. Reasoning models like o1 and o3, and Claude 3.7 Sonnet's extended thinking mode, take this further, often producing thousands of internal reasoning tokens before the user sees any output.
This insight — that intermediate generation helps — is the foundation that all later patterns build on. ReAct, reflection, self-consistency — all of them are CoT with extra structure.
Once you accept that "intermediate generation helps," the next question is what kind of intermediate generation. Consider four reasoning patterns applied to the same multi-step math problem: direct answering, chain-of-thought, self-consistency, and reflection.
Each pattern produces different output at a different latency and cost. Direct answering is fastest but unreliable. CoT is the default. Self-consistency samples several reasoning chains and takes a majority vote over their final answers, trading roughly 5-10× the cost for the highest accuracy. Reflection has the model critique and revise its own draft, which catches errors but roughly doubles latency. No pattern is universally best; the right choice depends on the stakes.
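Self-consistency is the easiest of these patterns to sketch in code. The version below is a minimal illustration under stated assumptions: `sample_answer` stands in for one sampled chain-of-thought completion reduced to its final answer, and `make_scripted_sampler` is a hypothetical helper that replays canned answers so the example runs without a model.

```python
from collections import Counter
from typing import Callable, List, Tuple


def self_consistency(
    sample_answer: Callable[[str], str],
    question: str,
    n_samples: int = 5,
) -> Tuple[str, float]:
    """Sample several answers and return the majority answer together
    with the fraction of samples that agreed with it."""
    answers = [sample_answer(question) for _ in range(n_samples)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n_samples


def make_scripted_sampler(outputs: List[str]) -> Callable[[str], str]:
    """Hypothetical stand-in for a model: replays canned final answers."""
    remaining = iter(outputs)
    return lambda question: next(remaining)


# Five sampled chains; two of them made an arithmetic slip.
sampler = make_scripted_sampler(["18", "17", "18", "18", "20"])
answer, agreement = self_consistency(sampler, "How many apples are left?")
print(answer, agreement)
```

The cost multiplier is visible in the signature: `n_samples` completions are paid for to produce one answer. The agreement fraction is a rough confidence signal, which is one reason the pattern suits high-stakes questions.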
ReAct (Yao et al., 2022) combined the two ideas, reasoning and tool use, in a single loop. The transcript interleaves three kinds of entries: Thought (the model's reasoning), Action (a tool call the model emits), and Observation (the tool result, fed back into the context).
The genius is that each Thought can react to the previous Observation. The model isn't just emitting tool calls — it's explaining its plan, considering what it just learned, and adjusting before each call. The Thoughts are visible to the model in the next step's context, building a coherent narrative of the agent's progress.
This pattern is now the default for most serious agent systems. Claude Code, Devin, and modern AutoGPT all run a ReAct-style loop under the hood, often with extra structure (planning, reflection, sub-agents). The core loop is the same.
Why it works: making reasoning explicit forces consistency. Without Thoughts, an agent might call get_weather("Tokyo") then forget what it was trying to compare to. With Thoughts, the agent's "intent" is in its own context window, anchoring every next decision.
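The Thought, Action, Observation loop can be sketched in a few lines. This is a minimal illustration, not how any particular product implements it: `scripted_policy` is a hypothetical stand-in for the model, `get_weather` returns canned data, and the "finish" action name is an assumption made for this example.

```python
from typing import Callable, Dict, List, Optional, Tuple


def get_weather(city: str) -> str:
    """Hypothetical tool returning canned data."""
    return {"Tokyo": "22C, clear", "Osaka": "25C, rain"}.get(city, "unknown")


def scripted_policy(transcript: List[str]) -> Tuple[str, str, str]:
    """Stand-in for the model: reads the transcript so far and returns
    (thought, action_name, action_argument)."""
    seen = [t.split(": ", 1)[1] for t in transcript if t.startswith("Observation:")]
    if len(seen) == 0:
        return "I need Tokyo's weather first.", "get_weather", "Tokyo"
    if len(seen) == 1:
        return "Tokyo is known; now I need Osaka to compare.", "get_weather", "Osaka"
    return "I have both readings.", "finish", f"Tokyo: {seen[0]}; Osaka: {seen[1]}"


def react(
    task: str,
    policy: Callable[[List[str]], Tuple[str, str, str]],
    tools: Dict[str, Callable[[str], str]],
    max_cycles: int = 5,
) -> Tuple[Optional[str], List[str]]:
    """Run Thought -> Action -> Observation cycles until the policy finishes."""
    transcript = [f"Task: {task}"]
    for _ in range(max_cycles):
        thought, action, arg = policy(transcript)
        transcript.append(f"Thought: {thought}")  # intent stays in context
        if action == "finish":
            return arg, transcript
        transcript.append(f"Action: {action}({arg!r})")
        transcript.append(f"Observation: {tools[action](arg)}")
    return None, transcript  # gave up after max_cycles


answer, transcript = react(
    "Compare the weather in Tokyo and Osaka",
    scripted_policy,
    {"get_weather": get_weather},
)
print("\n".join(transcript))
```

Because every Thought is appended to `transcript`, the second call to the policy can see both what was learned and why it was asked. The `max_cycles` cap is a common safeguard against an agent that never decides to finish.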
Three real multi-step tasks show the loop in action. Each cycle is one Thought, one Action, and one Observation, and the agent's plan evolves as new information comes in.
This is the actual ReAct loop running. The Thoughts are the agent's internal reasoning (visible because we show them — production agents often render these in real-time too). The Actions are tool calls. The Observations are tool results. Notice how each Thought references what was just learned. That's how multi-step coherence happens.
When the task is small, ReAct alone is enough — think, act, observe, done. When the task is large ("refactor this whole codebase," "research this entire industry"), the agent needs to plan first, then execute. Four common decomposition strategies.
The agent generates a numbered list of steps before doing anything, then executes them in order. Simple, predictable. Breaks down when steps depend on findings from earlier steps (which is most real tasks).
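A plan-first agent of this kind might look like the sketch below. The planner and executor are hypothetical stubs (a real agent would call a model and real tools), and the step names are invented for illustration.

```python
from typing import Callable, List, Tuple


def stub_planner(task: str) -> List[str]:
    """Hypothetical planner: a real one would ask the model for steps."""
    return [
        "locate the failing test",
        "read the function under test",
        "patch the bug",
        "rerun the tests",
    ]


def stub_executor(step: str) -> str:
    """Hypothetical executor: a real one would call tools."""
    return f"done: {step}"


def plan_then_execute(
    task: str,
    planner: Callable[[str], List[str]],
    executor: Callable[[str], str],
) -> List[Tuple[int, str, str]]:
    """Generate the whole plan up front, then run the steps in order.
    The plan is frozen: nothing learned in step 2 can change step 3."""
    plan = planner(task)
    return [(i, step, executor(step)) for i, step in enumerate(plan, start=1)]


results = plan_then_execute("fix the login bug", stub_planner, stub_executor)
for number, step, outcome in results:
    print(number, step, "->", outcome)
```

The weakness described above shows up in the structure: `planner` is called exactly once, before any step has produced a finding.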
The agent generates a tree: root task → high-level subgoals → concrete actions. Each subgoal is itself a mini-task that may spawn its own subgoals. This is how Claude Code handles complex refactors.
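The tree structure can be sketched with a small recursive data type. The goal names are invented for illustration, and this is not a description of how Claude Code is implemented; it only shows the shape of a root task, subgoals, and concrete actions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Goal:
    """A node in the task tree; a node without subgoals is a concrete action."""
    name: str
    subgoals: List["Goal"] = field(default_factory=list)


def leaf_actions(goal: Goal) -> List[str]:
    """Walk the tree depth-first and collect the concrete actions in order."""
    if not goal.subgoals:
        return [goal.name]
    actions: List[str] = []
    for sub in goal.subgoals:
        actions.extend(leaf_actions(sub))
    return actions


# Hypothetical decomposition of a refactoring task.
root = Goal("refactor the payment module", [
    Goal("understand the current code", [
        Goal("list call sites"),
        Goal("read the tests"),
    ]),
    Goal("apply the change", [
        Goal("extract an interface"),
        Goal("update the callers"),
    ]),
    Goal("run the test suite"),
])

actions = leaf_actions(root)
print(actions)
```

In a real agent each subgoal would typically be expanded lazily, when the agent reaches it, so that the decomposition can use what earlier subgoals discovered.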
Just keep running ReAct cycles until done. No upfront plan. The agent decides each next step based purely on what it has observed so far. Most flexible — but easy to get lost in long tasks.
For problems with multiple viable paths (puzzles, planning, creative tasks): generate several candidate next moves at each step, evaluate them, expand only the most promising. Like beam search for reasoning.
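The generate, evaluate, and expand cycle can be sketched as a beam search over a toy puzzle. Everything here is an illustrative assumption: the two moves, the target, and the scoring function stand in for model-generated candidate thoughts and a model-based evaluator.

```python
import heapq
from typing import Callable, Dict, List, Tuple

# Hypothetical candidate moves available at every step.
MOVES: Dict[str, Callable[[int], int]] = {
    "+3": lambda x: x + 3,
    "*2": lambda x: x * 2,
}


def score(value: int, target: int) -> int:
    """Higher is better: closeness to the target."""
    return -abs(target - value)


def beam_search(
    start: int, target: int, depth: int = 3, beam_width: int = 2
) -> Tuple[int, List[str]]:
    """At each step, expand every state in the beam with every move,
    score the candidates, and keep only the best `beam_width`."""
    beam: List[Tuple[int, List[str]]] = [(start, [])]
    for _ in range(depth):
        candidates = [
            (move(value), path + [name])
            for value, path in beam
            for name, move in MOVES.items()
        ]
        beam = heapq.nlargest(
            beam_width, candidates, key=lambda c: score(c[0], target)
        )
    return beam[0]  # nlargest returns candidates best-first


best_value, best_path = beam_search(start=1, target=20)
print(best_value, best_path)
```

Starting from 1 with a target of 20, three steps, and a beam of 2, the search ends at 16 via `+3`, `*2`, `*2`. A wider beam explores more paths at proportionally higher cost, the same trade-off that applies when each candidate is a model call.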
Theory aside, here are the agent systems people are actually using and shipping. Each picks different points on the autonomy-vs-control spectrum.
Claude Code runs in your terminal. Given a task ("fix this bug," "implement this feature"), it reads the codebase, plans, edits files, runs tests, and iterates. It is the clearest example of ReAct plus planning at scale, and often runs hundreds of cycles before declaring success.
Devin aims for "give it a Jira ticket, get back a PR." It is more autonomous than Claude Code, running longer-horizon tasks without check-ins. It made headlines as the first credible end-to-end SWE agent, even if its initial benchmarks were modest.
AutoGPT and BabyAGI were released in 2023, when GPT-4 was new. Both went viral by showing that "give a goal, the agent figures it out" worked at all. They often got stuck in loops, but the agentic primitives they popularized (planning, vector memory, tool use, self-evaluation) are now standard.
Reasoning models such as o1 and o3 take a different approach: instead of orchestrating reasoning around the model, train the model to do longer reasoning natively. They produce CoT before any output, often thousands of tokens of internal monologue. Claude 3.7+ "extended thinking" and Gemini's thinking models follow the same pattern.
You watched a ReAct loop unfold cycle by cycle. You compared four reasoning patterns side by side. You saw how planning decomposes big tasks. You understand why Claude Code can finish a feature you'd assign to a junior engineer. The mechanics are simple — Think, Act, Observe, Repeat — but applied recursively at scale, they're how AI starts shipping real work.