Course 3 · Module 04 · 75 minutes

Not one network.
Eight networks.
Most asleep.

In a Mixture of Experts model, the feed-forward layer of each transformer block isn't one big network: it's many smaller "expert" networks running in parallel. A router decides which few experts (often 1-2) each token visits. The rest sit idle. The result: a model with tens or hundreds of billions of parameters, where only a fraction (roughly 5-30% in the models covered below) is used per token. Mixtral and DeepSeek-V3 work this way, and GPT-4 is rumored to as well.

You'll watch

Tokens get routed live

You'll feel

Sparse activation

You'll see

Why GPT-4 is fast

Crack open MoE

Part 01 · The motivation

More capacity.
Same compute.

Every transformer block has two big sub-blocks: attention and feed-forward (FFN). The FFN is where most parameters live (Module 1, last question). The MoE idea: replace one big FFN with many smaller FFNs ("experts") and a router that picks just 1-2 per token. You get the storage capacity of a much larger model, but pay roughly the per-token compute of a small one. The catch: every expert still has to be held in memory, so the savings are in compute, not in storage.

// Dense model

Every parameter, every token

Standard transformer FFN

Param scaling: expensive

Inference cost: scales with all params

Examples: GPT-3, Llama 3 70B

→

// Sparse · MoE

Only the relevant experts run

Mixture of Experts FFN

Param scaling: cheap

Inference cost: scales with active params only

Examples: Mixtral, DeepSeek-V3, GPT-4(?)
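To make the dense-versus-sparse comparison concrete, here is a rough weight count for a single FFN layer. This is a sketch under stated assumptions: the layer sizes and the helper names `ffn_params` and `moe_ffn_params` are hypothetical, each expert is modeled as a plain two-matrix FFN, and biases and the attention block are ignored.

```python
def ffn_params(d_model: int, d_ff: int) -> int:
    """Weights in a plain two-matrix FFN (up-projection + down-projection), no biases."""
    return 2 * d_model * d_ff


def moe_ffn_params(d_model: int, d_ff: int, n_experts: int, top_k: int) -> tuple:
    """Return (total, active-per-token) weights for one MoE FFN layer.

    Assumes each expert is its own FFN and the router is a single
    d_model x n_experts matrix that always runs.
    """
    expert = ffn_params(d_model, d_ff)
    router = d_model * n_experts
    total = n_experts * expert + router
    active = top_k * expert + router
    return total, active


# Hypothetical sizes, chosen only for illustration.
D_MODEL, D_FF = 1024, 4096
dense = ffn_params(D_MODEL, D_FF)
total, active = moe_ffn_params(D_MODEL, D_FF, n_experts=8, top_k=2)

print(f"dense FFN weights:    {dense:,}")
print(f"MoE total weights:    {total:,}")
print(f"MoE active per token: {active:,}")
print(f"active / total:       {active / total:.1%}")
```

With 8 experts and top-2 routing, the layer stores about eight FFNs' worth of weights while each token touches about two of them; the router adds a comparatively tiny matrix on top.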

Part 02 · Hands on · Live routing

Type a sentence.
Watch each token find its experts.

Below is a simulation of an 8-expert MoE layer. Each expert is colored and tagged by its "specialty." Type or pick a sentence — every token gets scored against all 8 experts; only the top 2 win. Different sentences activate different experts. Click any token to see its full routing scores.

How to read it.

Each token card shows the token at top, then 8 horizontal bars (one per expert). The 2 active experts have full-color bars; others are dim. Below each card: the names + gating weights of the 2 chosen experts. The right panel tracks cumulative utilization — how many tokens went to each expert in this sentence. Try the presets to see how routing patterns shift dramatically across domains.

E1 · Code

E2 · Math

E3 · Punctuation

E4 · Common words

E5 · Narrative

E6 · Technical

E7 · Entities

E8 · Grammar


Part 03 · How routing actually works

Three steps.
Two simple layers.

The router itself is shockingly small — just one linear layer with N outputs (one per expert). Here's exactly what happens when a token hits the MoE block.

Score every expert

Router linear layer

The token's hidden state h is multiplied by a small router weight matrix W_r to produce N scores (one per expert).

scores = h · W_r
shape: [1 × N]
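The scoring step can be sketched in a few lines of plain Python. The function name `router_scores`, the sizes `D` and `N`, and the random weights are assumptions for illustration; a real router would use learned weights and a tensor library.

```python
import random


def router_scores(h, W_r):
    """Compute scores = h · W_r for one token.

    h is a list of length d (the hidden state); W_r is a d x N nested list.
    Returns a list of N scores, one per expert.
    """
    n_experts = len(W_r[0])
    return [sum(h[i] * W_r[i][j] for i in range(len(h))) for j in range(n_experts)]


# Hypothetical sizes: hidden width 4, 8 experts. Weights are random, not learned.
rng = random.Random(0)
D, N = 4, 8
h = [rng.gauss(0, 1) for _ in range(D)]
W_r = [[rng.gauss(0, 1) for _ in range(N)] for _ in range(D)]

scores = router_scores(h, W_r)
print(len(scores), "scores, one per expert")
```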

Pick the top-k

Sparse selection

Take the k experts with the highest scores (typically k=2). Set the other scores to -∞. Apply softmax to the remaining k. These become the gating weights.

g = softmax(top_k(scores))
only k values non-zero
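A minimal sketch of the selection step, assuming the scores arrive as a plain Python list. Masking the losers to -∞ and applying softmax over everything gives the same result as applying softmax to just the k winners, which is what this hypothetical `top_k_gates` helper does.

```python
import math


def top_k_gates(scores, k=2):
    """Return gating weights: zero everywhere except the k highest-scoring experts."""
    # Indices of the k largest scores.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # Softmax over the winners only (subtracting the max for numerical stability).
    m = max(scores[i] for i in top)
    exps = {i: math.exp(scores[i] - m) for i in top}
    z = sum(exps.values())
    return [exps[i] / z if i in exps else 0.0 for i in range(len(scores))]


# Made-up scores for 8 experts; experts 1 and 6 score highest.
example_scores = [0.1, 2.0, -1.0, 0.5, 1.0, -0.3, 1.2, 0.0]
gates = top_k_gates(example_scores, k=2)
print([round(g, 3) for g in gates])
```

The gating weights of the chosen experts always sum to 1, so the combined output in the next step is a weighted average of the selected experts' outputs.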

Run + combine

Weighted sum of expert outputs

Run the token through ONLY those k experts. Each expert produces an output. Multiply by gating weights, sum. Done — the other N-k experts never executed.

out = Σ g_i · Expert_i(h)
only i ∈ top-k
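Putting the three steps together, here is a toy end-to-end sketch of one token passing through an MoE block. Everything about the experts is an assumption for illustration: each is a tiny linear, ReLU, linear network with random weights, and a call counter is attached so you can confirm that only k experts actually execute.

```python
import math
import random


def matvec(h, W):
    """Row vector h (length d) times matrix W (d x m), as nested lists."""
    return [sum(h[i] * W[i][j] for i in range(len(h))) for j in range(len(W[0]))]


def make_expert(W_in, W_out):
    """A toy expert: linear -> ReLU -> linear. Tracks how many times it was called."""
    def expert(h):
        expert.calls += 1
        hidden = [max(0.0, x) for x in matvec(h, W_in)]
        return matvec(hidden, W_out)
    expert.calls = 0
    return expert


def moe_forward(h, W_r, experts, k=2):
    """Score, pick top-k, run only those experts, and return (output, chosen indices)."""
    scores = matvec(h, W_r)                                   # step 1: score
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    m = max(scores[i] for i in top)
    exps = [math.exp(scores[i] - m) for i in top]             # step 2: softmax over top-k
    z = sum(exps)
    out = [0.0] * len(h)
    for i, e in zip(top, exps):                               # step 3: run + combine
        y = experts[i](h)
        g = e / z
        out = [o + g * yi for o, yi in zip(out, y)]
    return out, top


# Hypothetical sizes and random (untrained) weights.
rng = random.Random(0)
D, D_FF, N = 4, 8, 8


def rand_matrix(rows, cols):
    return [[rng.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]


W_r = rand_matrix(D, N)
experts = [make_expert(rand_matrix(D, D_FF), rand_matrix(D_FF, D)) for _ in range(N)]
h = [rng.gauss(0, 1) for _ in range(D)]

out, chosen = moe_forward(h, W_r, experts, k=2)
print("chosen experts:  ", chosen)
print("experts that ran:", [i for i, e in enumerate(experts) if e.calls])
```

The call counters make the sparsity visible: the other N-k experts are never invoked, which is where the compute saving comes from.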

Part 04 · Hands on · The collapse problem

Without load balancing,
one expert wins everything.

The router is trained alongside the rest of the model. Early in training, one expert randomly gets a slight edge, so the router sends it more tokens. That expert receives more gradient signal and improves, which attracts even more tokens. Left unchecked, this feedback loop can collapse routing until effectively one of the 8 experts does all the work, and the whole MoE advantage disappears. The fix: an auxiliary loss that penalizes uneven expert usage.

What you're seeing.

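Realistic post-training expert utilization across 10,000 tokens. Without aux loss: one or two experts dominate (collapse). With aux loss: usage spreads roughly evenly. The aux loss term added to training is L_aux = α · N · Σ_i f_i · P_i, where f_i is the fraction of tokens routed to expert i, P_i is the mean router probability for expert i, N is the number of experts, and α is a small weighting coefficient. Under uniform routing (f_i = P_i = 1/N) the loss equals α, and it grows as tokens and router probability concentrate on the same few experts, so minimizing it pushes usage toward uniformity.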
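Here is a small sketch of that auxiliary loss for a batch of tokens. It assumes top-1 routing for simplicity (each token is assigned to exactly one expert), and the function name `aux_loss`, the default `alpha=0.01`, and both toy batches are illustrative choices rather than values from any particular model.

```python
def aux_loss(assignments, router_probs, n_experts, alpha=0.01):
    """L_aux = alpha * N * sum_i f_i * P_i.

    assignments:  one expert index per token (top-1 routing assumed).
    router_probs: one probability distribution over all experts per token.
    """
    n_tokens = len(assignments)
    # f_i: fraction of tokens routed to expert i.
    f = [0.0] * n_experts
    for i in assignments:
        f[i] += 1 / n_tokens
    # P_i: mean router probability assigned to expert i.
    P = [sum(p[i] for p in router_probs) / n_tokens for i in range(n_experts)]
    return alpha * n_experts * sum(fi * Pi for fi, Pi in zip(f, P))


N = 8

# Balanced toy batch: 8 tokens, one per expert, uniform router probabilities.
balanced = aux_loss(list(range(N)), [[1 / N] * N for _ in range(N)], N)

# Collapsed toy batch: every token goes to expert 4 with probability 0.93.
peaked = [0.01] * N
peaked[4] = 0.93
collapsed = aux_loss([4] * N, [peaked for _ in range(N)], N)

print(f"balanced:  {balanced:.4f}")
print(f"collapsed: {collapsed:.4f}")
```

The collapsed batch scores several times higher than the balanced one, so gradient descent on this term nudges the router away from sending everything to one expert.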

All 8 experts learn distinct specialties · model behaves as designed · ~95% of capacity utilized

Collapsed routing

// what happens without aux loss

The collapse: E5 received 64% of tokens. E1, E3, and E7 each received less than 2%. You effectively trained one expert at the cost of 8: same compute, far worse model.

Balanced routing

// what happens with aux loss

Healthy specialization: Every expert handles 10-15% of tokens. Each learns a distinct specialty. All 8 experts contribute meaningfully, making full use of the sparse MoE's capacity.

Part 05 · MoE in the wild

The models you use
are already doing this.

Sparse MoE went from research idea in 2017 to production reality by 2023. Today, many of the most efficient frontier models are sparse.

Mistral · 2023

Mixtral 8x7B

// the open-source landmark

Total params: 47B

Active per token: ~13B

Experts: 8 per layer

Top-k: k=2

Inference cost: like 13B

Matched or beat Llama 2 70B on most benchmarks with roughly 6× faster inference, according to Mistral's announcement. Open-weight, downloadable. Made the case to the open-source world that sparse > dense at the frontier.

DeepSeek · 2024

DeepSeek-V3

// fine-grained MoE at scale

Total params: 671B

Active per token: ~37B

Experts: 256 per layer

Top-k: k=8

Inference cost: like 37B

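Took MoE to a wild extreme: many more, smaller experts. Plus shared "always-on" experts for common patterns. The reported cost of the final training run was about $5.6M in GPU time, a figure that excludes earlier research and ablation runs but is still far below what comparable frontier models are believed to cost.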

OpenAI · 2023 (rumored)

GPT-4

// the suspected architecture

Total params (est): ~1.8T

Active per token (est): ~280B

Experts (est): 16 per layer

Top-k (est): k=2

Inference cost: like ~280B

OpenAI has never confirmed it, but multiple leaks point to a 16-expert MoE. If they are accurate, that would explain why GPT-4 isn't ~10× slower than GPT-3 despite ~10× more total params: most of them aren't running per token.
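The three cards above can be compared directly by computing the active fraction, active params divided by total params. This short sketch uses only the figures quoted in the cards; the GPT-4 row relies on unconfirmed estimates, and `active_fraction` is a helper name introduced here for illustration.

```python
# (total, active) parameter counts in billions, as quoted in the cards above.
# The GPT-4 entry uses rumored, unconfirmed estimates.
models = {
    "Mixtral 8x7B": (47, 13),
    "DeepSeek-V3": (671, 37),
    "GPT-4 (rumored)": (1800, 280),
}


def active_fraction(total_b, active_b):
    """Share of the model's parameters that run for any single token."""
    return active_b / total_b


for name, (total_b, active_b) in models.items():
    print(f"{name:16s} {active_fraction(total_b, active_b):.1%} of params active per token")
```

The spread is wide: finer-grained designs with many small experts activate a much smaller share of their weights per token than an 8-expert, top-2 layout does.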

Course 3 · Module 04 complete

Now you see why
GPT-4 is so fast.

You watched tokens pick their experts in real time. You felt the dense → sparse architectural leap. You understand that "parameter count" is a misleading number for MoE models: active params per token is what actually matters for inference compute. This is the architectural trick behind many of today's fast frontier models.

Up next · Course 3 · Module 05

Diffusion Models

Time to switch from text to images. How does Stable Diffusion turn pure noise into a photograph? The forward process (slowly add noise to an image), the reverse process (slowly remove it), latent diffusion (do it in compressed space), and classifier-free guidance (how prompts steer the denoising). Interactive: watch noise become an image, step by step.

