AI Skill Course · Course 3 · Expert
Module 04 of 14
Course 3 · Module 04 · 75 minutes

Not one network.
Eight networks.
Most asleep.

In a Mixture of Experts (MoE) model, the feed-forward layer of each transformer block isn't one big network. It's many smaller "expert" networks sitting side by side. A router decides which 1-2 experts each token visits; the rest sit idle. The result: a model with tens or hundreds of billions of parameters, where only a fraction (roughly a quarter for an 8-expert, top-2 model like Mixtral) is used per token. Mixtral and DeepSeek-V3 work this way, and GPT-4 is rumored to.

You'll watch: Tokens get routed live
You'll feel: Sparse activation
You'll see: Why GPT-4 is fast
[Figure: Sparse routing. The token "sat" enters the router, which can choose among experts E1-E8. 8 experts available, only 2 active → 75% of expert params idle for this token. 47B total params, but only ~13B active per token.]
Part 01 · The motivation

More capacity.
Same compute.

Every transformer block has two big sub-blocks: attention and feed-forward (FFN). The FFN is where most parameters live (Module 1, last question). The MoE idea: replace one big FFN with many smaller FFNs ("experts") and a router that picks just 1-2 per token. You get the storage capacity of a much larger model, but pay the per-token compute cost of a small one. (Memory is the catch: every expert still has to be loaded, even the idle ones.)

// Dense model

Every parameter, every token

Standard transformer FFN
[Diagram: token → big FFN · 100% active · ~70B params used]
Param scaling: expensive
Inference cost: scales with all params
Examples: GPT-3, Llama 3 70B
// Sparse · MoE

Only the relevant experts run

Mixture of Experts FFN
[Diagram: token → router → 2 of 8 experts active · ~25% of expert params used]
Param scaling: cheap
Inference cost: scales with active params only
Examples: Mixtral, DeepSeek-V3, GPT-4(?)
Part 02 · Hands on · Live routing

Type a sentence.
Watch each token find its experts.

Below is a simulation of an 8-expert MoE layer. Each expert is colored and tagged by its "specialty." Type or pick a sentence — every token gets scored against all 8 experts; only the top 2 win. Different sentences activate different experts. Click any token to see its full routing scores.

How to read it.

Each token card shows the token at top, then 8 horizontal bars (one per expert). The 2 active experts have full-color bars; others are dim. Below each card: the names + gating weights of the 2 chosen experts. The right panel tracks cumulative utilization — how many tokens went to each expert in this sentence. Try the presets to see how routing patterns shift dramatically across domains.

E1 · Code
E2 · Math
E3 · Punctuation
E4 · Common words
E5 · Narrative
E6 · Technical
E7 · Entities
E8 · Grammar
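
The routing the simulation performs can be approximated offline. This is a minimal sketch, not the page's actual simulator: the router scores here are random stand-ins (a real router learns them), and `route_token` and `EXPERT_NAMES` are hypothetical names chosen for illustration.

```python
import math
import random

EXPERT_NAMES = ["Code", "Math", "Punctuation", "Common words",
                "Narrative", "Technical", "Entities", "Grammar"]

def route_token(scores, k=2):
    """Pick the k highest-scoring experts and softmax their scores into gating weights."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    m = max(scores[i] for i in top)                 # subtract the max for numerical stability
    exps = [math.exp(scores[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

random.seed(0)
tokens = "the cat sat on the mat .".split()
utilization = [0] * len(EXPERT_NAMES)               # tokens sent to each expert
routes = []

for tok in tokens:
    # Assumption: random scores stand in for the learned router output.
    scores = [random.gauss(0.0, 1.0) for _ in EXPERT_NAMES]
    chosen = route_token(scores, k=2)
    routes.append(chosen)
    for i, _ in chosen:
        utilization[i] += 1
    print(tok, "->", ", ".join(f"{EXPERT_NAMES[i]} ({w:.2f})" for i, w in chosen))

print("tokens per expert:", dict(zip(EXPERT_NAMES, utilization)))
```

Each token contributes exactly two counts to the utilization tally, which is what the right-hand panel accumulates per sentence.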
Part 03 · How routing actually works

Three steps.
Two simple layers.

The router itself is shockingly small — just one linear layer with N outputs (one per expert). Here's exactly what happens when a token hits the MoE block.

01

Score every expert

Router linear layer

The token's hidden state h is multiplied by a small router weight matrix W_r to produce N scores (one per expert).

scores = h · W_r
shape: [1 × N]
02

Pick the top-k

Sparse selection

Take the k experts with the highest scores (typically k=2). Set the other scores to -∞. Apply softmax to the remaining k. These become the gating weights.

g = softmax(topk(scores))
only k values non-zero
03

Run + combine

Weighted sum of expert outputs

Run the token through ONLY those k experts. Each expert produces an output. Multiply by gating weights, sum. Done — the other N-k experts never executed.

out = Σ_i g_i · Expert_i(h)
only i ∈ top-k
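
The three steps fit in a few lines of NumPy. This is a sketch under stated assumptions rather than any particular model's implementation: it handles a single token, `moe_forward` and `make_expert` are hypothetical names, and each expert is a tiny two-layer ReLU network with random weights.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_forward(h, W_r, experts, k=2):
    """Route one token through a top-k MoE layer.

    h:       hidden state, shape [d]
    W_r:     router weights, shape [d, N]
    experts: list of N callables, each mapping [d] -> [d]
    """
    scores = h @ W_r                     # step 1: one score per expert, shape [N]
    top = np.argsort(scores)[-k:]        # step 2: indices of the k highest scores
    gates = softmax(scores[top])         #         softmax over only those k scores
    out = np.zeros_like(h)
    for g, i in zip(gates, top):         # step 3: run only the chosen experts
        out += g * experts[i](h)
    return out, top, gates

def make_expert(rng, d, hidden):
    """Hypothetical stand-in expert: a small two-layer ReLU FFN with random weights."""
    W1 = rng.normal(scale=0.1, size=(d, hidden))
    W2 = rng.normal(scale=0.1, size=(hidden, d))
    return lambda x: np.maximum(x @ W1, 0.0) @ W2

rng = np.random.default_rng(0)
d, N = 16, 8
W_r = rng.normal(size=(d, N))
experts = [make_expert(rng, d, 4 * d) for _ in range(N)]
h = rng.normal(size=d)

out, top, gates = moe_forward(h, W_r, experts, k=2)
print("chosen experts:", top, "gates:", gates.round(3))
```

Taking the softmax over just the k selected scores gives the same weights as setting the other N-k scores to -∞ and applying softmax to all N, so the sketch skips the masking step.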
Part 04 · Hands on · The collapse problem

Without load balancing,
one expert wins everything.

The router is trained alongside the rest of the model. Early in training, one expert randomly gets a slight edge. The router sends more tokens to it. That expert gets more gradient signal and gets better. Even more tokens get routed there. Left unchecked, the loop feeds on itself until one or two of the 8 experts handle most tokens and the rest barely train. The whole MoE advantage disappears. The fix: an auxiliary loss that penalizes uneven expert usage.
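
The feedback loop is easy to see in a toy model. This sketch is deliberately crude and is not how real training works: it assumes each expert's router score grows in direct proportion to the share of tokens it receives, and `simulate_collapse`, `edge`, and `lr` are hypothetical names and knobs.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    t = sum(es)
    return [e / t for e in es]

def simulate_collapse(n_experts=8, steps=200, edge=0.01, lr=1.0):
    """Toy rich-get-richer loop: an expert's score rises with its token share."""
    scores = [0.0] * n_experts
    scores[0] = edge                      # one expert starts with a slight edge
    history = []
    for _ in range(steps):
        shares = softmax(scores)          # fraction of tokens each expert receives
        history.append(shares)
        # Assumption: more tokens -> more gradient signal -> higher router score.
        scores = [s + lr * p for s, p in zip(scores, shares)]
    return history

history = simulate_collapse()
for step in (0, 50, 100, 199):
    print(step, [round(p, 3) for p in history[step]])
```

Expert 0 starts with barely more than a one-eighth share, yet the gap compounds every step until it takes nearly all the traffic. Nothing in the loop pushes back, which is the gap the auxiliary loss fills.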

What you're seeing.

Realistic post-training expert utilization across 10,000 tokens. Without the aux loss, one or two experts dominate (collapse). With it, usage spreads roughly evenly. The term added to the training loss:

L_aux = α · N · Σ_i f_i · P_i

where f_i is the fraction of tokens routed to expert i, P_i is the mean router probability for expert i, N is the number of experts, and α is a small weighting coefficient. Minimizing it pushes both f and P toward uniformity.
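
To make the formula concrete, here is a sketch that computes L_aux for a batch of router scores. Several things are assumptions for illustration: the name `load_balancing_loss`, the value α = 0.01, and normalizing f_i over all tokens × k routing slots so the fractions sum to 1. The "collapsed" batch is faked by adding a constant bias toward one expert.

```python
import numpy as np

def load_balancing_loss(router_logits, k=2, alpha=0.01):
    """L_aux = alpha * N * sum_i f_i * P_i for a batch of tokens.

    router_logits: raw router scores, shape [num_tokens, N].
    """
    num_tokens, N = router_logits.shape
    # P_i: mean router probability for expert i (softmax over all N experts).
    z = router_logits - router_logits.max(axis=1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    P = probs.mean(axis=0)
    # f_i: fraction of routing slots assigned to expert i (assumed normalization).
    top = np.argsort(router_logits, axis=1)[:, -k:]
    f = np.bincount(top.ravel(), minlength=N) / (num_tokens * k)
    return alpha * N * float(np.sum(f * P))

rng = np.random.default_rng(0)
balanced_logits = rng.normal(size=(10_000, 8))
collapsed_logits = balanced_logits.copy()
collapsed_logits[:, 4] += 5.0            # fake a router that strongly favors one expert

loss_balanced = load_balancing_loss(balanced_logits)
loss_collapsed = load_balancing_loss(collapsed_logits)
print(f"balanced:  {loss_balanced:.4f}")
print(f"collapsed: {loss_collapsed:.4f}")
```

With perfectly uniform routing, f_i = P_i = 1/N and the loss equals α. The balanced batch lands close to that floor, while the biased batch scores several times higher.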

All 8 experts learn distinct specialties · model behaves as designed · ~95% of capacity utilized
Collapsed routing
// what happens without aux loss
The collapse: E5 received 64% of tokens. E1, E3, E7 received less than 2% each. You effectively trained one expert at the cost of 8: same compute, far worse model.
Balanced routing
// what happens with aux loss
Healthy specialization: Every expert handles 10-15% of tokens. Each learns a distinct specialty. All 8 experts contribute meaningfully, giving full sparse-MoE capacity utilization.
Part 05 · MoE in the wild

The models you use
are already doing this.

Sparse MoE went from research idea in 2017 to production reality by 2023. Today, many of the most efficient frontier models are sparse.

Mistral · 2023

Mixtral 8x7B

// the open-source landmark
Total params: 47B
Active per token: ~13B
Experts: 8 per layer
Top-k: k=2
Inference cost: like 13B

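Matched or beat Llama 2 70B on most benchmarks, with Mistral reporting roughly 6× faster inference. Open-weight, downloadable. Made the case to the open-source world that sparse can beat dense at the frontier. One caveat: "like 13B" describes compute per token; all 47B parameters still have to fit in memory.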
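
The 47B and 13B figures can be roughly reproduced with back-of-the-envelope arithmetic. The dimensions below are assumptions taken from Mixtral's publicly released config, and the count ignores small terms such as normalization weights, so treat the result as approximate.

```python
# Assumed Mixtral 8x7B dimensions (from the public config; approximate).
d_model, d_ff, n_layers = 4096, 14336, 32
n_experts, top_k = 8, 2
n_heads, n_kv_heads, vocab = 32, 8, 32000

head_dim = d_model // n_heads
expert = 3 * d_model * d_ff                      # gated FFN: three weight matrices per expert
router = d_model * n_experts                     # one linear layer, N outputs
attention = (2 * d_model * d_model               # query and output projections
             + 2 * d_model * n_kv_heads * head_dim)  # key and value projections (grouped-query)
embeddings = 2 * vocab * d_model                 # input embedding + output head

shared = n_layers * (attention + router) + embeddings   # used by every token
total = shared + n_layers * n_experts * expert
active = shared + n_layers * top_k * expert

print(f"total  ~ {total / 1e9:.1f}B")
print(f"active ~ {active / 1e9:.1f}B ({active / total:.0%} of total)")
```

This comes out to about 46.7B total and 12.9B active, or 28% of the total. The active share is above 2/8 = 25% because attention and embeddings are not split into experts and run for every token.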

DeepSeek · 2024

DeepSeek-V3

// fine-grained MoE at scale
Total params: 671B
Active per token: ~37B
Experts: 256 per layer
Top-k: k=8
Inference cost: like 37B

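Took MoE to a wild extreme: many more, much smaller experts, plus shared "always-on" experts for common patterns. DeepSeek reported about $5.6M in GPU cost for the final training run (excluding earlier research and experiments), far below typical frontier-model budgets.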

OpenAI · 2023 (rumored)

GPT-4

// the suspected architecture
Total params (est): ~1.8T
Active per token (est): ~280B
Experts (est): 16 per layer
Top-k (est): k=2
Inference cost: like ~280B

OpenAI has never confirmed this, but multiple leaks point to a 16-expert MoE. If true, it would explain why GPT-4 isn't ~10× slower than GPT-3 despite ~10× more total params: most aren't running per token.
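
Putting the three cards side by side shows how sparsity varies between models. The numbers are copied from the cards above, and the GPT-4 row is an unconfirmed estimate; `active_fraction` is just an illustrative name.

```python
# (total params, active params per token) in billions, as quoted in the cards above.
# The GPT-4 entry is an unconfirmed estimate.
models = {
    "Mixtral 8x7B": (47, 13),
    "DeepSeek-V3": (671, 37),
    "GPT-4 (est)": (1800, 280),
}

active_fraction = {name: active / total for name, (total, active) in models.items()}

for name, frac in active_fraction.items():
    print(f"{name:14} {frac:.1%} of parameters active per token")
```

DeepSeek-V3's fine-grained design stands out: only about 5.5% of its parameters run per token, compared with roughly 28% for Mixtral.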

Part 06 · Knowledge check

Five questions on what
you just routed.

Aim for 4/5. Wrong answers explain themselves.

Course 3 · Module 04 complete

Now you see why
GPT-4 is so fast.

You watched tokens pick their experts in real time. You felt the dense → sparse architectural leap. You understand that "parameter count" is a misleading number for MoE models: active params per token is what actually drives inference compute. This is the architectural trick behind many of today's fast frontier models.

Up next · Course 3 · Module 05

Diffusion Models

Time to switch from text to images. How does Stable Diffusion turn pure noise into a photograph? The forward process (slowly add noise to an image), the reverse process (slowly remove it), latent diffusion (do it in compressed space), and classifier-free guidance (how prompts steer the denoising). Interactive: watch noise become an image, step by step.
