In a Mixture of Experts (MoE) model, the feed-forward layer of each transformer block isn't one big network; it's many smaller "expert" networks sitting side by side. A router decides which 1-2 experts each token visits, and the rest sit idle for that token. The result: a model with hundreds of billions of parameters, of which only a fraction (on the order of 20%, less in some designs) is used per token. Mixtral and DeepSeek-V3 work this way, and GPT-4 is rumored to as well.
Crack open MoE
Every transformer block has two big sub-blocks: attention and feed-forward (FFN). The FFN is where most parameters live (Module 1, last question). The MoE idea: replace one big FFN with many smaller FFNs ("experts") and a router that picks just 1-2 per token. You get the storage capacity of a much larger model, but pay the inference cost of a small one.
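To make the "large capacity, small cost" claim concrete, here is a rough back-of-the-envelope sketch. It assumes each expert is a plain two-matrix FFN with biases ignored, and the sizes (d_model, d_ff, 8 experts, top 2) are hypothetical values chosen only for illustration, not the configuration of any specific model.

```python
def ffn_params(d_model: int, d_ff: int) -> int:
    """Parameters in one plain two-matrix FFN (biases ignored)."""
    return 2 * d_model * d_ff


def moe_ffn_params(d_model: int, d_ff: int, n_experts: int, top_k: int):
    """Return (total, active-per-token) parameters for one MoE FFN layer."""
    expert = ffn_params(d_model, d_ff)
    router = d_model * n_experts  # one linear layer, one output per expert
    total = n_experts * expert + router
    active = top_k * expert + router  # only the chosen experts do any work
    return total, active


# Hypothetical sizes, for illustration only.
total, active = moe_ffn_params(d_model=4096, d_ff=14336, n_experts=8, top_k=2)
print(f"total params:  {total:,}")
print(f"active params: {active:,}")
print(f"active fraction: {active / total:.1%}")
```

With 8 experts and top-2 routing, about a quarter of this layer's parameters touch any given token, and the router itself is a rounding error next to the experts.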
Below is a simulation of an 8-expert MoE layer. Each expert is colored and tagged by its "specialty." Type or pick a sentence — every token gets scored against all 8 experts; only the top 2 win. Different sentences activate different experts. Click any token to see its full routing scores.
Each token card shows the token at top, then 8 horizontal bars (one per expert). The 2 active experts have full-color bars; others are dim. Below each card: the names + gating weights of the 2 chosen experts. The right panel tracks cumulative utilization — how many tokens went to each expert in this sentence. Try the presets to see how routing patterns shift dramatically across domains.
The router itself is shockingly small — just one linear layer with N outputs (one per expert). Here's exactly what happens when a token hits the MoE block.
The token's hidden state h is multiplied by a small router weight matrix W_r to produce N scores (one per expert).
Take the k experts with the highest scores (typically k=2). Set the other scores to -∞. Apply softmax to the remaining k. These become the gating weights.
Run the token through ONLY those k experts. Each expert produces an output. Multiply by gating weights, sum. Done — the other N-k experts never executed.
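The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration for a single token, not how any production model implements it: the names route_token and moe_forward, the toy router weights, and the toy "experts" (each just scales its input) are all assumptions made up for this example.

```python
import math


def route_token(h, W_r, k=2):
    """Return {expert_index: gating_weight} for one token.

    h   : hidden state, a list of d floats
    W_r : router weights, N rows of d floats (one row per expert)
    """
    # Step 1: one score per expert (a single matrix-vector product).
    scores = [sum(w * x for w, x in zip(row, h)) for row in W_r]
    # Step 2: keep the k highest scores. Dropping the rest is equivalent
    # to setting them to -inf before the softmax.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    m = max(scores[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(scores[i] - m) for i in top}
    z = sum(exps.values())
    return {i: e / z for i, e in exps.items()}


def moe_forward(h, W_r, experts, k=2):
    """Step 3: run only the chosen experts and mix their outputs."""
    gates = route_token(h, W_r, k)
    out = [0.0] * len(h)
    for i, g in gates.items():
        y = experts[i](h)  # the other N-k experts are never called
        out = [o + g * v for o, v in zip(out, y)]
    return out, gates


# Toy setup: 4 experts on a 2-dimensional hidden state (hypothetical values).
W_r = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.5, 0.5]]
experts = [lambda h, s=s: [s * x for x in h] for s in (1.0, 2.0, 3.0, 4.0)]
out, gates = moe_forward([2.0, 1.0], W_r, experts)
print(gates)  # two experts, weights summing to 1
print(out)
```

Real implementations batch this across many tokens and run it on accelerators, but the logic per token is the same: score, keep the top k, softmax, weighted sum.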
The router is trained alongside the rest of the model, which creates a feedback loop. Early in training, one expert randomly gets a slight edge, so the router sends more tokens to it. That expert receives more gradient signal and improves, so even more tokens get routed there. Left unchecked, routing collapses onto one or two of the 8 experts, and the whole MoE advantage disappears. The fix: an auxiliary loss that penalizes uneven expert usage.
Realistic post-training expert utilization across 10,000 tokens. Without aux loss: one or two experts dominate (collapse). With aux loss: usage spreads roughly evenly. The aux loss term added to training: L_aux = α · N · Σ f_i · P_i, where N = number of experts, α = a small weighting coefficient, f_i = fraction of tokens routed to expert i, and P_i = mean router probability for expert i. Because f_i and P_i tend to rise and fall together, the sum is smallest when usage is spread evenly, so minimizing it pushes both toward uniformity.
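Here is a small sketch of that formula on toy data, to show how the loss responds to balance versus collapse. Several things are assumptions for illustration: the function name aux_loss, the value α = 0.01, the hand-written router probabilities, and the choice to compute f_i as a share of routing slots so that it still sums to 1 when each token picks more than one expert.

```python
def aux_loss(router_probs, assignments, n_experts, alpha=0.01):
    """L_aux = alpha * N * sum_i f_i * P_i for one batch of tokens.

    router_probs : per token, a list of N router probabilities (summing to 1)
    assignments  : per token, the list of expert indices it was routed to
    """
    n_tokens = len(router_probs)
    n_slots = sum(len(chosen) for chosen in assignments)
    # f_i: share of routing slots that went to expert i (assumed normalization).
    f = [0.0] * n_experts
    for chosen in assignments:
        for i in chosen:
            f[i] += 1.0 / n_slots
    # P_i: mean router probability assigned to expert i across the batch.
    P = [sum(p[i] for p in router_probs) / n_tokens for i in range(n_experts)]
    return alpha * n_experts * sum(fi * pi for fi, pi in zip(f, P))


N = 4
# Balanced: uniform probabilities, one token per expert.
balanced_probs = [[0.25] * N for _ in range(4)]
balanced_routes = [[0], [1], [2], [3]]
# Collapsed: every token goes to expert 0 with high confidence.
collapsed_probs = [[0.97, 0.01, 0.01, 0.01] for _ in range(4)]
collapsed_routes = [[0], [0], [0], [0]]

balanced = aux_loss(balanced_probs, balanced_routes, N)
collapsed = aux_loss(collapsed_probs, collapsed_routes, N)
print(f"balanced:  {balanced:.4f}")
print(f"collapsed: {collapsed:.4f}")
```

In this toy setup the balanced batch lands at α, while the collapsed batch comes out close to α · N, so the penalty grows as routing concentrates on fewer experts.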
Sparse MoE went from research idea in 2017 to production reality by 2023. Today, many of the most efficient frontier models are sparse.
Mixtral 8x7B matched or beat Llama 2 70B on most benchmarks while running ~5× faster. Open-weight, downloadable. It made the case to the open-source world that sparse can beat dense at the frontier.
DeepSeek-V3 took MoE to an extreme: many more, smaller experts, plus shared "always-on" experts for common patterns. Its reported training cost was ~$6M in GPU compute, far below what comparable frontier models are believed to have cost.
GPT-4: OpenAI has never confirmed it, but multiple leaks point to a 16-expert MoE. That would explain why GPT-4 isn't roughly 10× slower than GPT-3, as a dense model of that size would be, despite a rumored ~10× more total parameters: most of them aren't running for any given token.
You watched tokens pick their experts in real time. You felt the dense → sparse architectural leap. You understand that "parameter count" is a misleading number for MoE models: active parameters per token are what actually determine inference compute. This is the architectural trick behind many of today's fastest frontier models.