
Mixture of Experts

A model architecture where only a subset of experts activate per token, slashing inference cost while preserving quality.


Level 1

Instead of running every parameter for every token, MoE routes each token to a small group of specialist sub-networks called experts. DeepSeek V3 uses 671B total parameters but only activates 37B per token. You get frontier quality at a fraction of the compute cost.

Level 2

In a dense transformer, every parameter fires for every token. In a Mixture-of-Experts transformer, the feedforward layer is replaced by a bank of experts and a router. For each token the router picks the top-k experts (usually 2 out of 64 or more) and only those experts compute. DeepSeek V3 activates 37B of 671B parameters per token, which is why its output is priced around $0.27 per million tokens versus $10 for GPT-5. The trade-offs: a load-balancing loss to keep experts evenly used, routing instability during training, and higher total VRAM at serving time because all experts must be resident even if only a few are used per token.
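The routing step described above can be sketched in a few lines. This is a minimal illustration, not any specific model's router: the gate is a single linear layer, softmax turns its scores into probabilities, and the top-k experts' weights are renormalised. All names and shapes here are made up for the example.

```python
import numpy as np

def top_k_route(hidden, w_gate, k=2):
    """Pick the top-k experts for one token via softmax gating.

    hidden:  (d_model,) token representation
    w_gate:  (d_model, n_experts) router weights
    Illustrative sketch only; real routers add noise, capacity limits, etc.
    """
    logits = hidden @ w_gate                 # one score per expert
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                     # softmax over experts
    top = np.argsort(probs)[-k:][::-1]       # indices of the k best experts
    weights = probs[top] / probs[top].sum()  # renormalise over the selected k
    return top, weights

rng = np.random.default_rng(0)
experts, weights = top_k_route(rng.standard_normal(16),
                               rng.standard_normal((16, 64)), k=2)
print(len(experts), round(float(weights.sum()), 6))  # 2 experts, weights sum to 1
```

Only the two selected experts would then run their forward pass; the other 62 are skipped entirely, which is where the compute saving comes from.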

Level 3

MoE replaces the FFN of each transformer block with N expert MLPs plus a gating network g(x) that produces a sparse top-k distribution. The forward pass becomes y = Σ_{i ∈ TopK(g(x))} g(x)_i · E_i(x). Auxiliary losses keep expert load balanced and stop the router from collapsing onto a handful of experts. Frontier MoE models push expert count to 256+ (DeepSeek V3 uses 256 routed experts + 1 shared expert). The sparsity ratio (active/total params) typically lands between 5% and 15%. Training MoE requires expert parallelism on top of tensor parallelism, and serving benefits from fused kernels like GroupedGEMM. The empirical win is quality recovery relative to a dense model of the same active size: DeepSeek V3 at 37B active matches Llama 3.1 405B dense on most benchmarks.
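The forward pass and the load-balancing auxiliary loss above can be written out as a toy batch implementation. This is a sketch under simplifying assumptions (experts are single linear layers, the aux loss follows the Switch-Transformer-style "fraction of tokens × mean gate probability per expert" form, and the dense loop stands in for real grouped kernels):

```python
import numpy as np

def moe_forward(x, gate_w, expert_ws, k=2):
    """Toy sparse MoE layer: y = sum over top-k of g(x)_i * E_i(x).

    x:         (batch, d) token batch
    gate_w:    (d, n_experts) router weights
    expert_ws: list of (d, d) matrices standing in for expert MLPs
    Returns the mixed outputs and a load-balancing auxiliary loss.
    """
    n_experts = gate_w.shape[1]
    logits = x @ gate_w
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)    # (batch, n_experts) gate probs
    topk = np.argsort(probs, axis=1)[:, -k:]     # top-k expert ids per token

    y = np.zeros_like(x)
    counts = np.zeros(n_experts)
    for t in range(x.shape[0]):
        sel = topk[t]
        w = probs[t, sel] / probs[t, sel].sum()  # renormalised gate weights
        for i, wi in zip(sel, w):
            y[t] += wi * (x[t] @ expert_ws[i])   # only selected experts compute
            counts[i] += 1
    # aux loss: n * sum_i (fraction routed to i) * (mean gate prob of i);
    # minimised when routing is uniform across experts
    f = counts / counts.sum()
    aux_loss = n_experts * float(f @ probs.mean(axis=0))
    return y, aux_loss

rng = np.random.default_rng(1)
x = rng.standard_normal((4, 8))
gate_w = rng.standard_normal((8, 16))
expert_ws = [rng.standard_normal((8, 8)) for _ in range(16)]
y, aux = moe_forward(x, gate_w, expert_ws, k=2)
print(y.shape)  # (4, 8): output has the same shape as the input
```

In a real system the per-token loop is replaced by expert-parallel dispatch and grouped GEMMs, but the math is the same.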

Why this matters now

DeepSeek V3.2, built on MoE, ships tomorrow at roughly 1/30th the price of GPT-5. Every frontier lab is shipping MoE in 2026.

The takeaway for you
If you are a Researcher
  • Sparse activation with gated routing, plus an auxiliary loss for load balancing
  • Scaling laws differ from dense: active params matter for quality, total params matter for breadth
  • See the DeepSeek V3 paper (Dec 2024) and Switch Transformer (2021)
If you are a Builder
  • MoE models route each token to a subset of experts · output behaves like a normal LLM
  • Pay per active parameter, not total · DeepSeek V3 ≈ $0.27/M output
  • All experts must be in VRAM at serve time · requires multi-GPU setups
If you are an Investor
  • MoE crashes inference cost · compute-per-dollar dropping ~10× per year
  • Signals frontier capability at a low price point · disrupts pure API revenue margins
  • Capex story shifts from more GPUs to smarter architectures
If you are a Curious Normie
  • Like a hospital: you see the specialist for your problem, not every doctor
  • Models stay smart while becoming cheap to run
  • DeepSeek is the poster child · GPT-5 quality for pennies
Don't mix them up
Mixture of Experts vs Dense model

Dense = every parameter fires per token. MoE = sparse subset fires per token.

Mixture of Experts vs Ensemble

Ensembles combine independent models at inference. MoE is one model with internal routing during forward pass.

Gecko's take

MoE is why the AI price floor just dropped by 30×. Any model that isn't MoE by end of 2026 will be priced out of the commodity tier.

The price of knowing this term

Choosing DeepSeek V3.2 (MoE) over GPT-5 at 10M output tokens/month saves ~$97/month at the quoted prices ($0.27/M vs $10/M). At 100M tokens/month, ~$973/month.
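The saving is just monthly token volume times the per-million price gap. A one-liner makes the arithmetic checkable (default prices are the $0.27/M and $10/M figures quoted in this card; plug in your own):

```python
def monthly_savings(tokens_m, cheap_per_m=0.27, pricey_per_m=10.0):
    """Monthly dollar saving from the cheaper model.

    tokens_m: output tokens per month, in millions.
    Prices are per million output tokens, taken from the card above.
    """
    return tokens_m * (pricey_per_m - cheap_per_m)

print(round(monthly_savings(10), 2))   # 97.3
print(round(monthly_savings(100), 2))  # 973.0
```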

DeepSeek V3 uses 256 routed experts plus 1 shared expert. Each token activates 8 of the 256 routed experts via top-k gating.
Canonical sources