
Mixture of Experts

A model architecture where only a subset of experts activate per token, slashing inference cost while preserving quality.


Level 1

Instead of running every parameter for every token, MoE routes each token to a small group of specialist sub-networks called experts. DeepSeek V3 uses 671B total parameters but only activates 37B per token. You get frontier quality at a fraction of the compute cost.

Level 2

In a dense transformer, every parameter fires for every token. In a Mixture-of-Experts transformer, the feedforward layer is replaced by a bank of experts and a router. For each token the router picks the top-k experts (usually 2 out of 64 or more) and only those experts compute. DeepSeek V3 activates 37B of 671B parameters per token, which is why its output is priced around $0.27 per million tokens versus $10 for GPT-5. The trade-offs: a load-balancing loss to keep experts evenly used, routing instability during training, and higher total VRAM at serving time because all experts must be resident even if only a few are used per token.
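The routing step described above can be sketched in a few lines. This is a minimal illustration, not any specific model's router: the gate is a single linear layer, softmax turns its scores into probabilities, and the top-k experts' weights are renormalised. All names and shapes here are made up for the example.

```python
import numpy as np

def top_k_route(hidden, w_gate, k=2):
    """Pick the top-k experts for one token via softmax gating.

    hidden:  (d_model,) token representation
    w_gate:  (d_model, n_experts) router weights
    Illustrative sketch only; real routers add noise, capacity limits, etc.
    """
    logits = hidden @ w_gate                 # one score per expert
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                     # softmax over experts
    top = np.argsort(probs)[-k:][::-1]       # indices of the k best experts
    weights = probs[top] / probs[top].sum()  # renormalise over the selected k
    return top, weights

rng = np.random.default_rng(0)
experts, weights = top_k_route(rng.standard_normal(16),
                               rng.standard_normal((16, 64)), k=2)
print(len(experts), round(float(weights.sum()), 6))  # 2 experts, weights sum to 1
```

Only the two selected experts would then run their forward pass; the other 62 are skipped entirely, which is where the compute saving comes from.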

Level 3

MoE replaces the FFN of each transformer block with N expert MLPs plus a gating network g(x) that produces a sparse top-k distribution. The forward pass becomes y = Σ_{i ∈ TopK(g(x))} g(x)_i · E_i(x). Auxiliary losses keep expert load balanced and stop the router from collapsing onto a handful of experts. Frontier MoE models push expert count to 256+ (DeepSeek V3 uses 256 routed experts + 1 shared expert). The sparsity ratio (active/total params) typically lands between 5% and 15%. Training MoE requires expert parallelism on top of tensor parallelism, and serving benefits from fused kernels like GroupedGEMM. The empirical win is quality recovery relative to a dense model of the same active size: DeepSeek V3 at 37B active matches Llama 3.1 405B dense on most benchmarks.
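The forward pass and the load-balancing auxiliary loss above can be written out as a toy batch implementation. This is a sketch under simplifying assumptions (experts are single linear layers, the aux loss follows the Switch-Transformer-style "fraction of tokens × mean gate probability per expert" form, and the dense loop stands in for real grouped kernels):

```python
import numpy as np

def moe_forward(x, gate_w, expert_ws, k=2):
    """Toy sparse MoE layer: y = sum over top-k of g(x)_i * E_i(x).

    x:         (batch, d) token batch
    gate_w:    (d, n_experts) router weights
    expert_ws: list of (d, d) matrices standing in for expert MLPs
    Returns the mixed outputs and a load-balancing auxiliary loss.
    """
    n_experts = gate_w.shape[1]
    logits = x @ gate_w
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)    # (batch, n_experts) gate probs
    topk = np.argsort(probs, axis=1)[:, -k:]     # top-k expert ids per token

    y = np.zeros_like(x)
    counts = np.zeros(n_experts)
    for t in range(x.shape[0]):
        sel = topk[t]
        w = probs[t, sel] / probs[t, sel].sum()  # renormalised gate weights
        for i, wi in zip(sel, w):
            y[t] += wi * (x[t] @ expert_ws[i])   # only selected experts compute
            counts[i] += 1
    # aux loss: n * sum_i (fraction routed to i) * (mean gate prob of i);
    # minimised when routing is uniform across experts
    f = counts / counts.sum()
    aux_loss = n_experts * float(f @ probs.mean(axis=0))
    return y, aux_loss

rng = np.random.default_rng(1)
x = rng.standard_normal((4, 8))
gate_w = rng.standard_normal((8, 16))
expert_ws = [rng.standard_normal((8, 8)) for _ in range(16)]
y, aux = moe_forward(x, gate_w, expert_ws, k=2)
print(y.shape)  # (4, 8): output has the same shape as the input
```

In a real system the per-token loop is replaced by expert-parallel dispatch and grouped GEMMs, but the math is the same.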

Why this matters now

DeepSeek V3.2, built on MoE, ships tomorrow at roughly 1/30th the price of GPT-5. Every frontier lab is shipping MoE in 2026.

The takeaway for you
If you are a Researcher
  • Sparse activation with gated routing, plus an auxiliary loss for load balancing
  • Scaling laws differ from dense: active params matter for quality, total params matter for breadth
  • See the DeepSeek V3 paper (Dec 2024) and Switch Transformer (2021)
If you are a Builder
  • MoE models route each token to a subset of experts · output behaves like a normal LLM
  • Pay per active parameter, not total · DeepSeek V3 ≈ $0.27/M output
  • All experts must be in VRAM at serve time · requires multi-GPU setups
If you are an Investor
  • MoE crashes inference cost · compute-per-dollar dropping ~10× per year
  • Signals frontier capability at a low price point · disrupts pure API revenue margins
  • Capex story shifts from more GPUs to smarter architectures
If you are a Curious Normie
  • Like a hospital: you see the specialist for your problem, not every doctor
  • Models stay smart while becoming cheap to run
  • DeepSeek is the poster child · GPT-5 quality for pennies
Don't mix them up
Mixture of Experts vs Dense model

Dense = every parameter fires per token. MoE = sparse subset fires per token.

Mixture of Experts vs Ensemble

Ensembles combine independent models at inference. MoE is one model with internal routing during forward pass.

Gecko's take

MoE is why the AI price floor just dropped by 30×. Any model that isn't MoE by end of 2026 will be priced out of the commodity tier.

The price of knowing this term

Choosing DeepSeek V3.2 (MoE) over GPT-5 at 10M output tokens/month saves ~$97/month at the quoted prices ($0.27/M vs $10/M). At 100M tokens/month, ~$973/month.
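The saving is just monthly token volume times the per-million price gap. A one-liner makes the arithmetic checkable (default prices are the $0.27/M and $10/M figures quoted in this card; plug in your own):

```python
def monthly_savings(tokens_m, cheap_per_m=0.27, pricey_per_m=10.0):
    """Monthly dollar saving from the cheaper model.

    tokens_m: output tokens per month, in millions.
    Prices are per million output tokens, taken from the card above.
    """
    return tokens_m * (pricey_per_m - cheap_per_m)

print(round(monthly_savings(10), 2))   # 97.3
print(round(monthly_savings(100), 2))  # 973.0
```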

DeepSeek V3 uses 256 routed experts plus 1 shared expert. Each token activates 8 of the 256 routed experts via top-k gating.
Canonical sources