Mamba
Mamba is a state-space model (SSM) architecture that scales linearly with sequence length · 5-10× faster than transformers at long context · powers Jamba, Zamba, Mamba-2.
Basic
Mamba (Albert Gu + Tri Dao, 2023) is a selective state-space model · an alternative to transformer attention. Key property: O(N) compute vs transformer's O(N²). Uses a fixed-size hidden state that "compresses" past tokens, reducing memory. Training throughput similar to transformers; inference throughput much better at long context. Mamba-2 (2024) improved the design further; hybrid models (Jamba, Zamba) combine Mamba + attention layers.
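The fixed-size-state idea can be sketched in a few lines of numpy · a toy, non-selective linear SSM, not Mamba's actual kernel (all names and shapes here are illustrative). One constant-size state update per token is what makes the whole pass O(N):

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Run a plain (non-selective) linear SSM over a sequence.

    x: (seq_len, d_in) inputs; A: (d_state, d_state) transition;
    B: (d_state, d_in) input map; C: (d_out, d_state) readout.
    One fixed-size state update per token -> O(seq_len) total,
    vs attention's O(seq_len^2) pairwise comparisons.
    """
    h = np.zeros(A.shape[0])       # fixed-size hidden state
    ys = []
    for x_t in x:                  # single left-to-right pass
        h = A @ h + B @ x_t        # state "compresses" all past tokens
        ys.append(C @ h)
    return np.stack(ys)

rng = np.random.default_rng(0)
seq_len, d_in, d_state, d_out = 16, 4, 8, 4
y = ssm_scan(rng.normal(size=(seq_len, d_in)),
             0.9 * np.eye(d_state),            # slowly decaying memory
             rng.normal(size=(d_state, d_in)),
             rng.normal(size=(d_out, d_state)))
print(y.shape)  # (16, 4)
```

Note the memory cost: however long the sequence, the model only ever carries `d_state` numbers forward, which is also why inference at long context is cheap.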
Deep
Mamba's core innovation is making state-space models content-aware · the state update depends on the input token, so the model can selectively forget or retain information. This fixes a key limitation of earlier SSMs: previous models like S4 were linear time-invariant, so their dynamics could not depend on the input, which cost expressiveness. Mamba models perform competitively with transformers at small scales, but whether a quality gap opens at 70B+ remains unclear. Hybrid models like Jamba (AI21) mix Mamba + transformer layers to get the best of both.
Expert
Mamba kernels require custom CUDA implementations (the selective scan). Tri Dao's FlashAttention-level engineering made Mamba practical · without it, the theoretical speed advantage didn't materialize. Jamba-1.5 demonstrates a Mamba+MoE+transformer hybrid at 398B total parameters (94B active) with strong long-context performance. Mamba-2 introduces structured state-space duality between SSMs and attention, enabling cleaner theoretical analysis and faster implementations. Open research question: do pure Mamba models scale to frontier quality? Current evidence suggests hybrids win.
Mamba-class hybrids (Jamba, Zamba-2, some DeepSeek research) are shipping in 2026 · the first real architectural alternative to pure transformer frontier models.
Depending on why you're here
- Selective state-space model · O(N) compute
- Mamba-2 improves design · hybrids dominate production
- Jamba, Zamba, recent SSM-transformer hybrids
- Long-context efficiency gains at small scale
- Hybrid models easier to deploy than pure Mamba
- Check Jamba or Zamba if serving very long contexts
- First real transformer alternative · architectural risk hedge
- Hybrid models prove commercial viability
- Watch whether pure Mamba scales to frontier
- A newer way to build AI models · faster on long documents
- Some AI companies (AI21's Jamba) use it
- Might replace transformers someday · or not
Mamba is the first real transformer alternative to reach production · hybrid models like Jamba prove the concept even if pure Mamba still hasn't scaled to frontier.