Mamba
Mamba is a state-space model (SSM) architecture that scales linearly with sequence length · 5-10× faster than transformers at long context · powers Jamba, Zamba, Mamba-2.
Basic
Mamba (Albert Gu + Tri Dao, 2023) is a selective state-space model · an alternative to transformer attention. Key property: O(N) compute vs transformer's O(N²). Uses a fixed-size hidden state that "compresses" past tokens, reducing memory. Training throughput similar to transformers; inference throughput much better at long context. Mamba-2 (2024) improved the design further; hybrid models (Jamba, Zamba) combine Mamba + attention layers.
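The fixed-size-state idea can be sketched in a few lines of numpy · a toy, non-selective linear SSM, not Mamba's actual kernel (all names and shapes here are illustrative). One constant-size state update per token is what makes the whole pass O(N):

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Run a plain (non-selective) linear SSM over a sequence.

    x: (seq_len, d_in) inputs; A: (d_state, d_state) transition;
    B: (d_state, d_in) input map; C: (d_out, d_state) readout.
    One fixed-size state update per token -> O(seq_len) total,
    vs attention's O(seq_len^2) pairwise comparisons.
    """
    h = np.zeros(A.shape[0])       # fixed-size hidden state
    ys = []
    for x_t in x:                  # single left-to-right pass
        h = A @ h + B @ x_t        # state "compresses" all past tokens
        ys.append(C @ h)
    return np.stack(ys)

rng = np.random.default_rng(0)
seq_len, d_in, d_state, d_out = 16, 4, 8, 4
y = ssm_scan(rng.normal(size=(seq_len, d_in)),
             0.9 * np.eye(d_state),            # slowly decaying memory
             rng.normal(size=(d_state, d_in)),
             rng.normal(size=(d_out, d_state)))
print(y.shape)  # (16, 4)
```

Note the memory cost: however long the sequence, the model only ever carries `d_state` numbers forward, which is also why inference at long context is cheap.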
Deep
Mamba's core innovation is making state-space models content-aware · the state update depends on the input token, so the model can selectively forget or retain information. This fixes a key limitation of earlier SSMs: previous models like S4 were linear time-invariant, so their dynamics could not depend on the input, which cost expressiveness. Mamba models perform competitively with transformers at small scales, but whether a quality gap opens at 70B+ remains unclear. Hybrid models like Jamba (AI21) mix Mamba + transformer layers to get the best of both.
Expert
Mamba kernels require custom CUDA implementations (the selective scan). Tri Dao's FlashAttention-level engineering made Mamba practical · without it, the theoretical speed advantage didn't materialize. Jamba-1.5 demonstrates a Mamba+MoE+transformer hybrid at 398B total parameters (94B active) with strong long-context performance. Mamba-2 introduces structured state-space duality between SSMs and attention, enabling cleaner theoretical analysis and faster implementations. Open research question: do pure Mamba models scale to frontier quality? Current evidence suggests hybrids win.
Mamba-class hybrids (Jamba, Zamba-2, some DeepSeek research) are shipping in 2026 · the first real architectural alternative to pure transformer frontier models.
Depending on why you're here
- Selective state-space model · O(N) compute
- Mamba-2 improves design · hybrids dominate production
- Jamba, Zamba, recent SSM-transformer hybrids
- Long-context efficiency gains at small scale
- Hybrid models easier to deploy than pure Mamba
- Check Jamba or Zamba if serving very long contexts
- First real transformer alternative · architectural risk hedge
- Hybrid models prove commercial viability
- Watch whether pure Mamba scales to frontier
- A newer way to build AI models · faster on long documents
- Some AI companies (AI21's Jamba) use it
- Might replace transformers someday · or not
Mamba is the first real transformer alternative to reach production · hybrid models like Jamba prove the concept even if pure Mamba still hasn't scaled to frontier.