
Mamba

Mamba is a state-space model (SSM) architecture that scales linearly with sequence length · 5-10× faster than transformers at long context · powers Jamba, Zamba, Mamba-2.


Level 1

Mamba (Albert Gu and Tri Dao, 2023) is a selective state-space model · an alternative to transformer attention. Key property: O(N) compute vs the transformer's O(N²). Instead of a KV cache that grows with context, it carries a fixed-size hidden state that "compresses" past tokens, so inference memory stays constant. Training throughput is similar to transformers; inference throughput is much better at long context. Mamba-2 (2024) refined the design further; hybrid models (Jamba, Zamba) interleave Mamba and attention layers.
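The fixed-size-state idea can be sketched as a plain (non-selective) state-space recurrence · illustrative NumPy with made-up shapes and values, not Mamba's actual parameterization:

```python
import numpy as np

def ssm_scan(A, B, C, x):
    """Minimal state-space recurrence: one pass over the sequence,
    carrying only a fixed-size hidden state h — O(N) time and O(1)
    state memory regardless of context length."""
    h = np.zeros(A.shape[0])       # fixed-size state "compressing" the past
    ys = []
    for x_t in x:                  # one step per token: linear in sequence length
        h = A @ h + B * x_t        # state update
        ys.append(C @ h)           # output readout
    return np.array(ys)

# Toy 1-D input channel with a 4-dimensional state (illustrative values).
rng = np.random.default_rng(0)
A = 0.9 * np.eye(4)               # decay-style transition
B = rng.standard_normal(4)
C = rng.standard_normal(4)
y = ssm_scan(A, B, C, rng.standard_normal(16))
print(y.shape)  # (16,)
```

Contrast with attention: a transformer would compare each of the 16 tokens against every earlier one, while this loop touches each token exactly once.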

Level 2

Mamba's core innovation is making state-space models content-aware · the state update depends on the input token, so the model can "selectively forget" or retain information. This fixes the key limitation of earlier SSMs: S4 and its relatives were linear time-invariant, with dynamics that could not depend on the input, which cost expressiveness on language tasks. Mamba models perform competitively with transformers at small scales, but whether a quality gap opens at 70B+ remains unclear. Hybrid models like Jamba (AI21) mix Mamba and transformer layers to get the best of both.
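The content-aware update can be sketched like this · the weight names, 1-D shapes, and values are illustrative assumptions, but the structure (per-token step size, input-dependent write/read projections, exponential discretized decay) mirrors the selective-SSM idea:

```python
import numpy as np

def selective_scan(x, W_B, W_C, W_dt, A):
    """Sketch of Mamba-style selectivity: unlike S4's fixed (time-invariant)
    dynamics, B_t, C_t and the step size dt_t are computed FROM the current
    input, so the model can gate how much of each token enters or leaves the
    state. Parameter names are hypothetical, not Mamba's actual ones."""
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:
        dt = np.log1p(np.exp(W_dt * x_t))   # softplus: positive per-token step size
        A_bar = np.exp(dt * A)              # input-dependent discretized decay
        B_t = W_B * x_t                     # input-dependent write gate
        C_t = W_C * x_t                     # input-dependent read gate
        h = A_bar * h + dt * B_t * x_t      # "selectively forget" or retain
        ys.append(float(C_t @ h))
    return np.array(ys)

rng = np.random.default_rng(1)
out = selective_scan(rng.standard_normal(8), rng.standard_normal(4),
                     rng.standard_normal(4), 0.5, -np.ones(4))
print(out.shape)  # (8,)
```

When dt is near zero the state barely moves (the token is ignored); when dt is large the old state decays fast and the new token dominates · that is the "selective forgetting" in code form.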

Level 3

Mamba kernels require custom CUDA implementations (the selective scan). Tri Dao's FlashAttention-level kernel engineering is what made Mamba practical · without it, the theoretical speed advantage didn't materialize in wall-clock time. Jamba-1.5 Large demonstrates a Mamba+MoE+transformer hybrid at 398B total parameters (94B active) with strong long-context performance. Mamba-2 introduces the structured state-space duality (SSD) framework connecting SSMs and attention, enabling cleaner theoretical analysis and faster implementations. Open research question: do pure Mamba models scale to frontier quality? Current evidence favors hybrids.
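Why the recurrence can be parallelized at all · each step h_t = a_t·h_{t-1} + b_t is an affine map, and composing affine maps is associative, which is what lets a selective-scan kernel use a log-depth parallel scan instead of a strictly sequential loop. A scalar sketch (toy values, not the CUDA kernel):

```python
import numpy as np

def combine(e1, e2):
    """Associative combine for the recurrence h_t = a_t*h_{t-1} + b_t.
    Each element (a, b) represents the affine map h -> a*h + b; composing
    two such maps is again affine, so the recurrence is a scan over an
    associative operator and can be evaluated in parallel."""
    a1, b1 = e1
    a2, b2 = e2
    return (a2 * a1, a2 * b1 + b2)

# Sequential evaluation of the recurrence ...
a = np.array([0.9, 0.5, 0.8, 0.7])
b = np.array([1.0, 2.0, 3.0, 4.0])
h = 0.0
for a_t, b_t in zip(a, b):
    h = a_t * h + b_t

# ... matches folding the affine maps pairwise, tree-style:
left  = combine((a[0], b[0]), (a[1], b[1]))
right = combine((a[2], b[2]), (a[3], b[3]))
_, h_scan = combine(left, right)
assert np.isclose(h, h_scan)   # both give 7.5
```

The real kernel does this over large state dimensions in shared memory, fused with the discretization step · the associativity above is the part that makes it legal.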

Why this matters now

Mamba-class hybrids (Jamba, Zamba-2, some DeepSeek research) are shipping in 2026 · the first real architectural alternative to pure transformer frontier models.

The takeaway for you
If you are a Researcher
  • Selective state-space model · O(N) compute
  • Mamba-2 improves the design · hybrids dominate production
  • Jamba, Zamba, recent SSM-transformer hybrids
If you are a Builder
  • Long-context efficiency gains at small scale
  • Hybrid models are easier to deploy than pure Mamba
  • Check Jamba or Zamba if serving very long contexts
If you are an Investor
  • First real transformer alternative · an architectural risk hedge
  • Hybrid models prove commercial viability
  • Watch whether pure Mamba scales to frontier
If you are a Curious Normie
  • A newer way to build AI models · faster on long documents
  • Some AI companies (e.g. AI21 with Jamba) use it
  • Might replace transformers someday · or not
Gecko's take

Mamba is the first real transformer alternative to reach production · hybrid models like Jamba prove the concept even if pure Mamba still hasn't scaled to frontier.

The short version: Mamba is linear-time (faster at long context), the transformer is quadratic. Transformers still dominate frontier quality; Mamba is catching up.