
ALiBi

TL;DR

ALiBi (Attention with Linear Biases) replaces positional embeddings with a linear bias added to attention scores · used in BLOOM, MPT, some open-source models.

Level 1

Traditional transformers use positional embeddings (learned or RoPE). ALiBi skips that and adds a pre-computed linear bias to each attention score based on token distance · farther tokens get a more negative bias. Result: no learned position embeddings, and natural extrapolation to contexts longer than the training length. Used in BLOOM (176B open model), the MPT series, and a few other open models.
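A minimal sketch of the core idea (illustrative names, single head, assumed slope value): the bias depends only on how far apart query and key are, so nothing is learned.

```python
# Distance-based bias: keys farther behind the query get a more
# negative bias added to their raw attention score.
def alibi_bias(query_pos: int, key_pos: int, slope: float = 0.5) -> float:
    # slope is illustrative; real models use head-specific slopes
    return -slope * (query_pos - key_pos)  # 0 for the current token

# Query at position 3 attending to earlier keys:
biases = [alibi_bias(3, j) for j in range(4)]
# nearest key (j=3) gets 0, farthest (j=0) gets -1.5
```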

Level 2

ALiBi bias: attention_score(i,j) = q_i·k_j − m × (i − j) for j ≤ i, where m is a head-specific positive slope. Slopes form a geometric sequence across heads (for 8 heads: 1/2, 1/4, …, 1/256), so heads with small m decay slowly and look far, while heads with large m decay fast and attend locally. Key property: ALiBi models extrapolate · trained on 2K context, they work decently on 8K+ without fine-tuning. This made ALiBi popular with researchers who wanted to test long-context behavior without retraining.
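The per-head slope schedule and bias matrix above can be sketched in a few lines (NumPy, function names are illustrative; slopes follow the geometric sequence from the ALiBi paper):

```python
import numpy as np

def alibi_slopes(n_heads: int) -> np.ndarray:
    # Geometric sequence: for 8 heads, slopes are 1/2, 1/4, ..., 1/256.
    return np.array([2 ** (-8 * (h + 1) / n_heads) for h in range(n_heads)])

def alibi_bias_matrix(n_heads: int, seq_len: int) -> np.ndarray:
    # bias[h, i, j] = -slope[h] * (i - j): zero on the diagonal,
    # increasingly negative the farther back key j is from query i.
    # (Positions with j > i are removed by the causal mask anyway.)
    slopes = alibi_slopes(n_heads)
    dist = np.arange(seq_len)[:, None] - np.arange(seq_len)[None, :]
    return -slopes[:, None, None] * dist

bias = alibi_bias_matrix(n_heads=8, seq_len=4)
# Head 0 (slope 1/2) decays fast -> attends locally;
# head 7 (slope 1/256) decays slowly -> attends far.
# Usage: per head h, softmax((q @ k.T) / sqrt(d) + bias[h])
```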

Level 3

ALiBi vs RoPE: RoPE has become the dominant choice (Llama, Mistral, Qwen, Gemma all use RoPE or variants). RoPE has better expressiveness · ALiBi's linear bias is coarse. But ALiBi's extrapolation property is useful for long-context exploration. Hybrid approaches (RoPE + ALiBi-style bias) appear in some research. BLOOM's 176B release was the most-referenced ALiBi production model; newer work largely moved to RoPE.

The takeaway for you
If you are a
Researcher
  • Bias-based positional encoding · no embeddings
  • Natural context-length extrapolation
  • Used in BLOOM, MPT, early long-context research
If you are a
Builder
  • Largely superseded by RoPE
  • Useful for extrapolation experiments
  • Check if your base model uses ALiBi or RoPE · impacts context scaling
If you are a
Investor
  • Historical · RoPE won the position-encoding war
  • Matters for understanding open-source model lineage
  • Some Chinese research groups still explore ALiBi
If you are a
Curious · Normie
  • A way AI models track word order without a separate index
  • Used in older open-source AI models
  • Newer models use a different method (RoPE)
Gecko's take

ALiBi is the bet that lost to RoPE · it still matters for understanding why long-context models behave differently than you'd expect.

RoPE has won most production deployments. ALiBi's key advantage is extrapolation to longer contexts than trained on.