ALiBi
ALiBi (Attention with Linear Biases) replaces positional embeddings with a linear bias added to attention scores · used in BLOOM, MPT, some open-source models.
Basic
Traditional transformers use positional embeddings (learned or RoPE). ALiBi skips that and adds a pre-computed linear bias to each attention score based on token distance · farther tokens get more negative bias. Result: no learned position embeddings, natural extrapolation to longer contexts than trained on. Used in BLOOM (176B open model), MPT series, and some Mistral derivatives.
Deep
ALiBi bias: attention_score(i,j) = qk(i,j) − m × (i − j) for key position j ≤ query position i, where m is a head-specific positive slope (so the added bias is negative and grows with distance). Each head gets a different m: heads with small m can attend far back, heads with large m focus locally. Key property: ALiBi models extrapolate · trained on 2K context, they work decently on 8K+ without fine-tuning. This made ALiBi popular with researchers who wanted to test long-context behavior without retraining.
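A minimal sketch of the bias computation described above, assuming the paper's convention of per-head slopes forming a geometric sequence starting at 2^(−8/n) for n heads (function names are illustrative, not from any library):

```python
import numpy as np

def alibi_slopes(n_heads: int) -> np.ndarray:
    # Head-specific positive slopes: geometric sequence starting at
    # 2^(-8/n_heads), as in the ALiBi paper (assumes n_heads is a power of 2).
    start = 2 ** (-8 / n_heads)
    return np.array([start ** (h + 1) for h in range(n_heads)])

def alibi_bias(seq_len: int, n_heads: int) -> np.ndarray:
    # Bias added to causal attention scores: -m * (i - j), so farther
    # key positions j get a more negative bias. Shape: (heads, seq, seq).
    slopes = alibi_slopes(n_heads)                      # (n_heads,)
    pos = np.arange(seq_len)
    dist = pos[:, None] - pos[None, :]                  # dist[i, j] = i - j
    return -slopes[:, None, None] * dist[None, :, :]

# Usage: add the bias to raw attention scores before the causal mask
# and softmax, e.g. scores = q @ k.T / np.sqrt(d) + alibi_bias(T, H)[h].
```

Because the bias depends only on relative distance, the same function extends to any sequence length at inference time · nothing positional is learned.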
Expert
ALiBi vs RoPE: RoPE has become the dominant choice (Llama, Mistral, Qwen, Gemma all use RoPE or variants). RoPE has better expressiveness · ALiBi's linear bias is coarse. But ALiBi's extrapolation property is useful for long-context exploration. Hybrid approaches (RoPE + ALiBi-style bias) appear in some research. BLOOM's 176B release was the most-referenced ALiBi production model; newer work largely moved to RoPE.
Depending on why you're here
- Bias-based positional encoding · no embeddings
- Natural context-length extrapolation
- Used in BLOOM, MPT, early long-context research
- Largely superseded by RoPE
- Useful for extrapolation experiments
- Check if your base model uses ALiBi or RoPE · impacts context scaling
- Historical · RoPE won the position-encoding war
- Matters for understanding open-source model lineage
- Some Chinese research groups still explore ALiBi
- A way AI models track word order without a separate index
- Used in older open-source AI models
- Newer models use a different method (RoPE)
ALiBi is the bet that RoPE ultimately won · it still matters for understanding why long-context models behave differently than you'd expect.