ALiBi
ALiBi (Attention with Linear Biases) replaces positional embeddings with a linear bias added to attention scores · used in BLOOM, MPT, some open-source models.
Basic
Traditional transformers use positional embeddings (learned or RoPE). ALiBi skips that and adds a pre-computed linear bias to each attention score based on token distance · farther tokens get more negative bias. Result: no learned position embeddings, natural extrapolation to longer contexts than trained on. Used in BLOOM (176B open model), MPT series, and some Mistral derivatives.
Deep
ALiBi bias: attention_score(i,j) = qk(i,j) − m × (i − j) for key position j ≤ query position i, where m is a head-specific positive slope (so the added bias is negative and grows with distance). Each head gets a different m: heads with small m can attend far back, heads with large m focus locally. Key property: ALiBi models extrapolate · trained on 2K context, they work decently on 8K+ without fine-tuning. This made ALiBi popular with researchers who wanted to test long-context behavior without retraining.
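A minimal sketch of the bias computation described above, assuming the paper's convention of per-head slopes forming a geometric sequence starting at 2^(−8/n) for n heads (function names are illustrative, not from any library):

```python
import numpy as np

def alibi_slopes(n_heads: int) -> np.ndarray:
    # Head-specific positive slopes: geometric sequence starting at
    # 2^(-8/n_heads), as in the ALiBi paper (assumes n_heads is a power of 2).
    start = 2 ** (-8 / n_heads)
    return np.array([start ** (h + 1) for h in range(n_heads)])

def alibi_bias(seq_len: int, n_heads: int) -> np.ndarray:
    # Bias added to causal attention scores: -m * (i - j), so farther
    # key positions j get a more negative bias. Shape: (heads, seq, seq).
    slopes = alibi_slopes(n_heads)                      # (n_heads,)
    pos = np.arange(seq_len)
    dist = pos[:, None] - pos[None, :]                  # dist[i, j] = i - j
    return -slopes[:, None, None] * dist[None, :, :]

# Usage: add the bias to raw attention scores before the causal mask
# and softmax, e.g. scores = q @ k.T / np.sqrt(d) + alibi_bias(T, H)[h].
```

Because the bias depends only on relative distance, the same function extends to any sequence length at inference time · nothing positional is learned.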
Expert
ALiBi vs RoPE: RoPE has become the dominant choice (Llama, Mistral, Qwen, Gemma all use RoPE or variants). RoPE has better expressiveness · ALiBi's linear bias is coarse. But ALiBi's extrapolation property is useful for long-context exploration. Hybrid approaches (RoPE + ALiBi-style bias) appear in some research. BLOOM's 176B release was the most-referenced ALiBi production model; newer work largely moved to RoPE.
Depending on why you're here
- Bias-based positional encoding · no embeddings
- Natural context-length extrapolation
- Used in BLOOM, MPT, early long-context research
- Largely superseded by RoPE
- Useful for extrapolation experiments
- Check if your base model uses ALiBi or RoPE · impacts context scaling
- Historical · RoPE won the position-encoding war
- Matters for understanding open-source model lineage
- Some Chinese research groups still explore ALiBi
- A way AI models track word order without a separate index
- Used in older open-source AI models
- Newer models use a different method (RoPE)
ALiBi is the bet that RoPE ultimately won · it still matters for understanding why long-context models behave differently than you'd expect.