LAMBADA
LAMBADA measures the ability to predict the final word of a passage, a task that requires broad contextual understanding across long text spans.
The Frontier
[Chart: best LAMBADA score over time]
Full rankings
7 models tested · sorted by score
| # | Model | Score |
|---|---|---|
| 1 | Falcon-180B | 79.8 |
| 2 | Llama 2-13B | 76.5 |
| 3 | LLaMA-13B | 75.2 |
| 4 | Baichuan 2-7B | 73.3 |
| 5 | Stable Beluga 2 | 71.3 |
| 6 | | 71.1 |
| 7 | MPT-30B | 70.0 |
Score distribution
[Chart: where model scores cluster]
Correlated benchmarks
Pearson r · original research
Benchmarks that track with LAMBADA
Pearson correlation across models scored on both benchmarks. Values closer to 1 indicate a stronger positive relationship.
Frequently asked
About LAMBADA
What does LAMBADA measure?
LAMBADA measures the ability to predict the final word of a passage, a task that requires broad contextual understanding across long text spans. Seven AI models have been tested on it, with scores ranging from 70.0 to 79.8 out of 100.
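As a rough illustration of final-word prediction scoring, the sketch below computes the percentage of passages whose last word is predicted exactly. The helper names (`split_passage`, `final_word_accuracy`, `predict_last_word`) are hypothetical, and whitespace splitting with case-insensitive matching is an assumption. Real evaluation harnesses may tokenize and match differently, and this is not necessarily how the scores on this page were produced.

```python
from typing import Callable, Iterable, Tuple


def split_passage(passage: str) -> Tuple[str, str]:
    """Split a passage into (context, target), where target is the final word."""
    # Assumption: words are separated by single spaces.
    context, _, target = passage.strip().rpartition(" ")
    return context, target


def final_word_accuracy(
    passages: Iterable[str],
    predict_last_word: Callable[[str], str],
) -> float:
    """Return the percentage of passages whose final word is predicted exactly."""
    total = 0
    correct = 0
    for passage in passages:
        context, target = split_passage(passage)
        total += 1
        # Assumption: a prediction counts only if it matches the target word
        # exactly, ignoring case and surrounding whitespace.
        if predict_last_word(context).strip().lower() == target.lower():
            correct += 1
    return 100.0 * correct / total if total else 0.0
```

Here `predict_last_word` stands in for any model wrapper that takes the context and returns a single word.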
Which model leads on LAMBADA?
Falcon-180B from TII leads LAMBADA with a score of 79.8. The median score across 7 tested models is 73.3.
Is LAMBADA saturated?
No. The top score is 79.8 out of 100 (about 80%), so there is still meaningful room for improvement on LAMBADA.
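The summary figures quoted in these answers (range, median, and the top score as a share of the maximum) can be reproduced from the seven scores in the rankings table. This is a minimal sketch; the variable names are illustrative.

```python
from statistics import median

# Scores copied from the rankings table on this page (7 models, max score 100).
scores = [79.8, 76.5, 75.2, 73.3, 71.3, 71.1, 70.0]
MAX_SCORE = 100

lowest, highest = min(scores), max(scores)
mid = median(scores)                          # middle value of the 7 sorted scores
headroom = MAX_SCORE - highest                # points left before the maximum
top_share = round(100 * highest / MAX_SCORE)  # top score as a rounded percentage

print(f"range: {lowest} to {highest}")
print(f"median: {mid}")
print(f"top score is {top_share}% of max, headroom {headroom:.1f} points")
```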
Does LAMBADA predict performance on other benchmarks?
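Moderately. LAMBADA scores have a Pearson correlation of 0.51 with HellaSwag across 6 shared models, so models that do well on LAMBADA tend to do well on HellaSwag. With only 6 shared models, this relationship should be read as tentative.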
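For readers who want to see how a Pearson correlation like this one is computed, here is a minimal sketch. The function name `pearson_r` is illustrative. The HellaSwag scores of the six shared models are not listed on this page, so the example uses made-up placeholder values rather than real benchmark data.

```python
from math import sqrt
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equal-length lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]  # deviations from the mean
    dy = [y - mean_y for y in ys]
    cov = sum(a * b for a, b in zip(dx, dy))
    var_x = sum(a * a for a in dx)
    var_y = sum(b * b for b in dy)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / sqrt(var_x * var_y)


# Made-up placeholder values, NOT real benchmark scores.
toy_a = [1.0, 2.0, 3.0, 4.0]
toy_b = [2.0, 4.0, 6.0, 8.0]
print(pearson_r(toy_a, toy_b))
```

To apply it to real data, pass each model's LAMBADA score and its score on the other benchmark in the same model order.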
How often is LAMBADA data refreshed?
BenchGecko pulls updates daily. New model scores on LAMBADA appear as soon as they are published by Epoch AI or the model provider.
- Category: Knowledge
- Max score: 100
- Models: 7
- Updated: 2026-05-15
Top on LAMBADA
- Falcon-180B · 79.8
- Llama 2-13B · 76.5
- LLaMA-13B · 75.2
- Baichuan 2-7B · 73.3
- Stable Beluga 2 · 71.3

More knowledge benchmarks
Same category · related evaluations