Balrog
Balrog benchmarks AI agents on text-based adventure games, testing language understanding, strategic planning, and long-horizon reasoning.
The Frontier
[Chart: best score over time]
Full rankings
22 models tested · sorted by score
| # | Model | Score |
|---|---|---|
| 1 | Gemini 3 Flash Preview | 48.1 |
| 2 | Grok 4 | 43.6 |
| 3 | Gemini 2.5 Pro | 43.3 |
| 4 | R1 | 34.9 |
| 5 | Gemini 2.5 Flash | 33.5 |
| 6 | | 32.8 |
| 7 | | 32.6 |
| 8 | | 32.3 |
| 9 | | 32.3 |
| 10 | | 29.5 |
| 11 | | 27.9 |
| 12 | | 27.3 |
| 13 | | 23.0 |
| 14 | | 21.0 |
| 15 | | 19.3 |
| 16 | | 17.6 |
| 17 | | 17.4 |
| 18 | | 17.4 |
| 19 | | 16.2 |
| 20 | | 15.1 |
| 21 | | 14.6 |
| 22 | | 11.6 |
Score distribution
[Chart: where model scores cluster]
Correlated benchmarks
Pearson r · original research
Benchmarks that track with Balrog
Pearson correlation computed across the models that have a score on both benchmarks. Values closer to 1 indicate a stronger positive linear relationship.
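As a rough illustration of the calculation described here, the sketch below computes Pearson r over only the models that appear in both score sets. The function names (`pearson_r`, `shared_model_correlation`) and the model names and scores in the example are hypothetical placeholders, not BenchGecko's code or data.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / sqrt(var_x * var_y)


def shared_model_correlation(scores_a, scores_b):
    """Return (r, n) using only models scored on both benchmarks."""
    shared = sorted(set(scores_a) & set(scores_b))
    xs = [scores_a[m] for m in shared]
    ys = [scores_b[m] for m in shared]
    return pearson_r(xs, ys), len(shared)


# Hypothetical placeholder data: model names and scores are invented.
benchmark_a = {"model-1": 10.0, "model-2": 20.0, "model-3": 30.0, "model-4": 5.0}
benchmark_b = {"model-1": 1000.0, "model-2": 1100.0, "model-3": 1200.0, "model-5": 900.0}

r, n_shared = shared_model_correlation(benchmark_a, benchmark_b)
print(f"r = {r:.2f} across {n_shared} shared models")
```

Restricting the calculation to shared models matters because the two benchmarks rarely cover the same set; with a small overlap, a high r is suggestive rather than conclusive.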
Frequently asked
About Balrog
What does Balrog measure?
Balrog benchmarks AI agents on text-based adventure games, testing language understanding, strategic planning, and long-horizon reasoning. 22 AI models have been tested on it. Scores range from 11.6 to 48.1 out of 100.
Which model leads on Balrog?
Gemini 3 Flash Preview from Google DeepMind leads Balrog with a score of 48.1. The median score across 22 tested models is 27.6.
Is Balrog saturated?
No. The top score is 48.1 out of 100 (48%), so there is still meaningful room for improvement on Balrog.
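The summary figures quoted in these answers (22 models, a range of 11.6 to 48.1, a median of 27.6, and a top score at 48% of the maximum) can be checked against the rankings table. The snippet below is a minimal sketch of that arithmetic; the variable names are illustrative and the score list is copied from the table.

```python
from statistics import median

# Scores copied from the rankings table, in rank order.
scores = [
    48.1, 43.6, 43.3, 34.9, 33.5, 32.8, 32.6, 32.3, 32.3, 29.5, 27.9,
    27.3, 23.0, 21.0, 19.3, 17.6, 17.4, 17.4, 16.2, 15.1, 14.6, 11.6,
]
MAX_SCORE = 100  # maximum score listed for Balrog

model_count = len(scores)
lowest, highest = min(scores), max(scores)

# With an even number of models, the median is the mean of the two
# middle scores (ranks 11 and 12 here).
median_score = round(median(scores), 1)

# Top score as a share of the maximum, and the remaining headroom.
top_share_pct = round(100 * highest / MAX_SCORE)
headroom = round(MAX_SCORE - highest, 1)

print(f"{model_count} models, range {lowest} to {highest}")
print(f"median {median_score}, top score at {top_share_pct}% of max, headroom {headroom}")
```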
Does Balrog predict performance on other benchmarks?
Yes. Balrog scores have a Pearson correlation of 0.92 with Chatbot Arena Elo (Overall) across 13 shared models. Models that do well on Balrog tend to do well on Chatbot Arena Elo (Overall).
How often is Balrog data refreshed?
BenchGecko pulls updates daily. New model scores on Balrog appear as soon as they are published by Epoch AI or the model provider.
- Category: Reasoning
- Max score: 100
- Models: 22
- Updated: 2025-12-17
Top on Balrog
- Gemini 3 Flash Preview · 48.1
- Grok 4 · 43.6
- Gemini 2.5 Pro · 43.3
- R1 · 34.9
- Gemini 2.5 Flash · 33.5
More reasoning benchmarks
Same category · related evaluations