Balrog
Balrog β benchmarks AI agents on text-based adventure games, testing language understanding, strategic planning, and long-horizon reasoning.
Models Tested: 20
Top Score: 48.1
Average Score: 25.1
Rankings
| # | Model | Score |
|---|---|---|
| 1 | | 48.1 |
| 2 | | 43.6 |
| 3 | | 34.9 |
| 4 | | 32.8 |
| 5 | | 32.8 |
| 6 | | 29.5 |
| 7 | | 29.5 |
| 8 | | 27.9 |
| 9 | | 27.3 |
| 10 | | 23.0 |
| 11 | | 23.0 |
| 12 | | 21.0 |
| 13 | | 19.3 |
| 14 | | 17.6 |
| 15 | | 17.4 |
| 16 | | 17.4 |
| 17 | | 16.2 |
| 18 | | 15.1 |
| 19 | | 14.6 |
| 20 | | 11.6 |
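
The summary figures at the top of the page can be reproduced from the Rankings table. The sketch below assumes the average is a plain arithmetic mean rounded to one decimal place; the page does not say how it is computed, and `summarize` is a hypothetical helper name used only for illustration.

```python
# Scores copied from the Rankings table, in rank order.
scores = [48.1, 43.6, 34.9, 32.8, 32.8, 29.5, 29.5, 27.9, 27.3, 23.0,
          23.0, 21.0, 19.3, 17.6, 17.4, 17.4, 16.2, 15.1, 14.6, 11.6]


def summarize(values):
    """Return (count, top score, mean rounded to one decimal place).

    Assumption: the leaderboard average is an unweighted mean.
    """
    return len(values), max(values), round(sum(values) / len(values), 1)


models_tested, top_score, average_score = summarize(scores)
print(models_tested, top_score, average_score)
```

Under that assumption the computed values agree with the figures reported on the page (20 models, top score 48.1, average 25.1).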