Balrog

Balrog benchmarks AI agents on text-based adventure games, testing language understanding, strategic planning, and long-horizon reasoning.

Models Tested: 20
Top Score: 48.1
Average Score: 25.1