Benchmark · Reasoning · Competitive

Balrog

Balrog benchmarks AI agents on text-based adventure games, testing language understanding, strategic planning, and long-horizon reasoning.

Updated 2025-12-17
Models tested: 22
Top score: 48.1 (Gemini 3 Flash Preview)
Median: 27.6 (min 11.6)
Top-5 spread: σ 5.6 (wide open)

Best score over time · one chart, every benchmark

[Chart: Balrog · 17 models · frontier running max. Y-axis: score (0 to 100). X-axis: release date (Jul 24 to Dec 25). Source: benchgecko.ai/benchmark/balrog]
Frontier on Balrog rose from 34.9 to 48.1 in 11 months · +13.2 points · latest leader Gemini 3 Flash Preview from Google DeepMind.
Pink dots = frontier records · 4 total
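The "frontier running max" in the chart can be read as the best score seen so far when models are ordered by release date, with a frontier record being any model that beats every earlier one. Below is a minimal sketch of that reading. The function name `frontier_records`, the tuple layout, and all model names and scores are hypothetical placeholders for illustration, not actual Balrog results, and the site may compute its frontier differently.

```python
from datetime import date

def frontier_records(results):
    """Return the entries that set a new best score, in release order.

    `results` is a list of (release_date, model, score) tuples.
    """
    records = []
    best = float("-inf")
    # Walk models from oldest to newest release, tracking the running max.
    for released, model, score in sorted(results, key=lambda r: r[0]):
        if score > best:
            best = score
            records.append((released, model, score))
    return records

# Hypothetical placeholder data, NOT actual Balrog scores.
results = [
    (date(2024, 7, 1), "model-a", 30.0),
    (date(2024, 11, 1), "model-b", 25.0),
    (date(2025, 4, 1), "model-c", 41.5),
    (date(2025, 8, 1), "model-d", 38.0),
    (date(2025, 12, 1), "model-e", 45.0),
]

records = frontier_records(results)
# Frontier gain: latest record minus first record.
gain = round(records[-1][2] - records[0][2], 1)
```

Applied to the real leaderboard, the same subtraction reproduces the reported gain: 48.1 - 34.9 = 13.2 points.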
Details
Category: Reasoning
Max score: 100
Models: 22
Updated: 2025-12-17

Same category · related evaluations