Which model leads on Balrog?

Gemini 3 Flash Preview from Google DeepMind leads Balrog with a score of 48.1. The median score across 22 tested models is 27.6.

The benchmark is not yet saturated: the top score is 48.1 out of 100 (48%), so there is still meaningful room for improvement on Balrog.

Does Balrog predict performance on other benchmarks?

Yes · Balrog scores have a correlation of 0.92 with Chatbot Arena Elo (Overall) across the 13 models that appear on both leaderboards. Models that do well on Balrog tend to do well on Chatbot Arena Elo (Overall).
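A correlation over shared models can be sketched as follows. This is a minimal illustration that assumes a Pearson coefficient; the page does not say which correlation measure is used, and the function names `pearson` and `correlation_on_shared` are hypothetical. No Chatbot Arena scores are included here because none appear on this page.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def correlation_on_shared(scores_a, scores_b):
    """Correlate two {model: score} dicts using only models present in both.

    Returns (coefficient, number_of_shared_models).
    """
    shared = sorted(set(scores_a) & set(scores_b))
    xs = [scores_a[m] for m in shared]
    ys = [scores_b[m] for m in shared]
    return pearson(xs, ys), len(shared)
```

Restricting the calculation to the intersection of the two model sets is what makes the "13 shared models" count matter: models scored on only one benchmark do not contribute.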

How often is Balrog data refreshed?

BenchGecko pulls updates daily. New model scores on Balrog appear as soon as they are published by Epoch AI or the model provider.

Benchmark · Reasoning · Competitive

Balrog

Name: Balrog Benchmark
Creator: BenchGecko
License: https://creativecommons.org/licenses/by/4.0/

Balrog benchmarks AI agents on text-based adventure games, testing language understanding, strategic planning, and long-horizon reasoning.

Updated 2025-12-17

Models tested

22

Top score

48.1

Gemini 3 Flash Preview

Median

27.6

min 11.6

Top-5 spread

σ 5.6

wide open
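The top-5 spread can be rechecked from the five highest scores in the rankings. The sketch below assumes σ is the population standard deviation; that assumption reproduces the 5.6 shown here, whereas the sample formula would give 6.2.

```python
from statistics import pstdev

# Five highest Balrog scores from the rankings on this page.
top5 = [48.1, 43.6, 43.3, 34.9, 33.5]

# Population standard deviation (divides by n, not n - 1).
spread = pstdev(top5)
print(round(spread, 1))  # 5.6
```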

The Frontier

Best score over time · one chart, every benchmark

Frontier on Balrog rose from 34.9 to 48.1 in 11 months · +13.2 points · latest leader Gemini 3 Flash Preview from Google DeepMind.

Frontier records · 4 total

Full rankings

22 models tested · sorted by score

#	Model	Provider	Score	Price
1	Gemini 3 Flash Preview	Google DeepMind	48.1	$0.50
2	Grok 4	xAI	43.6	$3.00
3	Gemini 2.5 Pro	Google DeepMind	43.3	$1.25
4	R1	DeepSeek	34.9	$0.70
5	Gemini 2.5 Flash	Google DeepMind	33.5	$0.30
6	GPT-5	OpenAI	32.8	$1.25
7	Claude 3.5 Sonnet	Anthropic	32.6	—
8	GPT-4o (2024-05-13)	OpenAI	32.3	$5.00
9	GPT-4o (2024-11-20)	OpenAI	32.3	$2.50
10	Grok 3	xAI	29.5	$3.00
11	Llama 3.1 70B Instruct	Meta	27.9	$0.40
12	Llama 3.2 90B	Meta	27.3	—
13	Llama 3.3 70B Instruct (free)	Meta	23.0	$0.00
14	Gemini 1.5 Pro (Feb 2024)	Google DeepMind	21.0	—
15	Claude 3.5 Haiku	Anthropic	19.3	$0.80
16	Mistral Nemo	Mistral AI	17.6	$0.02
17	GPT-4o-mini	OpenAI	17.4	$0.15
18	GPT-4o-mini (2024-07-18)	OpenAI	17.4	$0.15
19	Qwen2.5 72B Instruct	Alibaba Qwen	16.2	$0.36
20	Llama 3.1 8B Instruct	Meta	15.1	$0.02
21	Gemini 1.5 Flash (May 2024)	Google DeepMind	14.6	—
22	Phi 4	Microsoft	11.6	$0.07
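The headline figures on this page (models tested, top score, median, minimum) can be rechecked from the Score column of the rankings. A minimal sketch, with the scores copied by hand from the table; the variable names are illustrative only:

```python
from statistics import median

# Score column of the rankings, in rank order.
scores = [
    48.1, 43.6, 43.3, 34.9, 33.5, 32.8, 32.6, 32.3, 32.3, 29.5, 27.9,
    27.3, 23.0, 21.0, 19.3, 17.6, 17.4, 17.4, 16.2, 15.1, 14.6, 11.6,
]

models_tested = len(scores)
top_score = max(scores)
# With an even number of models, the median is the mean of ranks 11 and 12.
median_score = round(median(scores), 1)
min_score = min(scores)

print(models_tested, top_score, median_score, min_score)
```

With 22 models the median falls between ranks 11 and 12 (27.9 and 27.3), which is why it does not match any single model's score.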