#	Model · Provider	Score	Price
1	GPT-5.1 · OpenAI	86.3	$1.25
2	Kimi K2 0711 · Moonshot AI	86.2	$0.57
3	o3 · OpenAI	86.1	$2.00
4	Gemini 3 Pro · Google DeepMind	85.9	n/a
5	Gemini 2.5 Pro · Google DeepMind	85.7	$1.25
6	GPT-5 Chat · OpenAI	85.7	$1.25
7	GPT-5 Mini · OpenAI	85.5	$0.25
8	GPT-4.1 · OpenAI	85.4	$2.00
9	o4 Mini · OpenAI	85.4	$1.10
10	Grok 3 Beta · xAI	84.9	$3.00
11	gpt-oss-120b · OpenAI	84.5	$0.04
12	GPT-4.1 Mini · OpenAI	83.8	$0.40
13	DeepSeek V3 · DeepSeek	83.1	$0.32
14	GPT-4o (2024-11-20) · OpenAI	82.8	$2.50
15	R1 0528 · DeepSeek	82.8	$0.50
16	Gemini 2.5 Flash Lite · Google DeepMind	81.8	$0.10
17	Gemini 2.5 Flash · Google DeepMind	81.7	$0.30
18	Claude 3.7 Sonnet · Anthropic	81.4	$3.00
19	Gemini 1.5 Pro (Feb 2024) · Google DeepMind	81.3	n/a
20	GPT-4.1 Nano · OpenAI	81.1	$0.10
21	Qwen3 Next 80B A3B Thinking · Alibaba Qwen	80.7	$0.10
22	GPT-5 Nano · OpenAI	80.6	$0.05
23	Mistral Large 2411 · Mistral AI	80.1	$2.00
24	Gemini 2.0 Flash · Google DeepMind	80.0	$0.10
25	Grok 4 · xAI	79.7	$3.00
26	Claude 3.5 Sonnet · Anthropic	79.2	n/a
27	Gemini 1.5 Flash (May 2024) · Google DeepMind	79.2	n/a
28	GPT-4o-mini (2024-07-18) · OpenAI	79.1	$0.15
29	Gemini 2.0 Flash Lite · Google DeepMind	79.0	$0.07
30	Mistral Small 3.1 24B · Mistral AI	78.8	$0.35
31	Palmyra X5 · Writer	78.0	$0.60
32	Claude 3.5 Haiku · Anthropic	76.0	$0.80
33	gpt-oss-20b · OpenAI	73.7	$0.03
34	Grok 3 Mini Beta · xAI	65.1	$0.30

Frequently asked questions

Pulled from the HELM · WildBench dataset · updated daily

What does HELM · WildBench measure?

HELM · WildBench is listed as a knowledge benchmark in the BenchGecko catalog. 34 AI models have been tested on it, with scores ranging from 65.1 to 86.3 out of 100.

Which model leads on HELM · WildBench?

GPT-5.1 from OpenAI leads HELM · WildBench with a score of 86.3. The median score across 34 tested models is 81.6.
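
The count, range, and median quoted in these answers can be re-derived from the leaderboard table. The snippet below is a minimal sketch of that arithmetic, not BenchGecko's own code. The `summarize` helper is a hypothetical name, and the half-up rounding to one decimal is an assumption about how the site rounds the median.

```python
from decimal import Decimal, ROUND_HALF_UP
from statistics import median

# Scores copied from the leaderboard table, in rank order.
# Decimal avoids binary floating-point surprises when rounding 81.55.
SCORES = [Decimal(s) for s in (
    "86.3 86.2 86.1 85.9 85.7 85.7 85.5 85.4 85.4 84.9 84.5 83.8 "
    "83.1 82.8 82.8 81.8 81.7 81.4 81.3 81.1 80.7 80.6 80.1 80.0 "
    "79.7 79.2 79.2 79.1 79.0 78.8 78.0 76.0 73.7 65.1"
).split()]


def summarize(scores):
    """Return count, min, max, and median of a list of Decimal scores."""
    # With an even count, median() averages the two middle values.
    mid = median(scores)
    return {
        "count": len(scores),
        "min": min(scores),
        "max": max(scores),
        # Assumption: the published median is rounded half-up to one decimal.
        "median": mid.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP),
    }


summary = summarize(SCORES)
print(summary)
```

With 34 models, the median is the mean of the 17th and 18th scores (81.7 and 81.4), which is 81.55 and rounds to the 81.6 quoted above.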

Is HELM · WildBench saturated?

No. The top score is 86.3 out of 100 (86%), so there is still meaningful room for improvement on HELM · WildBench.
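
A saturation check of this kind reduces to comparing the top score against the maximum. The sketch below illustrates the arithmetic only. The function names and the 95% cutoff are assumptions made for illustration, since the page does not state which threshold BenchGecko applies.

```python
def headroom(top_score: float, max_score: float = 100.0) -> float:
    """Points remaining between the top score and the maximum, to one decimal."""
    return round(max_score - top_score, 1)


def is_saturated(top_score: float, max_score: float = 100.0,
                 threshold: float = 0.95) -> bool:
    """True if the top score reaches `threshold` of the maximum.

    The 0.95 default is a hypothetical cutoff, not a documented one.
    """
    return top_score / max_score >= threshold


top = 86.3  # GPT-5.1, from the leaderboard table
print(headroom(top))      # points left before the ceiling
print(is_saturated(top))  # whether the benchmark counts as saturated
```

Under that assumed cutoff, a top score of 86.3 leaves 13.7 points of headroom and the benchmark is not flagged as saturated.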

Does HELM · WildBench predict performance on other benchmarks?

Yes. HELM · WildBench scores have a correlation of 0.96 with Artificial Analysis · Coding Index across 5 shared models, so models that do well on one tend to do well on the other. Five shared models is a small sample, so treat the figure as indicative.
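
A cross-benchmark correlation like this is computed only over the models that appear on both leaderboards. The sketch below assumes a Pearson correlation, which the page does not confirm. The helper names are hypothetical, and the scores in the usage example are toy placeholders, not real WildBench or Coding Index results.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


def shared_correlation(bench_a, bench_b):
    """Correlate two {model: score} dicts over the models they share.

    Returns (correlation, number_of_shared_models).
    """
    shared = sorted(bench_a.keys() & bench_b.keys())
    r = pearson([bench_a[m] for m in shared], [bench_b[m] for m in shared])
    return r, len(shared)


# Toy placeholder scores for illustration only.
bench_a = {"model_a": 1.0, "model_b": 2.0, "model_c": 3.0}
bench_b = {"model_a": 2.0, "model_b": 4.0, "model_c": 6.0, "model_d": 1.0}
r, n_shared = shared_correlation(bench_a, bench_b)
print(r, n_shared)
```

Returning the number of shared models alongside the coefficient keeps the sample size visible, which matters when the overlap is as small as 5 models.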

How often is HELM · WildBench data refreshed?

BenchGecko pulls updates daily. New model scores on HELM · WildBench appear as soon as they are published by Epoch AI or the model provider.