HELM · WildBench

Updated: 2026-01-21
Models tested: 34
Top score: 86.3 (GPT-5.1)
Median: 81.6 (min 65.1)
Top-5 spread: σ 0.2 (settled)

Best score over time · one chart, every benchmark

[Chart: HELM · WildBench · 30 models · frontier running max. Y-axis: score, 0 to 100, higher is better. X-axis: release date, Jul 24 to Jan 26. Source: benchgecko.ai/benchmark/helm-wildbench]
Frontier on HELM · WildBench rose from 79.1 to 86.3 in 16 months · +7.2 points · latest leader GPT-5.1 from OpenAI.
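The "frontier running max" in the chart above can be sketched as a single pass over models sorted by release date, keeping only the entries that beat every earlier score. This is a minimal sketch, not BenchGecko's actual pipeline: the function name `frontier`, the model names, the dates, and the intermediate scores are all hypothetical. Only the two endpoint scores (79.1 and 86.3) come from the page.

```python
from datetime import date

def frontier(records):
    """Return the (release_date, model, score) entries that set a new best score."""
    best = float("-inf")
    records_set = []
    for released, model, score in sorted(records, key=lambda r: r[0]):
        if score > best:  # strictly better than everything released earlier
            best = score
            records_set.append((released, model, score))
    return records_set

# Hypothetical data: names, dates, and middle scores are placeholders.
records = [
    (date(2024, 7, 1), "model-a", 79.1),
    (date(2024, 9, 1), "model-b", 77.4),
    (date(2025, 1, 1), "model-c", 82.0),
    (date(2025, 6, 1), "model-d", 81.5),
    (date(2025, 11, 1), "model-e", 86.3),
]

steps = frontier(records)
gain = round(steps[-1][2] - steps[0][2], 1)  # frontier gain in points
```

With this placeholder data, `model-b` and `model-d` are dropped because they never exceed the best score seen before them, and `gain` matches the +7.2 points quoted above.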
Pink dots = frontier records · 9 total

Where models cluster

Score distribution (score bucket, 0 to 100 · number of models)

60–70     1
70–80     9
80–90     24

All other buckets (0–60 and 90–100) are empty. Median · 81.6
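As a quick consistency check on the distribution above, the bucket counts can be summed and the median bucket located from cumulative counts. This is a minimal sketch: the counts (1, 9, 24) are read from the chart, while the helper name `median_bucket` is hypothetical and the code assumes both middle ranks fall in the same bucket.

```python
# Bucket counts as read from the distribution chart; empty buckets omitted.
buckets = {(60, 70): 1, (70, 80): 9, (80, 90): 24}

def median_bucket(counts):
    """Return the (low, high) bucket that contains the median rank."""
    total = sum(counts.values())
    target = (total + 1) / 2  # median rank; falls between two ranks when total is even
    cumulative = 0
    for bounds, n in sorted(counts.items()):
        cumulative += n
        if cumulative >= target:
            return bounds
    raise ValueError("no models in any bucket")

total_models = sum(buckets.values())
top_bucket_share = buckets[(80, 90)] / total_models
```

The counts add up to the 34 models reported on the page, and the median bucket is 80–90, which agrees with the stated median of 81.6. About 71% of models sit in that one bucket.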

Pearson r · original research

34 models tested · sorted by score

Pulled from the HELM · WildBench dataset · updated daily

What does HELM · WildBench measure?

HELM · WildBench is a knowledge benchmark in the BenchGecko catalog. 34 AI models have been tested on it. Scores range from 65.1 to 86.3 out of 100.

Which model leads on HELM · WildBench?

GPT-5.1 from OpenAI leads HELM · WildBench with a score of 86.3. The median score across 34 tested models is 81.6.

Is HELM · WildBench saturated?

No · the top score is 86.3 out of 100 (86%), which leaves 13.7 points of headroom on HELM · WildBench. The leaders are tightly clustered, though: the top-5 spread is σ 0.2.

Does HELM · WildBench predict performance on other benchmarks?

Yes, with a caveat · HELM · WildBench scores correlate at r = 0.96 with Artificial Analysis · Coding Index across 5 shared models. Models that do well on HELM · WildBench tend to do well on Artificial Analysis · Coding Index, but with only 5 shared models the estimate is rough.
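To put the 0.96 figure in context, the sketch below computes a Pearson correlation and an approximate 95% confidence interval using the Fisher z-transformation. It assumes the reported value is a Pearson r and that scores are roughly bivariate normal; neither is confirmed by the page beyond the "Pearson r" section heading. The function names `pearson_r` and `fisher_ci` are hypothetical, and only r = 0.96 and n = 5 come from the page.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def fisher_ci(r, n, z_crit=1.96):
    """Approximate confidence interval for r via the Fisher z-transformation."""
    if n <= 3:
        raise ValueError("the Fisher interval needs more than 3 samples")
    z = math.atanh(r)
    half_width = z_crit / math.sqrt(n - 3)  # standard error of z is 1/sqrt(n-3)
    return math.tanh(z - half_width), math.tanh(z + half_width)

# Values from the page: r = 0.96 across n = 5 shared models.
low, high = fisher_ci(0.96, 5)
```

Under these assumptions the interval runs from roughly 0.51 to just under 1.0, so a strong positive relationship is plausible but its exact strength is poorly pinned down by 5 models.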

How often is HELM · WildBench data refreshed?

BenchGecko pulls updates daily. New model scores on HELM · WildBench appear as soon as they are published by Epoch AI or the model provider.
