HELM · WildBench
The Frontier
Best score over time · one chart, every benchmark
Distribution
Where models cluster
Correlated benchmarks
Pearson r · original research
Benchmarks that track with HELM · WildBench
Pearson correlation across models scored on both benchmarks. Values closer to 1 indicate a stronger positive linear relationship between the two sets of scores.
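As a rough illustration of the calculation described above, the sketch below pairs up the models scored on both benchmarks and computes Pearson r from the paired scores. The function names `shared_scores` and `pearson_r`, the model names, and all score values are hypothetical placeholders, not BenchGecko's code or real leaderboard data.

```python
import math


def shared_scores(bench_a, bench_b):
    """Return paired score lists for models present in both benchmark dicts."""
    models = sorted(set(bench_a) & set(bench_b))
    return [bench_a[m] for m in models], [bench_b[m] for m in models]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / math.sqrt(var_x * var_y)


# Placeholder scores for hypothetical models (illustration only).
bench_a = {"model-a": 80.0, "model-b": 85.0, "model-c": 70.0, "model-d": 90.0}
bench_b = {"model-a": 40.0, "model-b": 50.0, "model-c": 25.0, "model-e": 55.0}

xs, ys = shared_scores(bench_a, bench_b)
r = pearson_r(xs, ys)
print(f"{len(xs)} shared models, r = {r:.3f}")
```

Only models present in both dictionaries enter the calculation, so the number of shared models can be much smaller than either leaderboard.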
Full rankings
34 models tested · sorted by score
| # | Model | Score |
|---|---|---|
| 1 | GPT-5.1 | 86.3 |
| 2 | Kimi K2 0711 | 86.2 |
| 3 | o3 | 86.1 |
| 4 | Gemini 3 Pro | 85.9 |
| 5 | Gemini 2.5 Pro | 85.7 |
| 6 | | 85.7 |
| 7 | | 85.5 |
| 8 | | 85.4 |
| 9 | | 85.4 |
| 10 | | 84.9 |
| 11 | | 84.5 |
| 12 | | 83.8 |
| 13 | | 83.1 |
| 14 | | 82.8 |
| 15 | | 82.8 |
| 16 | | 81.8 |
| 17 | | 81.7 |
| 18 | | 81.4 |
| 19 | | 81.3 |
| 20 | | 81.1 |
| 21 | | 80.7 |
| 22 | | 80.6 |
| 23 | | 80.1 |
| 24 | | 80.0 |
| 25 | | 79.7 |
| 26 | | 79.2 |
| 27 | | 79.2 |
| 28 | | 79.1 |
| 29 | | 79.0 |
| 30 | | 78.8 |
| 31 | | 78.0 |
| 32 | | 76.0 |
| 33 | | 73.7 |
| 34 | | 65.1 |
Frequently asked
Pulled from the HELM · WildBench dataset · updated daily
What does HELM · WildBench measure?
HELM · WildBench is a knowledge benchmark in the BenchGecko catalog. 34 AI models have been tested on it. Scores range from 65.1 to 86.3 out of 100.
Which model leads on HELM · WildBench?
GPT-5.1 from OpenAI leads HELM · WildBench with a score of 86.3. The median score across 34 tested models is 81.6.
Is HELM · WildBench saturated?
No · the top score is 86.3 out of 100 (86%). There is still meaningful room for improvement on HELM · WildBench.
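The summary figures quoted in these answers (model count, score range, median, and remaining headroom) can be recomputed from the rankings table. The sketch below is one way to do that, assuming the 34 scores are copied by hand from the table; the variable names are illustrative only.

```python
import statistics

# The 34 scores from the full rankings table, in rank order.
scores = [
    86.3, 86.2, 86.1, 85.9, 85.7, 85.7, 85.5, 85.4, 85.4, 84.9,
    84.5, 83.8, 83.1, 82.8, 82.8, 81.8, 81.7, 81.4, 81.3, 81.1,
    80.7, 80.6, 80.1, 80.0, 79.7, 79.2, 79.2, 79.1, 79.0, 78.8,
    78.0, 76.0, 73.7, 65.1,
]

count = len(scores)
lowest = min(scores)
highest = max(scores)
# With an even number of models, the median is the mean of the two middle scores.
median = statistics.median(scores)
# Headroom: points remaining between the top score and the 100-point ceiling.
headroom = 100 - highest

print(f"models: {count}")
print(f"range: {lowest} to {highest}")
print(f"median: {median:.2f}")
print(f"headroom: {headroom:.1f} points")
```

The two middle scores are 81.7 and 81.4, so the median sits at 81.55, which the page reports rounded to 81.6.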
Does HELM · WildBench predict performance on other benchmarks?
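Yes, with a caveat · HELM · WildBench scores have a Pearson correlation of 0.96 with Artificial Analysis · Coding Index, but only 5 models are scored on both, so the estimate rests on a small sample. Among those shared models, the ones that do well on HELM · WildBench tend to do well on Artificial Analysis · Coding Index.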
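To get a feel for how much weight a correlation over 5 models can carry, one common approach is an approximate confidence interval from the Fisher z-transformation. The sketch below assumes the paired scores are roughly bivariate normal and independent, which may not hold for leaderboard data. The function name `fisher_ci` is hypothetical and this is not part of BenchGecko's published method.

```python
import math


def fisher_ci(r, n, z_crit=1.96):
    """Approximate confidence interval for Pearson r via the Fisher z-transform.

    z_crit=1.96 corresponds to a two-sided 95% interval under normality.
    """
    if n <= 3:
        raise ValueError("the Fisher interval needs more than 3 paired points")
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    z = math.atanh(r)                 # transform r to the z scale
    se = 1.0 / math.sqrt(n - 3)       # standard error on the z scale
    low = math.tanh(z - z_crit * se)  # transform the bounds back to the r scale
    high = math.tanh(z + z_crit * se)
    return low, high


# Values quoted on this page: r = 0.96 across 5 shared models.
low, high = fisher_ci(0.96, 5)
print(f"approximate 95% interval for r: [{low:.3f}, {high:.3f}]")
```

Under these assumptions the interval runs from roughly 0.51 to just under 1, so the data are consistent with anything from a moderate to a near-perfect relationship.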
How often is HELM · WildBench data refreshed?
BenchGecko pulls updates daily. New model scores on HELM · WildBench appear as soon as they are published by Epoch AI or the model provider.
Top on HELM · WildBench
GPT-5.1 · 86.3
Kimi K2 0711 · 86.2
o3 · 86.1
Gemini 3 Pro · 85.9
Gemini 2.5 Pro · 85.7
More knowledge benchmarks
Same category · related evaluations