HELM · GPQA
[Chart: The Frontier · best score over time]
[Chart: Distribution · where models cluster]
Correlated benchmarks
Pearson r · original research
Benchmarks that track with HELM · GPQA
Pearson correlation across models scored on both benchmarks. Values closer to 1 mean scores on one benchmark more strongly predict scores on the other.
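As a rough illustration of how a figure like this can be computed, the sketch below restricts two score tables to the models they share and applies the standard Pearson formula. The model names and scores are hypothetical placeholders, not BenchGecko data, and `pearson_r` is an illustrative helper rather than the site's own implementation.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores (assumption: invented for illustration only).
bench_a = {"model-a": 80.0, "model-b": 70.0, "model-c": 60.0, "model-d": 50.0}
bench_b = {"model-b": 35.0, "model-c": 30.0, "model-d": 25.0, "model-e": 10.0}

# Only models scored on both benchmarks enter the calculation.
shared = sorted(bench_a.keys() & bench_b.keys())
r = pearson_r([bench_a[m] for m in shared], [bench_b[m] for m in shared])
print(f"r = {r:.2f} across {len(shared)} shared models")
```

Because only shared models count, the coefficient can rest on far fewer models than either benchmark's full ranking.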
Full rankings
34 models tested · sorted by score
| # | Model | Score |
|---|---|---|
| 1 | Gemini 3 Pro | 80.3 |
| 2 | | 79.1 |
| 3 | | 75.6 |
| 4 | | 75.3 |
| 5 | | 74.9 |
| 6 | | 73.5 |
| 7 | | 72.6 |
| 8 | | 68.4 |
| 9 | | 67.9 |
| 10 | | 67.5 |
| 11 | | 66.6 |
| 12 | | 66.1 |
| 13 | | 65.9 |
| 14 | | 65.2 |
| 15 | | 65.0 |
| 16 | | 63.0 |
| 17 | | 61.4 |
| 18 | | 60.8 |
| 19 | | 59.4 |
| 20 | | 56.5 |
| 21 | | 55.6 |
| 22 | | 53.8 |
| 23 | | 53.4 |
| 24 | | 52.0 |
| 25 | | 50.7 |
| 26 | | 50.0 |
| 27 | | 44.2 |
| 28 | | 43.7 |
| 29 | | 43.5 |
| 30 | | 39.2 |
| 31 | | 39.0 |
| 32 | | 36.8 |
| 33 | | 36.3 |
| 34 | | 30.9 |
Frequently asked
Pulled from the HELM · GPQA dataset · updated daily
What does HELM · GPQA measure?
HELM · GPQA is a knowledge benchmark in the BenchGecko catalog. 34 AI models have been tested on it. Scores range from 30.9 to 80.3 out of 100.
Which model leads on HELM · GPQA?
Gemini 3 Pro from Google DeepMind leads HELM · GPQA with a score of 80.3. The median score across 34 tested models is 61.1.
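The range and median quoted in these answers can be checked against the Full rankings table. The sketch below copies the 34 scores from that table and recomputes the summary with Python's standard library. It is a verification aid under that assumption, not a description of how BenchGecko derives its numbers.

```python
from statistics import median

# Scores copied from the Full rankings table, rank 1 through rank 34.
scores = [
    80.3, 79.1, 75.6, 75.3, 74.9, 73.5, 72.6, 68.4, 67.9, 67.5,
    66.6, 66.1, 65.9, 65.2, 65.0, 63.0, 61.4, 60.8, 59.4, 56.5,
    55.6, 53.8, 53.4, 52.0, 50.7, 50.0, 44.2, 43.7, 43.5, 39.2,
    39.0, 36.8, 36.3, 30.9,
]

lowest = min(scores)
highest = max(scores)
# With an even count, the median is the mean of the two middle scores
# (ranks 17 and 18); round to one decimal to match the page's precision.
mid = round(median(scores), 1)

print(f"{len(scores)} models · range {lowest} to {highest} · median {mid}")
```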
Is HELM · GPQA saturated?
No · the top score is 80.3 out of 100 (80%). There is still meaningful room for improvement on HELM · GPQA.
Does HELM · GPQA predict performance on other benchmarks?
Yes · HELM · GPQA scores have a Pearson correlation of 0.97 with Cybench across the 5 models scored on both benchmarks. Models that do well on HELM · GPQA tend to do well on Cybench, though 5 shared models is a small sample.
How often is HELM · GPQA data refreshed?
BenchGecko pulls updates daily. New model scores on HELM · GPQA appear as soon as they are published by Epoch AI or the model provider.