Benchmark · Knowledge

HELM · GPQA

Updated: 2026-01-21
Models tested: 34
Top score: 80.3 (Gemini 3 Pro)
Median: 61.1 (min 30.9)
Top-5 spread: σ 2.2 (competitive)

Best score over time · one chart, every benchmark

[Chart: HELM · GPQA · 30 models · frontier running max. Y axis: score, 0 to 100, higher is better. X axis: release date, Jul 24 to Jan 26. Source: benchgecko.ai/benchmark/helm-gpqa]
Frontier on HELM · GPQA rose from 36.8 to 79.1 in 13 months · +42.3 points · latest leader GPT-5 Chat from OpenAI.
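
A "frontier running max" can be sketched as follows: sort results by release date and keep only the entries that beat every earlier score. This is a minimal illustration, not BenchGecko's actual pipeline. The function name `frontier_records` and the model names and dates in `sample` are hypothetical; only the endpoint scores 36.8 and 79.1 come from the summary above.

```python
from datetime import date

def frontier_records(results):
    """Return the (release_date, model, score) entries that set a new best score."""
    records = []
    best = float("-inf")
    # Walk the results in release order and keep each strict improvement.
    for released, model, score in sorted(results, key=lambda r: r[0]):
        if score > best:
            best = score
            records.append((released, model, score))
    return records

# Hypothetical rows for illustration only; not actual leaderboard entries.
sample = [
    (date(2024, 7, 1), "model-a", 36.8),
    (date(2024, 12, 1), "model-b", 52.0),
    (date(2025, 4, 1), "model-c", 49.5),  # below the running max, so not a record
    (date(2025, 8, 1), "model-d", 79.1),
]

records = frontier_records(sample)
# Frontier gain = latest record minus first record.
gain = round(records[-1][2] - records[0][2], 1)
print([model for _, model, _ in records], gain)
```

With these sample rows the gain reproduces the +42.3 points quoted above (79.1 minus 36.8).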
Pink dots mark frontier records · 10 total

Where models cluster

Score distribution · models per score bucket (0 to 100) · median 61.1 · source: benchgecko.ai
0–10: 0
10–20: 0
20–30: 0
30–40: 5
40–50: 3
50–60: 8
60–70: 11
70–80: 6
80–90: 1
90–100: 0
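
As a consistency check on the distribution, the sketch below finds which bucket holds the median rank. It assumes the chart's counts read as 5, 3, 8, 11, 6 and 1 for the buckets 30–40 through 80–90. That reading is an interpretation of the chart labels, chosen because it matches the reported minimum of 30.9 and top score of 80.3. The helper name `median_bucket` is hypothetical.

```python
# Bucket counts as interpreted from the chart labels (an assumption,
# not an official table). Empty buckets are omitted.
counts = {
    (30, 40): 5,
    (40, 50): 3,
    (50, 60): 8,
    (60, 70): 11,
    (70, 80): 6,
    (80, 90): 1,
}

def median_bucket(counts):
    """Return (total, buckets) where buckets hold the median rank(s)."""
    total = sum(counts.values())
    # 1-based ranks that define the median; they coincide when total is odd.
    ranks = {(total + 1) // 2, total // 2 + 1}
    seen = 0
    found = []
    for bucket in sorted(counts):
        before = seen
        seen += counts[bucket]
        # A bucket holds a rank if the rank falls in (before, seen].
        if any(before < rank <= seen for rank in ranks):
            found.append(bucket)
    return total, found

total, buckets = median_bucket(counts)
print(total, buckets)
```

Under this reading the counts sum to 34 models and both median ranks (17 and 18) fall in the 60–70 bucket, which agrees with the reported median of 61.1.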

Pearson r · original research

34 models tested · sorted by score

Pulled from the HELM · GPQA dataset · updated daily

What does HELM · GPQA measure?

HELM · GPQA is a knowledge benchmark in the BenchGecko catalog. 34 AI models have been tested on it. Scores range from 30.9 to 80.3 out of 100.

Which model leads on HELM · GPQA?

Gemini 3 Pro from Google DeepMind leads HELM · GPQA with a score of 80.3. The median score across 34 tested models is 61.1.

Is HELM · GPQA saturated?

No · the top score is 80.3 out of 100, which leaves 19.7 points of headroom on HELM · GPQA.

Does HELM · GPQA predict performance on other benchmarks?

Yes · HELM · GPQA scores correlate strongly with Cybench scores (Pearson r = 0.97) across the 5 models tested on both. Models that do well on HELM · GPQA tend to do well on Cybench, though 5 shared models is a small sample.
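
The correlation quoted above is presumably a Pearson coefficient, since the page labels its correlation section "Pearson r". A minimal sketch of that calculation over models shared by two benchmarks follows. The score lists are hypothetical placeholders, not the 5 actual shared models, and `pearson_r` is an illustrative helper rather than BenchGecko's code.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and of squared deviations around each mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical scores for 5 shared models; placeholders for illustration.
gpqa_scores = [45.0, 52.5, 61.0, 70.2, 79.1]
other_scores = [10.0, 14.0, 15.5, 22.0, 27.0]

r = pearson_r(gpqa_scores, other_scores)
print(round(r, 2))
```

With only 5 points, one model changing rank can move r noticeably, so a value computed this way is better read as a rough signal than as a precise estimate.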

How often is HELM · GPQA data refreshed?

BenchGecko pulls updates daily. New model scores on HELM · GPQA appear as soon as they are published by Epoch AI or the model provider.

Same category · related evaluations