Beta
Benchmark · Knowledge

GPQA

Updated: 2025-07-24
Models tested: 67
Top score: 19.7 (Meta Llama 3 8B)
Median: 4.6 (min 0.5)
Top-5 spread: σ 0.8 (settled)
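
The page does not say how the top-5 spread is computed. A minimal sketch, assuming σ is the population standard deviation of the five highest scores in the ranked list, reproduces the reported 0.8:

```python
from statistics import pstdev

# Five highest GPQA scores from the ranked list on this page.
top5 = [19.7, 19.4, 19.2, 18.3, 17.6]

# Assumption: "spread" is the population standard deviation (divide by n).
spread = pstdev(top5)
print(round(spread, 1))
```

The sample standard deviation (`statistics.stdev`) of the same five values rounds to 0.9, so the population form is the one consistent with the figure shown.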

Best score over time · one chart, every benchmark

[Chart: GPQA frontier running max · 45 models · y-axis: score (0 to 100) · x-axis: release date (Apr 24 to Jul 25) · benchgecko.ai/benchmark/hf-gpqa]
Only 45 models have been tested on GPQA · not enough history to compute a frontier yet.
Pink dots = frontier records · 1 total
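
A frontier running max can be sketched as a cumulative maximum over scores ordered by release date, with a record marked whenever a release strictly raises that maximum. The release dates and scores below are hypothetical and only illustrate the idea; the page does not publish its own method.

```python
from itertools import accumulate

# Hypothetical (release_month, score) pairs, for illustration only.
releases = [
    ("2024-04", 12.0),
    ("2024-06", 9.5),
    ("2024-09", 15.1),
    ("2024-12", 14.0),
    ("2025-03", 18.2),
]

# ISO-style year-month strings sort correctly as plain text.
releases.sort(key=lambda r: r[0])
scores = [score for _, score in releases]

# Running maximum: best score seen up to and including each release.
frontier = list(accumulate(scores, max))

# A release is a frontier record when it strictly beats the previous best.
records = [
    date
    for i, (date, score) in enumerate(releases)
    if i == 0 or score > frontier[i - 1]
]

print(frontier)
print(records)
```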

Where models cluster

[Chart: score distribution · x-axis: score bucket (0 to 100) · y-axis: models · 0–10: 47 models · 10–20: 20 models · median 4.6 · benchgecko.ai]

Pearson r · original research

Correlation analysis

Benchmarks that track with GPQA

Pearson correlation across models scored on both benchmarks. Closer to 1 = strongly predictive.
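
As a rough illustration of the statistic described above, the sketch below computes Pearson r from paired scores. The function name `pearson_r` and the two score lists are hypothetical; they are not BenchGecko data or its implementation.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length, at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)

# Hypothetical scores for the same five models on two benchmarks.
benchmark_a = [2.0, 5.5, 9.3, 14.2, 19.7]
benchmark_b = [21.0, 30.0, 41.5, 52.0, 66.0]

r = pearson_r(benchmark_a, benchmark_b)
print(round(r, 2))
```

Only models scored on both benchmarks enter the calculation, so the number of shared models matters when reading r: with a handful of pairs, a single model can move the value considerably.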

67 models tested · sorted by score

# · Model · Score · Provider
1 · Meta Llama 3 8B · 19.7 · Meta
2 · Qwen2.5 72B Instruct Abliterated · 19.4
3 · Qwen2-72B · 19.2 · Alibaba Qwen
4 · DeepSeek R1 Distill Qwen 14B · 18.3 · DeepSeek
5 · WizardLM-2 8x22B · 17.6 · Microsoft
6 · Gemma 2 27B · 16.7 · Google DeepMind
7 · Qwen2.5 72B Instruct · 16.7 · Alibaba Qwen
8 · Yi 6B · 15.6
9 · Hermes 3 70B Instruct · 14.9 · nousresearch
10 · Gemma 2 9B · 14.8 · Google DeepMind
11 · Llama 3.1 70B Instruct · 14.2 · Meta
12 · Qwen2.5 Coder 32B Instruct · 13.2 · Alibaba Qwen
13 · Dolphin 2.9.1 Yi 1.5 34b · 12.4
14 · Phi 3.5 Mini Instruct · 12.0 · Microsoft
15 · Qwen2.5 32B Instruct · 11.7 · Alibaba
16 · Phi 4 · 11.5 · Microsoft
17 · Phi 3 Mini 4k Instruct · 11.0 · Microsoft
18 · Llama 3.3 70B Instruct · 10.5 · Meta
19 · Qwen2.5 14B Instruct · 10.5 · Alibaba
20 · Magnum v4 72B · 10.4 · anthracite-org
21 · Llama 3.1 8B Instruct · 9.5 · Meta
22 · Qwen2 VL 7B Instruct · 9.3 · Alibaba
23 · GLM 4 32B · 8.8 · z-ai
24 · Stable Beluga 2 · 8.8
25 · Phi 4 Mini Instruct · 7.9 · Microsoft
26 · Qwen2.5 Coder 14B Instruct · 7.3 · Alibaba
27 · Qwen2 7B Instruct · 6.4 · Alibaba
28 · Llama 3.1 405B · 5.9 · Meta
29 · Hermes 2 Pro - Llama-3 8B · 5.7 · nousresearch
30 · Mistral 7B V0.1 · 5.6 · Mistral AI
31 · Qwen2.5 Coder 7B Instruct · 5.6 · Alibaba Qwen
32 · Qwen2.5 7B Instruct · 5.5 · Alibaba Qwen
33 · Gemma 2B · 4.9 · Google DeepMind
34 · R1 Distill Qwen 32B · 4.6 · DeepSeek
35 · DeepSeek R1 Distill Qwen 7B · 3.9 · DeepSeek
36 · Llama 3.2 3B Instruct · 3.8 · Meta
37 · LLaMA-13B · 3.5 · Meta
38 · Mistral 7B Instruct V0.2 · 3.5 · Mistral AI
39 · Gemma 2 2b It · 3.2 · Google DeepMind
40 · StarCoder 2 15B · 3.1
41 · Qwen2.5 3B Instruct · 3.0 · Alibaba
42 · Phi 2 · 2.9 · Microsoft
43 · Falcon-180B · 2.8 · TII
44 · Llama 3.2 1B Instruct · 2.4 · Meta
45 · Llama 3.2 3B Instruct (free) · 2.4 · Meta
46 · Phi-1.5 · 2.4 · Microsoft
47 · Llama 2 7b Hf · 2.2 · Meta
48 · Llama 3 8B Instruct · 2.1 · Meta
49 · R1 Distill Llama 70B · 2.0 · DeepSeek
50 · Gemma 2 2b · 1.7 · Google DeepMind
51 · Gpt2 Medium · 1.7 · OpenAI
52 · Qwen2 1.5B Instruct · 1.6 · Alibaba
53 · Qwen2 0.5B · 1.4 · Alibaba
54 · MPT-30B · 1.3
55 · QwQ 32B · 1.3 · Alibaba Qwen
56 · Distilgpt2 · 1.2
57 · Gpt2 Large · 1.2 · OpenAI
58 · Meta Llama 3 8B Instruct · 1.2 · Meta
59 · Qwen2.5 0.5B Instruct · 1.2 · Alibaba
60 · Gpt2 · 1.1 · OpenAI
61 · Pythia 160m · 1.1 · eleutherai
62 · Vicuna 7b V1.5 · 1.1
63 · DeepSeek R1 Distill Qwen 1.5B · 0.8 · DeepSeek
64 · Qwen2.5 1.5B Instruct · 0.8 · Alibaba
65 · DeepSeek R1 Distill Llama 8B · 0.7 · DeepSeek
66 · Gpt Neo 125m · 0.5 · eleutherai
67 · Llama 2 7b Chat Hf · 0.5 · Meta
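
The headline figures on this page can be cross-checked against the ranked scores. A minimal sketch, with the 67 scores copied from the list above and a bucket width of 10 points assumed from the distribution chart's axis labels:

```python
from collections import Counter
from statistics import median

# GPQA scores in rank order, copied from the ranked list on this page.
scores = [
    19.7, 19.4, 19.2, 18.3, 17.6, 16.7, 16.7, 15.6, 14.9, 14.8,
    14.2, 13.2, 12.4, 12.0, 11.7, 11.5, 11.0, 10.5, 10.5, 10.4,
    9.5, 9.3, 8.8, 8.8, 7.9, 7.3, 6.4, 5.9, 5.7, 5.6,
    5.6, 5.5, 4.9, 4.6, 3.9, 3.8, 3.5, 3.5, 3.2, 3.1,
    3.0, 2.9, 2.8, 2.4, 2.4, 2.4, 2.2, 2.1, 2.0, 1.7,
    1.7, 1.6, 1.4, 1.3, 1.3, 1.2, 1.2, 1.2, 1.2, 1.1,
    1.1, 1.1, 0.8, 0.8, 0.7, 0.5, 0.5,
]

# Assumption: buckets are 10 points wide and keyed by their lower bound.
buckets = Counter(int(s // 10) * 10 for s in scores)

print(len(scores), median(scores), min(scores), max(scores))
for low in sorted(buckets):
    print(f"{low} to {low + 10}: {buckets[low]} models")
```

Run as written, this gives 67 models, a median of 4.6, a range of 0.5 to 19.7, and 47 and 20 models in the two lowest buckets, matching the summary and the distribution chart.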

Pulled from the GPQA dataset · updated daily

What does GPQA measure?

GPQA is a knowledge benchmark in the BenchGecko catalog. 67 AI models have been tested on it. Scores range from 0.5 to 19.7 out of 100.

Which model leads on GPQA?

Meta Llama 3 8B from Meta leads GPQA with a score of 19.7. The median score across 67 tested models is 4.6.

Is GPQA saturated?

No · the top score is 19.7 out of 100 (20%). There is still meaningful room for improvement on GPQA.

Does GPQA predict performance on other benchmarks?

Yes · GPQA scores have a Pearson correlation of 0.97 with CMMLU across 5 shared models. Within that small overlap, models that do well on GPQA tend to do well on CMMLU.

How often is GPQA data refreshed?

BenchGecko pulls updates daily. New model scores on GPQA appear in the next daily pull after they are published by Epoch AI or the model provider.
