GPQA Diamond
Graduate-Level Google-Proof QA (Diamond set) · expert-crafted questions in physics, biology, and chemistry that are difficult even for domain PhDs.
Claude Mythos Preview hits 94.5%. Five models score above 91%. The benchmark is approaching saturation.
Scoring: Exact-match accuracy on 198 multiple-choice questions. Four options per question. No partial credit.
The Frontier
Best score over time · one chart, every benchmark
Full rankings
100 models tested · sorted by score · includes 7 verified scores
Score distribution
Where models cluster
Correlated benchmarks
Pearson r · original research
Benchmarks that track with GPQA Diamond
Pearson correlation across models scored on both benchmarks. Closer to 1 = strongly predictive.
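For readers who want to reproduce a figure like this, here is a minimal sketch in Python, assuming two score tables keyed by model name. The model names and scores below are placeholders, not leaderboard data:

```python
from statistics import correlation  # Pearson r; Python 3.10+

# Placeholder score tables keyed by model name. Real values come from
# the leaderboard, not from this example.
gpqa_diamond = {"model-a": 94.5, "model-b": 94.3, "model-c": 61.0, "model-d": 47.9}
mmlu_pro     = {"model-a": 88.1, "model-b": 87.6, "model-c": 55.2, "model-d": 44.0}

# Only models scored on both benchmarks enter the correlation.
shared = sorted(gpqa_diamond.keys() & mmlu_pro.keys())
x = [gpqa_diamond[m] for m in shared]
y = [mmlu_pro[m] for m in shared]

r = correlation(x, y)  # defaults to Pearson's r
print(f"Pearson r over {len(shared)} shared models: {r:.2f}")
```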
How it works
Evaluation methodology
GPQA Diamond contains 198 PhD-level questions across physics, biology, and chemistry, deliberately crafted to be "Google-proof": domain experts with PhDs answer correctly only about 65% of the time, and skilled non-experts reach roughly 34% even with unrestricted internet access. Each question was validated by multiple experts in the field to ensure both difficulty and correctness. Models are evaluated on exact-match accuracy in a multiple-choice format with four options per question.
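To make the scoring rule concrete, here is a minimal exact-match grader in Python. The record fields (`gold`, `prediction`) are illustrative, not the schema of any particular evaluation harness:

```python
def exact_match_accuracy(records: list[dict]) -> float:
    """Exact-match accuracy with no partial credit: a prediction
    scores 1 only if its normalized letter equals the gold letter."""
    correct = sum(
        1 for r in records
        if r["prediction"].strip().upper() == r["gold"].strip().upper()
    )
    return correct / len(records)

# Two toy records; a full GPQA Diamond run would have 198.
records = [
    {"gold": "B", "prediction": "b"},  # normalization makes this a match
    {"gold": "C", "prediction": "A"},  # wrong option, scores 0
]
print(exact_match_accuracy(records))  # 0.5
```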
Industry relevance
Why teams track this benchmark
GPQA Diamond is the standard gauge of expert-level scientific reasoning. With five frontier models scoring above 91%, it is approaching saturation, signaling the need for harder successors. It remains the go-to benchmark for comparing scientific depth across providers.
Practical takeaways
By role
Product teams · If your product requires expert-level science answers, any model above 91% on GPQA Diamond is likely to perform comparably. Differentiate on speed and cost instead.
Strategy · Saturation above 94% signals diminishing returns on scientific QA alone. The moat is shifting to tool-augmented and multi-modal reasoning.
Researchers · The original paper is open-access on arXiv, and the dataset and evaluation code are freely available. GPQA Diamond is the canonical hard-QA reference for ablation studies.
Frequently asked
About GPQA Diamond
What does GPQA Diamond measure?
Graduate-Level Google-Proof QA (Diamond set) · expert-crafted questions in physics, biology, and chemistry that are difficult even for domain PhDs. 98 AI models have been tested on it. Scores range from 1.3 to 94.5 out of 100.
Which model leads on GPQA Diamond?
Claude Mythos Preview from Anthropic leads GPQA Diamond with a score of 94.5. The median score across 98 tested models is 47.9.
Is GPQA Diamond saturated?
Nearly · the top score is 94.5 out of 100 and five models sit above 91%, so the benchmark is approaching saturation. A few points of headroom remain, but they no longer meaningfully differentiate frontier models.
Does GPQA Diamond predict performance on other benchmarks?
Yes · GPQA Diamond scores correlate 0.98 with OpenCompass · MMLU-Pro across 10 shared models. Models that do well on GPQA Diamond tend to do well on OpenCompass · MMLU-Pro.
How often is GPQA Diamond data refreshed?
BenchGecko pulls updates daily. New model scores on GPQA Diamond appear as soon as they are published by Epoch AI or the model provider.
- Category: Reasoning
- Creator: NYU (Rein et al.)
- Max score: 100
- Dataset: 198 questions
- Modality: Text
- Format: JSON
- License: CC-BY-4.0
- Scoring: Exact-match accuracy on 198 multiple-choice questions. Four options per question. No partial credit.
- Models: 100
- Published: 2023-11-20
- Updated: 2026-04-23
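For hands-on inspection, the dataset can be loaded with the Hugging Face `datasets` library. A sketch under assumptions: the repo id `Idavidrein/gpqa` and config `gpqa_diamond` reflect the public release as of this writing, and access is gated, so verify both before relying on them:

```python
from datasets import load_dataset

# GPQA is distributed as a gated Hugging Face dataset; accept the terms
# on the dataset page and authenticate (e.g. `huggingface-cli login`)
# before loading. Repo id and config name reflect the public release.
ds = load_dataset("Idavidrein/gpqa", "gpqa_diamond", split="train")

print(len(ds))          # expected: 198 questions
print(ds.column_names)  # question text, correct answer, distractors, ...
```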
“GPQA Diamond served its purpose brilliantly, but the ceiling is cracking. When models outscore the PhDs who wrote the questions, we need GPQA Platinum.”
Top on GPQA Diamond
- Claude Mythos Preview · 94.5
- Gemini 3.1 Pro · 94.3
- GPT-5.5 Pro · 94.2
- Claude Opus 4.7 · 94.2
- GPT-5.5 · 93.6

More reasoning benchmarks
Same category · related evaluations