Benchmark · Reasoning · Settled

GPQA diamond

Graduate-Level Google-Proof QA (Diamond set) · expert-crafted questions in physics, biology, and chemistry that are difficult even for domain PhDs.

Updated 2026-04-23

Opus 4.7 hits 94.2%, and eight models now score above 91%. The benchmark is approaching saturation.

Scoring: Exact-match accuracy on 198 multiple-choice questions. Four options per question. No partial credit.
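The scoring rule is simple enough to sketch. A minimal illustration in Python, assuming a toy record format (question id plus gold and predicted answer letters) that is not the benchmark's actual data schema:

```python
# Minimal sketch of exact-match scoring for a 4-option multiple-choice set.
# The record format below (id / gold / pred letters) is an illustrative
# assumption, not GPQA Diamond's real file layout.

def exact_match_accuracy(records):
    """Percentage of questions where the predicted letter equals the gold letter.

    No partial credit: each response is either exactly right or worth zero.
    """
    if not records:
        raise ValueError("no records to score")
    correct = sum(1 for r in records if r["pred"] == r["gold"])
    return 100.0 * correct / len(records)

# Toy example with 4 of the 198 questions: 3 correct -> 75.0%.
sample = [
    {"id": 1, "gold": "A", "pred": "A"},
    {"id": 2, "gold": "C", "pred": "C"},
    {"id": 3, "gold": "B", "pred": "D"},
    {"id": 4, "gold": "D", "pred": "D"},
]
print(round(exact_match_accuracy(sample), 1))  # -> 75.0
```

With four options per question, random guessing sits at 25%, which is why several small models in the table below land near or under that line.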

Models tested: 100
Top score: 94.5 · Claude Mythos Preview
Median: 47.9 · min 1.3
Top-5 spread: σ 0.9
Status: Settled
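The summary statistics in this panel are all straightforward to derive from the score column. A sketch using Python's `statistics` module, with a short made-up score list rather than the full 100-model table (and without claiming to reproduce the exact σ convention the site uses):

```python
import statistics

# Illustrative subset of scores, not the full leaderboard.
scores = [94.5, 93.6, 88.5, 47.9, 32.3, 17.0, 1.3]

top_score = max(scores)
median = statistics.median(scores)
lowest = min(scores)

# Top-5 spread: dispersion of the five best scores. Whether the site uses
# population or sample standard deviation is not stated, so this is a guess.
top5_sigma = statistics.pstdev(sorted(scores, reverse=True)[:5])

print(top_score, median, lowest)  # -> 94.5 47.9 1.3
```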

Best score over time

[Chart: GPQA Diamond · frontier running max · 66 models · score 0–100 · Jun 2024 to Apr 2026 · benchgecko.ai/benchmark/gpqa-diamond]
Frontier on GPQA diamond rose from 34.5 to 94.5 in 21 months · +60.0 points · latest leader Claude Mythos Preview from Anthropic.
Pink dots mark frontier records · 13 total
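The frontier line in the chart is a running maximum over release dates: a model joins the frontier only if it beats every score released before it. A sketch with a handful of scores from the table; the dates for the newer models are illustrative assumptions, and the full 66-model series is not reproduced here:

```python
from datetime import date

# (name, release date, GPQA Diamond score). Scores match the leaderboard;
# dates are approximate or assumed for illustration.
models = [
    ("Llama 3.1 405B",        date(2024, 7, 23), 34.5),
    ("o1-preview",            date(2024, 9, 12), 33.8),
    ("Gemini 2.5 Pro",        date(2025, 3, 25), 80.4),
    ("GPT-5.5 Pro",           date(2026, 1, 15), 94.2),
    ("Claude Mythos Preview", date(2026, 4, 10), 94.5),
]

def frontier(points):
    """Return the subsequence of (name, date, score) entries that set a new record."""
    best, records = float("-inf"), []
    for name, released, score in sorted(points, key=lambda p: p[1]):
        if score > best:
            best = score
            records.append((name, released, score))
    return records

for name, released, score in frontier(models):
    print(f"{released}  {score:5.1f}  {name}")
```

On this toy series, o1-preview scores below the existing record and is skipped, so only four of the five models land on the frontier.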

100 models tested · sorted by score · includes 7 verified scores

# · Model · Creator · Score
1 · Claude Mythos Preview · Anthropic · 94.5
2 · Gemini 3.1 Pro · Google · 94.3 · verified
3 · GPT-5.5 Pro · OpenAI · 94.2
4 · Claude Opus 4.7 · Anthropic · 94.2 · verified
5 · GPT-5.5 · OpenAI · 93.6
6 · GPT-5.4 Pro · OpenAI · 92.8
7 · Gemini 3.1 Pro Preview · Google DeepMind · 92.1
8 · GPT-5.4 · OpenAI · 91.1
9 · Gemini 3 Pro · Google DeepMind · 90.2
10 · GPT-5.2 · OpenAI · 88.5
11 · Claude Opus 4.6 · Anthropic · 87.4
12 · Muse Spark · unknown · 86.4
13 · GLM 5 · z-ai · 83.8
14 · GPT-5.1 · OpenAI · 83.5
15 · Kimi K2.5 · moonshotai · 83.5
16 · Claude Sonnet 4.6 · Anthropic · 83.2
17 · Grok 4 · xAI · 82.7
18 · GPT-5 · OpenAI · 81.6
19 · Claude Opus 4.5 · Anthropic · 81.4
20 · Gemini 2.5 Pro · Google DeepMind · 80.4
21 · Kimi K2 Thinking · moonshotai · 79.0
22 · DeepSeek V3.2 · DeepSeek · 77.9
23 · GLM 4.7 · z-ai · 77.8
24 · Gemini 3 Flash Preview · Google DeepMind · 77.6
25 · Claude Sonnet 4.5 · Anthropic · 76.4
26 · o3 · OpenAI · 75.8
27 · Qwen3 235B A22B Thinking 2507 · Alibaba Qwen · 73.4
28 · Claude 3.7 Sonnet · Anthropic · 73.0
29 · o4 Mini · OpenAI · 72.8
30 · Claude Sonnet 4 · Anthropic · 72.3
31 · Claude Opus 4.1 · Anthropic · 69.7
32 · o3 Mini · OpenAI · 69.4
33 · o1 · OpenAI · 69.0
34 · R1 0528 · DeepSeek · 68.4
35 · Claude Opus 4 · Anthropic · 68.3
36 · Grok 3 Mini · xAI · 68.3
37 · gpt-oss-120b · OpenAI · 67.7
38 · Grok 3 · xAI · 67.7
39 · GPT-5 Mini · OpenAI · 66.7
40 · Qwen3 Max · Alibaba Qwen · 63.5
41 · R1 · DeepSeek · 62.3
42 · Claude Haiku 4.5 · Anthropic · 61.6
43 · Qwen3 235B A22B · Alibaba Qwen · 60.9
44 · GPT-5 Nano · OpenAI · 59.3
45 · GPT-4.5 · OpenAI · 58.3
46 · Llama 4 Maverick · Meta · 56.0
47 · GPT-4.1 · OpenAI · 55.9
48 · GPT-4.1 Mini · OpenAI · 54.5
49 · Gemini 2.0 Pro · Google DeepMind · 54.2
50 · Gemini 2.0 Flash · Google DeepMind · 52.2
51 · o1-mini · OpenAI · 49.8
52 · Mistral Medium 3 · Mistral AI · 46.0
53 · Gemini 2.0 Flash Thinking (Jan 2025) · Google DeepMind · 42.8
54 · DeepSeek V3 · DeepSeek · 42.0
55 · Qwen2.5-Max · Alibaba Qwen · 41.5
56 · Phi 4 · Microsoft · 41.4
57 · Claude 3.5 Sonnet · Anthropic · 38.7
58 · Grok-2 (Dec 2024) · xAI · 38.4
59 · Llama 4 Scout · Meta · 35.8
60 · Mistral Large 2411 · Mistral AI · 35.1
61 · Llama 3.1 405B · Meta · 34.5
62 · o1-preview · OpenAI · 33.8
63 · GPT-4o (2024-08-06) · OpenAI · 32.3
64 · GPT-4o (2024-11-20) · OpenAI · 32.3
65 · Qwen2.5 72B Instruct · Alibaba Qwen · 32.2
66 · Mistral Large 2407 · Mistral AI · 32.0
67 · GPT-4.1 Nano · OpenAI · 31.9
68 · GPT-4o (2024-05-13) · OpenAI · 31.9
69 · Gemma 3 27B · Google DeepMind · 31.8
70 · Gemma 3 27B (free) · Google DeepMind · 31.8
71 · Magistral Small 1.1 · Mistral AI · 31.2
72 · Llama 3.3 70B Instruct (free) · Meta · 29.9
73 · Claude 3 Opus · Anthropic · 29.6
74 · Gemini 1.5 Pro (Feb 2024) · Google DeepMind · 27.8
75 · Llama 3.1 70B Instruct · Meta · 25.6
76 · Llama 3.2 90B · Meta · 21.4
77 · Qwen2-72B · Alibaba Qwen · 21.0
78 · Claude 3 Sonnet · Anthropic · 20.8
79 · Llama 3 70B Instruct · Meta · 20.8
80 · Gemini 1.5 Flash (May 2024) · Google DeepMind · 20.5
81 · Mistral Large · Mistral AI · 18.4
82 · Claude 3.5 Haiku · Anthropic · 17.5
83 · GPT-4o-mini · OpenAI · 17.0
84 · GPT-4o-mini (2024-07-18) · OpenAI · 17.0
85 · Gemma 2 27B · Google DeepMind · 15.3
86 · Claude 3 Haiku · Anthropic · 15.1
87 · GPT-4 (older v0314) · OpenAI · 14.3
88 · Claude 2 · Anthropic · 12.9
89 · Mixtral 8x22B Instruct · Mistral AI · 12.1
90 · Gemini 1.0 Pro · Google DeepMind · 11.9
91 · Claude 2.1 · Anthropic · 10.6
92 · GPT-4 Turbo · OpenAI · 7.5
93 · Mixtral 8x7B Instruct · Mistral AI · 7.5
94 · Mistral Nemo · Mistral AI · 6.5
95 · phi-3-medium 14B · Microsoft · 3.5
96 · Gemma 2 9B · Google DeepMind · 3.3
97 · GPT-3.5 Turbo (older v0613) · OpenAI · 2.9
98 · Llama 2-13B · Meta · 1.8
99 · Llama 3 8B Instruct · Meta · 1.4
100 · Llama 3.1 8B Instruct · Meta · 1.3
Details
Category: Reasoning
Creator: NYU (Rein et al.)
Max score: 100
Dataset: 198 questions
Modality: Text
Format: JSON
License: CC-BY-4.0
Scoring: Exact-match accuracy on 198 multiple-choice questions. Four options per question. No partial credit.
Models: 100
Published: 2023-11-20
Updated: 2026-04-23
Tests: Expert-level science · Physics · Biology · Chemistry · Reasoning under uncertainty
Does not test: Code · Long context · Speed · Tool use · Vision · Creative writing
Gecko's Take

GPQA Diamond served its purpose brilliantly, but the ceiling is cracking. When models outscore the PhDs who wrote the questions, we need GPQA Platinum.
