GPQA Diamond
Graduate-Level Google-Proof QA (Diamond set) · expert-crafted questions in physics, biology, and chemistry that are difficult even for domain PhDs.
Claude Mythos Preview hits 94.5%. Five models score above 91%. The benchmark is approaching saturation.
Scoring: Exact-match accuracy on 198 multiple-choice questions. Four options per question. No partial credit.
The Frontier
Best score over time · one chart, every benchmark
Full rankings
100 models tested · sorted by score · includes 7 verified scores
Score distribution
Where models cluster
Correlated benchmarks
Pearson r · original research
Benchmarks that track with GPQA Diamond
Pearson correlation across models scored on both benchmarks. Closer to 1 = strongly predictive.
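For readers who want to reproduce a figure like this, here is a minimal sketch in Python, assuming two score tables keyed by model name. The model names and scores below are placeholders, not leaderboard data:

```python
from statistics import correlation  # Pearson r; Python 3.10+

# Placeholder score tables keyed by model name. Real values come from
# the leaderboard, not from this example.
gpqa_diamond = {"model-a": 94.5, "model-b": 94.3, "model-c": 61.0, "model-d": 47.9}
mmlu_pro     = {"model-a": 88.1, "model-b": 87.6, "model-c": 55.2, "model-d": 44.0}

# Only models scored on both benchmarks enter the correlation.
shared = sorted(gpqa_diamond.keys() & mmlu_pro.keys())
x = [gpqa_diamond[m] for m in shared]
y = [mmlu_pro[m] for m in shared]

r = correlation(x, y)  # defaults to Pearson's r
print(f"Pearson r over {len(shared)} shared models: {r:.2f}")
```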
How it works
Evaluation methodology
GPQA Diamond contains 198 PhD-level questions across physics, biology, and chemistry, deliberately crafted to be "Google-proof": domain experts with PhDs answer correctly only about 65% of the time, and skilled non-experts reach roughly 34% even with unrestricted internet access. Each question was validated by multiple experts in the field to ensure both difficulty and correctness. Models are evaluated on exact-match accuracy in a multiple-choice format with four options per question.
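To make the scoring rule concrete, here is a minimal exact-match grader in Python. The record fields (`gold`, `prediction`) are illustrative, not the schema of any particular evaluation harness:

```python
def exact_match_accuracy(records: list[dict]) -> float:
    """Exact-match accuracy with no partial credit: a prediction
    scores 1 only if its normalized letter equals the gold letter."""
    correct = sum(
        1 for r in records
        if r["prediction"].strip().upper() == r["gold"].strip().upper()
    )
    return correct / len(records)

# Two toy records; a full GPQA Diamond run would have 198.
records = [
    {"gold": "B", "prediction": "b"},  # normalization makes this a match
    {"gold": "C", "prediction": "A"},  # wrong option, scores 0
]
print(exact_match_accuracy(records))  # 0.5
```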
Industry relevance
Why teams track this benchmark
GPQA Diamond is the standard gauge of expert-level scientific reasoning. With five frontier models scoring above 91%, it is approaching saturation, signaling the need for harder successors. It remains the go-to benchmark for comparing scientific depth across providers.
Practical takeaways
By role
Product teams · If your product requires expert-level science answers, any model above 91% on GPQA Diamond is likely to perform comparably. Differentiate on speed and cost instead.
Strategy · Saturation above 94% signals diminishing returns on scientific QA alone. The moat is shifting to tool-augmented and multi-modal reasoning.
Researchers · The original paper is open-access on arXiv, and the dataset and evaluation code are freely available. GPQA Diamond is the canonical hard-QA reference for ablation studies.
Frequently asked
About GPQA Diamond
What does GPQA Diamond measure?
Graduate-Level Google-Proof QA (Diamond set) · expert-crafted questions in physics, biology, and chemistry that are difficult even for domain PhDs. 98 AI models have been tested on it. Scores range from 1.3 to 94.5 out of 100.
Which model leads on GPQA Diamond?
Claude Mythos Preview from Anthropic leads GPQA Diamond with a score of 94.5. The median score across 98 tested models is 47.9.
Is GPQA Diamond saturated?
Nearly · the top score is 94.5 out of 100 and five models sit above 91%, so the benchmark is approaching saturation. A few points of headroom remain, but they no longer meaningfully differentiate frontier models.
Does GPQA Diamond predict performance on other benchmarks?
Yes · GPQA Diamond scores correlate 0.98 with OpenCompass · MMLU-Pro across 10 shared models. Models that do well on GPQA Diamond tend to do well on OpenCompass · MMLU-Pro.
How often is GPQA Diamond data refreshed?
BenchGecko pulls updates daily. New model scores on GPQA Diamond appear as soon as they are published by Epoch AI or the model provider.
- Category: Reasoning
- Creator: NYU (Rein et al.)
- Max score: 100
- Dataset: 198 questions
- Modality: Text
- Format: JSON
- License: CC-BY-4.0
- Scoring: Exact-match accuracy on 198 multiple-choice questions. Four options per question. No partial credit.
- Models: 100
- Published: 2023-11-20
- Updated: 2026-04-23
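For hands-on inspection, the dataset can be loaded with the Hugging Face `datasets` library. A sketch under assumptions: the repo id `Idavidrein/gpqa` and config `gpqa_diamond` reflect the public release as of this writing, and access is gated, so verify both before relying on them:

```python
from datasets import load_dataset

# GPQA is distributed as a gated Hugging Face dataset; accept the terms
# on the dataset page and authenticate (e.g. `huggingface-cli login`)
# before loading. Repo id and config name reflect the public release.
ds = load_dataset("Idavidrein/gpqa", "gpqa_diamond", split="train")

print(len(ds))          # expected: 198 questions
print(ds.column_names)  # question text, correct answer, distractors, ...
```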
“GPQA Diamond served its purpose brilliantly, but the ceiling is cracking. When models outscore the PhDs who wrote the questions, we need GPQA Platinum.”
Top on GPQA Diamond
- Claude Mythos Preview · 94.5
- Gemini 3.1 Pro · 94.3
- GPT-5.5 Pro · 94.2
- Claude Opus 4.7 · 94.2
- GPT-5.5 · 93.6

More reasoning benchmarks
Same category · related evaluations