HLE
HLE (Humanity's Last Exam) · a reasoning benchmark designed to be the hardest public evaluation of AI. Questions span mathematics, physics, philosophy, and logic · curated to be at or beyond the frontier of human expert capability. Tested with and without tool augmentation. Claude Opus 4.7 scores 46.9% without tools and 54.7% with tools · making it one of the few benchmarks where the top score is below 60%.
Top score is below 60%. Most frontier models score below 55%. Humanity's Last Exam lives up to its name.
Scoring: Accuracy on verified questions. Evaluated in two modes: no-tools (pure reasoning) and with-tools (code execution + search access). Scores reported separately per mode.
The Frontier
Best score over time · one chart, every benchmark
Full rankings
26 models tested · sorted by score · includes 5 verified scores
| # | Model | Score |
|---|---|---|
| 1 | Claude Mythos Preview | 56.8 |
| 2 | Claude Opus 4.7 | 56.8 |
| 3 | Gemini 3.1 Pro | 44.4 |
| 4 | GPT-5.4 | 42.7 |
| 5 | Gemini 3 Pro | 34.4 |
| 6 | | 31.1 |
| 7 | | 28.2 |
| 8 | | 24.2 |
| 9 | | 21.6 |
| 10 | | 21.4 |
| 11 | | 20.6 |
| 12 | | 19.8 |
| 13 | | 17.7 |
| 14 | | 16.3 |
| 15 | | 15.4 |
| 16 | | 13.9 |
| 17 | | 9.4 |
| 18 | | 7.7 |
| 19 | | 7.1 |
| 20 | | 6.2 |
| 21 | | 3.4 |
| 22 | | 3.1 |
| 23 | | 1.9 |
| 24 | | 0.9 |
| 25 | | 0.7 |
| 26 | | 0.6 |
Score distribution
Where models cluster
Correlated benchmarks
Pearson r · original research
Benchmarks that track with HLE
Pearson correlation across models scored on both benchmarks. The closer to 1, the more strongly one benchmark predicts the other.
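As a rough illustration of how these correlations are computed, here is a minimal Python sketch of Pearson r over the models two benchmarks share. The model names and the Cybench-style scores are illustrative placeholders, not BenchGecko data.

```python
# Minimal sketch: Pearson r between two benchmarks over shared models.
# All scores below are illustrative placeholders, not BenchGecko data.
from math import sqrt

hle     = {"model_a": 56.8, "model_b": 44.4, "model_c": 34.4, "model_d": 21.6}
cybench = {"model_a": 61.0, "model_b": 48.0, "model_c": 39.0, "model_d": 25.0}

shared = sorted(hle.keys() & cybench.keys())  # only models scored on both
xs = [hle[m] for m in shared]
ys = [cybench[m] for m in shared]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
sx = sqrt(sum((x - mx) ** 2 for x in xs))
sy = sqrt(sum((y - my) ** 2 for y in ys))
r = cov / (sx * sy)  # Pearson r; 1.0 = perfectly linear relationship
print(f"Pearson r over {n} shared models: {r:.2f}")
```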
How it works
Evaluation methodology
Humanity's Last Exam is a crowd-sourced collection of the hardest questions humans can write, spanning mathematics, physics, philosophy, logic, and esoteric knowledge domains. Questions are sourced from domain experts and designed to be at or beyond PhD-level difficulty. The evaluation is conducted in two settings: without tools (pure reasoning) and with tools (code execution, search). Each question has a verified correct answer. The benchmark is maintained as a living dataset with new questions added to prevent contamination.
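A minimal sketch of what two-mode accuracy scoring could look like, assuming exact-match grading against each question's verified answer. `ask_model` is a hypothetical stand-in for a real inference call, and HLE's actual grading pipeline may differ (for example, judge-based answer matching).

```python
# Minimal sketch of HLE-style scoring, assuming each question carries a
# verified reference answer and the model is run once per mode.
# `ask_model` is a hypothetical stand-in for a real inference call.

def ask_model(question: str, tools_enabled: bool) -> str:
    """Hypothetical: returns the model's final answer string."""
    raise NotImplementedError

def score(questions: list[dict], tools_enabled: bool) -> float:
    """Accuracy = exact matches against verified answers, per mode."""
    correct = 0
    for q in questions:
        answer = ask_model(q["question"], tools_enabled)
        if answer.strip() == q["verified_answer"].strip():
            correct += 1
    return 100.0 * correct / len(questions)

# Scores are reported separately for the two modes:
# no_tools   = score(dataset, tools_enabled=False)  # pure reasoning
# with_tools = score(dataset, tools_enabled=True)   # code execution + search
```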
Industry relevance
Why teams track this benchmark
HLE is the ceiling benchmark. When the top score is below 60%, it means frontier AI still has significant room to grow on the hardest reasoning tasks. This benchmark will be the last to saturate, and tracking its frontier progression is the purest signal of AGI-relevant progress.
Practical takeaways
By role
Do not build products that require HLE-level reasoning from AI today. No model reliably exceeds 60%. Use this benchmark to track when hard-reasoning products become viable.
HLE progress is the single best proxy for "are we getting closer to transformative AI?" A 10-point jump in 6 months signals acceleration.
The no-tools vs with-tools gap (typically 8-12 points) quantifies how much tool access compensates for reasoning limitations. Track this delta across model generations.
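To make that delta concrete, here is a small sketch of tracking the tool-augmentation gap per model generation. All figures are placeholders except Claude Opus 4.7's, which are the 46.9/54.7 numbers quoted above.

```python
# Sketch: track the no-tools vs with-tools delta across model generations.
# The first row is a hypothetical earlier model; the second uses the
# Claude Opus 4.7 figures quoted in the summary above.

generations = [
    # (model, no_tools_score, with_tools_score)
    ("earlier_generation", 30.0, 39.5),   # hypothetical placeholder
    ("Claude Opus 4.7", 46.9, 54.7),      # from the summary above
]

for model, no_tools, with_tools in generations:
    delta = with_tools - no_tools
    print(f"{model}: tool-augmentation delta = {delta:+.1f} points")
```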
Frequently asked
About HLE
What does HLE measure?
HLE measures frontier-level reasoning: questions spanning mathematics, physics, philosophy, and logic, curated to be at or beyond the frontier of human expert capability, and evaluated both with and without tool augmentation. 26 AI models have been tested on it. Scores range from 0.6 to 56.8 out of 100.
Which model leads on HLE?
Claude Mythos Preview from Anthropic leads HLE with a score of 56.8, tied with Claude Opus 4.7. The median score across 26 tested models is 17.0.
Is HLE saturated?
No · the top score is 56.8 out of 100. There is still meaningful room for improvement on HLE.
Does HLE predict performance on other benchmarks?
Yes · HLE scores correlate 0.95 with Cybench across 8 shared models. Models that do well on HLE tend to do well on Cybench.
How often is HLE data refreshed?
BenchGecko pulls updates daily. New model scores on HLE appear as soon as they are published by Epoch AI or the model provider.
- Category: Reasoning
- Creator: Scale AI / CAIS
- Max score: 100
- Modality: Text
- Scoring: Accuracy on verified questions. Evaluated in two modes: no-tools (pure reasoning) and with-tools (code execution + search access). Scores reported separately per mode.
- Models: 26
- Updated: 2026-04-07
“HLE is the benchmark that will matter the longest. Everything else saturates; HLE endures. When a model breaks 80% here, the conversation about AI capabilities changes permanently.”
Top on HLE
Claude Mythos Preview · 56.8
Claude Opus 4.7 · 56.8
Gemini 3.1 Pro · 44.4
GPT-5.4 · 42.7
Gemini 3 Pro · 34.4
More reasoning benchmarks
Same category · related evaluations