Benchmark · Reasoning · Wide open

HLE

HLE (Humanity's Last Exam) · a reasoning benchmark designed to be the hardest public evaluation of AI. Questions span mathematics, physics, philosophy, and logic · curated to be at or beyond the frontier of human expert capability. Tested with and without tool augmentation. Claude Opus 4.7 scores 46.9% without tools and 54.7% with tools · making it one of the few benchmarks where the top score is below 60%.

Updated 2026-04-07

Top score is below 60%. Most frontier models score below 55%. Humanity's Last Exam lives up to its name.

Scoring: Accuracy on verified questions. Evaluated in two modes: no-tools (pure reasoning) and with-tools (code execution + search access). Scores reported separately per mode.
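
A minimal sketch of that per-mode bookkeeping, assuming graded records of the form (question_id, mode, correct); the record layout is an assumption, not HLE's published format:

```python
from collections import defaultdict

# Hypothetical graded records: (question_id, mode, correct).
# The field layout is an assumption, not HLE's actual data format.
results = [
    ("q001", "no-tools", True),
    ("q001", "with-tools", True),
    ("q002", "no-tools", False),
    ("q002", "with-tools", True),
]

def accuracy_by_mode(records):
    """Return accuracy per evaluation mode, reported separately."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for _, mode, is_correct in records:
        totals[mode] += 1
        correct[mode] += is_correct  # bool counts as 0/1
    return {mode: correct[mode] / totals[mode] for mode in totals}

print(accuracy_by_mode(results))
# {'no-tools': 0.5, 'with-tools': 1.0}
```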

Models tested: 26
Top score: 56.8 (Claude Mythos Preview)
Median: 15.4 (min 0.6)
Top-5 spread: σ 11.4 (wide open)
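
The top-5 spread is the standard deviation of the five best scores. A minimal sketch, assuming a population standard deviation; only the leader's 56.8 comes from this page, so the four trailing values are placeholders and the printed result will not reproduce σ 11.4:

```python
import statistics

# Top-5 HLE scores. Only the leader (56.8) appears on this page;
# the other four values are placeholders for illustration.
top5 = [56.8, 54.7, 46.9, 41.0, 38.5]

# Population standard deviation of the top five scores.
# (statistics.stdev would give the sample version instead.)
sigma = statistics.pstdev(top5)
print(f"Top-5 spread · σ {sigma:.1f}")
```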

Best score over time · one chart, every benchmark

[Chart: HLE frontier running max · 20 models · score (0–100) vs. release date, Feb 25 – Apr 26 · benchgecko.ai/benchmark/hle]
Frontier on HLE rose from 3.4 to 56.8 in 14 months · +53.4 points · latest leader Claude Mythos Preview from Anthropic.
Pink dots = frontier records · 7 total
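
The frontier line in the chart is a running maximum over release dates: a model sets a record (a pink dot) only if it beats every earlier score. A minimal sketch; the 3.4 start and 56.8 endpoint come from the caption above, while the intermediate points and exact dates are illustrative:

```python
from datetime import date

# (release_date, score) pairs. The 3.4 start and 56.8 endpoint are
# from the page; the intermediate points are illustrative only.
scores = [
    (date(2025, 2, 1), 3.4),
    (date(2025, 6, 15), 21.0),
    (date(2025, 12, 1), 44.2),
    (date(2026, 4, 1), 56.8),
]

def frontier(points):
    """Running max: a point is a frontier record iff it beats all earlier scores."""
    records, best = [], float("-inf")
    for when, score in sorted(points):
        if score > best:
            best = score
            records.append((when, best))  # pink dot: new frontier record
    return records

for when, best in frontier(scores):
    print(when, best)
```
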
Details
Category: Reasoning
Creator: Scale AI / CAIS
Max score: 100
Modality: Text
Scoring: Accuracy on verified questions; two modes, no-tools (pure reasoning) and with-tools (code execution + search access), reported separately.
Models: 26
Updated: 2026-04-07
Tests: Expert reasoning · Mathematics · Physics · Philosophy · Cross-domain logic
Does not test: Code · Vision · Speed · Tool use · Long context
Gecko's Take

HLE is the benchmark that will matter the longest. Everything else saturates; HLE endures. When a model breaks 80% here, the conversation about AI capabilities changes permanently.

Same category · related evaluations