HLE
HLE (Humanity's Last Exam) · a reasoning benchmark designed to be the hardest public evaluation of AI. Questions span mathematics, physics, philosophy, and logic · curated to be at or beyond the frontier of human expert capability. Tested with and without tool augmentation. Claude Opus 4.7 scores 46.9% without tools and 54.7% with tools · making it one of the few benchmarks where the top score is below 60%.
Top score is below 60%. Most frontier models score below 55%. Humanity's Last Exam lives up to its name.
Scoring: Accuracy on verified questions. Evaluated in two modes: no-tools (pure reasoning) and with-tools (code execution + search access). Scores reported separately per mode.
The Frontier
Best score over time · one chart, every benchmark
Full rankings
26 models tested · sorted by score · includes 5 verified scores
| # | Model | Score |
|---|---|---|
| 1 | Claude Mythos Preview | 56.8 |
| 2 | Claude Opus 4.7 | 56.8 |
| 3 | Gemini 3.1 Pro | 44.4 |
| 4 | GPT-5.4 | 42.7 |
| 5 | Gemini 3 Pro | 34.4 |
| 6 | | 31.1 |
| 7 | | 28.2 |
| 8 | | 24.2 |
| 9 | | 21.6 |
| 10 | | 21.4 |
| 11 | | 20.6 |
| 12 | | 19.8 |
| 13 | | 17.7 |
| 14 | | 16.3 |
| 15 | | 15.4 |
| 16 | | 13.9 |
| 17 | | 9.4 |
| 18 | | 7.7 |
| 19 | | 7.1 |
| 20 | | 6.2 |
| 21 | | 3.4 |
| 22 | | 3.1 |
| 23 | | 1.9 |
| 24 | | 0.9 |
| 25 | | 0.7 |
| 26 | | 0.6 |
Score distribution
Where models cluster
Correlated benchmarks
Pearson r · original research
Benchmarks that track with HLE
Pearson correlation across models scored on both benchmarks. The closer to 1, the more strongly one benchmark predicts the other.
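As a rough illustration of how these correlations are computed, here is a minimal Python sketch of Pearson r over the models two benchmarks share. The model names and the Cybench-style scores are illustrative placeholders, not BenchGecko data.

```python
# Minimal sketch: Pearson r between two benchmarks over shared models.
# All scores below are illustrative placeholders, not BenchGecko data.
from math import sqrt

hle     = {"model_a": 56.8, "model_b": 44.4, "model_c": 34.4, "model_d": 21.6}
cybench = {"model_a": 61.0, "model_b": 48.0, "model_c": 39.0, "model_d": 25.0}

shared = sorted(hle.keys() & cybench.keys())  # only models scored on both
xs = [hle[m] for m in shared]
ys = [cybench[m] for m in shared]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
sx = sqrt(sum((x - mx) ** 2 for x in xs))
sy = sqrt(sum((y - my) ** 2 for y in ys))
r = cov / (sx * sy)  # Pearson r; 1.0 = perfectly linear relationship
print(f"Pearson r over {n} shared models: {r:.2f}")
```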
How it works
Evaluation methodology
Humanity's Last Exam is a crowd-sourced collection of the hardest questions humans can write, spanning mathematics, physics, philosophy, logic, and esoteric knowledge domains. Questions are sourced from domain experts and designed to be at or beyond PhD-level difficulty. The evaluation is conducted in two settings: without tools (pure reasoning) and with tools (code execution, search). Each question has a verified correct answer. The benchmark is maintained as a living dataset with new questions added to prevent contamination.
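A minimal sketch of what two-mode accuracy scoring could look like, assuming exact-match grading against each question's verified answer. `ask_model` is a hypothetical stand-in for a real inference call, and HLE's actual grading pipeline may differ (for example, judge-based answer matching).

```python
# Minimal sketch of HLE-style scoring, assuming each question carries a
# verified reference answer and the model is run once per mode.
# `ask_model` is a hypothetical stand-in for a real inference call.

def ask_model(question: str, tools_enabled: bool) -> str:
    """Hypothetical: returns the model's final answer string."""
    raise NotImplementedError

def score(questions: list[dict], tools_enabled: bool) -> float:
    """Accuracy = exact matches against verified answers, per mode."""
    correct = 0
    for q in questions:
        answer = ask_model(q["question"], tools_enabled)
        if answer.strip() == q["verified_answer"].strip():
            correct += 1
    return 100.0 * correct / len(questions)

# Scores are reported separately for the two modes:
# no_tools   = score(dataset, tools_enabled=False)  # pure reasoning
# with_tools = score(dataset, tools_enabled=True)   # code execution + search
```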
Industry relevance
Why teams track this benchmark
HLE is the ceiling benchmark. When the top score is below 60%, it means frontier AI still has significant room to grow on the hardest reasoning tasks. This benchmark will be the last to saturate, and tracking its frontier progression is the purest signal of AGI-relevant progress.
Practical takeaways
By role
Do not build products that require HLE-level reasoning from AI today. No model reliably exceeds 60%. Use this benchmark to track when hard-reasoning products become viable.
HLE progress is the single best proxy for "are we getting closer to transformative AI?" A 10-point jump in 6 months signals acceleration.
The no-tools vs with-tools gap (typically 8-12 points) quantifies how much tool access compensates for reasoning limitations. Track this delta across model generations.
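To make that delta concrete, here is a small sketch of tracking the tool-augmentation gap per model generation. All figures are placeholders except Claude Opus 4.7's, which are the 46.9/54.7 numbers quoted above.

```python
# Sketch: track the no-tools vs with-tools delta across model generations.
# The first row is a hypothetical earlier model; the second uses the
# Claude Opus 4.7 figures quoted in the summary above.

generations = [
    # (model, no_tools_score, with_tools_score)
    ("earlier_generation", 30.0, 39.5),   # hypothetical placeholder
    ("Claude Opus 4.7", 46.9, 54.7),      # from the summary above
]

for model, no_tools, with_tools in generations:
    delta = with_tools - no_tools
    print(f"{model}: tool-augmentation delta = {delta:+.1f} points")
```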
Frequently asked
About HLE
What does HLE measure?
HLE measures frontier-level reasoning: questions spanning mathematics, physics, philosophy, and logic, curated to be at or beyond the frontier of human expert capability, and evaluated both with and without tool augmentation. 26 AI models have been tested on it. Scores range from 0.6 to 56.8 out of 100.
Which model leads on HLE?
Claude Mythos Preview from Anthropic leads HLE with a score of 56.8, tied with Claude Opus 4.7. The median score across 26 tested models is 17.0.
Is HLE saturated?
No · the top score is 56.8 out of 100. There is still meaningful room for improvement on HLE.
Does HLE predict performance on other benchmarks?
Yes · HLE scores correlate 0.95 with Cybench across 8 shared models. Models that do well on HLE tend to do well on Cybench.
How often is HLE data refreshed?
BenchGecko pulls updates daily. New model scores on HLE appear as soon as they are published by Epoch AI or the model provider.
- Category: Reasoning
- Creator: Scale AI / CAIS
- Max score: 100
- Modality: Text
- Scoring: Accuracy on verified questions. Evaluated in two modes: no-tools (pure reasoning) and with-tools (code execution + search access). Scores reported separately per mode.
- Models: 26
- Updated: 2026-04-07
“HLE is the benchmark that will matter the longest. Everything else saturates; HLE endures. When a model breaks 80% here, the conversation about AI capabilities changes permanently.”
Top on HLE
Claude Mythos Preview · 56.8
Claude Opus 4.7 · 56.8
Gemini 3.1 Pro · 44.4
GPT-5.4 · 42.7
Gemini 3 Pro · 34.4
More reasoning benchmarks
Same category · related evaluations