Which model leads on Cybench?

Claude Opus 4.6 from Anthropic leads Cybench with a score of 93.0. The median score across 20 tested models is 18.8.

Is Cybench saturated?

No · the top score is 93.0 out of 100, so the leader still has headroom, and the median of 18.8 shows that most tested models remain far from the ceiling.

Does Cybench correlate with other benchmarks?

Yes · Cybench scores correlate 0.98 with GSO-Bench across 9 shared models. Models that do well on Cybench tend to do well on GSO-Bench.
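As a rough illustration of how a cross-benchmark figure like this could be computed, the sketch below restricts two score tables to the models they share and takes a Pearson correlation. The page does not say which coefficient it uses, so Pearson is an assumption, and the function names and input values are hypothetical placeholders rather than real Cybench or GSO-Bench scores.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def shared_model_correlation(scores_a, scores_b):
    """Correlate two benchmarks over the models that appear in both."""
    shared = sorted(set(scores_a) & set(scores_b))
    r = pearson([scores_a[m] for m in shared], [scores_b[m] for m in shared])
    return r, len(shared)


# Placeholder inputs for illustration only, NOT real benchmark scores.
bench_a = {"model_a": 10.0, "model_b": 20.0, "model_c": 30.0, "model_d": 40.0}
bench_b = {"model_b": 5.0, "model_c": 7.5, "model_d": 10.0, "model_e": 99.0}

r, n_shared = shared_model_correlation(bench_a, bench_b)
print(f"r = {r:.2f} across {n_shared} shared models")
```

Only models present in both tables enter the calculation, which is why the page reports the correlation alongside the number of shared models.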

How often is Cybench updated?

BenchGecko pulls updates daily. New model scores on Cybench appear as soon as they are published by Epoch AI or the model provider.

Benchmark · Code · Wide open

Name: Cybench Benchmark
Creator: BenchGecko
License: https://creativecommons.org/licenses/by/4.0/

Cybench evaluates AI models on real Capture-The-Flag (CTF) cybersecurity challenges, testing vulnerability analysis, exploitation, and security reasoning.

Updated 2026-02-04

Models tested: 20
Top score: 93.0 (Claude Opus 4.6)
Median: 18.8 (min 5.0)
Top-5 spread: σ 20.5 (wide open)
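The page does not state how the top-5 spread is calculated. As a sketch under that caveat, taking the population standard deviation of the five highest scores in the leaderboard (93.0, 82.0, 60.0, 43.0, 42.0) reproduces the quoted σ of 20.5, whereas the sample standard deviation does not. Treat the choice of formula as an inference from the numbers, not a documented method.

```python
import statistics

# Five highest Cybench scores listed in the leaderboard.
top5 = [93.0, 82.0, 60.0, 43.0, 42.0]

# Population standard deviation (divides by n): matches the page's figure.
spread = statistics.pstdev(top5)

# Sample standard deviation (divides by n - 1): shown for comparison only.
sample_spread = statistics.stdev(top5)

print(f"population σ = {spread:.1f}, sample s = {sample_spread:.1f}")
```

A large spread among the top five means the leading models are far apart from one another, which fits the "wide open" label.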

Best score over time · one chart, every benchmark

Frontier on Cybench rose from 12.5 to 93.0 in 15 months · +80.5 points · latest leader Claude Opus 4.6 from Anthropic.
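The arithmetic behind that summary can be checked in a few lines. The sketch uses only the three figures quoted above; the average monthly rate is a derived number that the page itself does not report, and it assumes nothing about how evenly the gains were distributed across the 15 months.

```python
# Figures quoted in the frontier summary.
first_record = 12.5   # earliest frontier score
latest_record = 93.0  # current frontier score
months = 15           # time between the two

gain = latest_record - first_record  # total improvement in points
per_month = gain / months            # average rate, derived (not on the page)

print(f"+{gain:.1f} points, about {per_month:.1f} points per month on average")
```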

Frontier records · 7 total

20 models tested · sorted by score

#	Model	Provider	Score	Price
1	Claude Opus 4.6	Anthropic	93.0	$5.00
2	Claude Opus 4.5	Anthropic	82.0	$5.00
3	Claude Sonnet 4.5	Anthropic	60.0	$3.00
4	Grok 4	xAI	43.0	$3.00
5	Claude Opus 4.1	Anthropic	42.0	$15.00
6	Claude Opus 4	Anthropic	38.0	$15.00
7	Claude Sonnet 4	Anthropic	35.0	$3.00
8	Grok 4 Fast	xAI	30.0	$0.20
9	o3 Mini	OpenAI	22.5	$1.10
10	Claude 3.7 Sonnet	Anthropic	20.0	$3.00
11	Claude 3.5 Sonnet	Anthropic	17.5	—
12	GPT-4.5	OpenAI	17.5	—
13	GPT-4o (2024-11-20)	OpenAI	12.5	$2.50
14	Claude 3 Opus	Anthropic	10.0	—
15	o1-mini	OpenAI	10.0	—
16	o1-preview	OpenAI	10.0	—
17	Gemini 1.5 Pro (Feb 2024)	Google DeepMind	7.5	—
18	Llama 3.1 405B	Meta	7.5	—
19	Mixtral 8x22B Instruct	Mistral AI	7.5	$2.00
20	Llama 3 70B Instruct	Meta	5.0	$0.51
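The headline figures on this page (leader, median, minimum) can be recomputed from the leaderboard rows. The sketch below copies the rows into tuples, with `None` standing in for a price the page leaves blank. The tuple layout and variable names are illustrative choices, not part of any BenchGecko API.

```python
import statistics
from collections import Counter

# (model, provider, score, listed price or None), copied from the leaderboard.
ROWS = [
    ("Claude Opus 4.6", "Anthropic", 93.0, 5.00),
    ("Claude Opus 4.5", "Anthropic", 82.0, 5.00),
    ("Claude Sonnet 4.5", "Anthropic", 60.0, 3.00),
    ("Grok 4", "xAI", 43.0, 3.00),
    ("Claude Opus 4.1", "Anthropic", 42.0, 15.00),
    ("Claude Opus 4", "Anthropic", 38.0, 15.00),
    ("Claude Sonnet 4", "Anthropic", 35.0, 3.00),
    ("Grok 4 Fast", "xAI", 30.0, 0.20),
    ("o3 Mini", "OpenAI", 22.5, 1.10),
    ("Claude 3.7 Sonnet", "Anthropic", 20.0, 3.00),
    ("Claude 3.5 Sonnet", "Anthropic", 17.5, None),
    ("GPT-4.5", "OpenAI", 17.5, None),
    ("GPT-4o (2024-11-20)", "OpenAI", 12.5, 2.50),
    ("Claude 3 Opus", "Anthropic", 10.0, None),
    ("o1-mini", "OpenAI", 10.0, None),
    ("o1-preview", "OpenAI", 10.0, None),
    ("Gemini 1.5 Pro (Feb 2024)", "Google DeepMind", 7.5, None),
    ("Llama 3.1 405B", "Meta", 7.5, None),
    ("Mixtral 8x22B Instruct", "Mistral AI", 7.5, 2.00),
    ("Llama 3 70B Instruct", "Meta", 5.0, 0.51),
]

scores = [score for _, _, score, _ in ROWS]

leader = max(ROWS, key=lambda row: row[2])   # row with the highest score
median = statistics.median(scores)           # mean of the 10th and 11th scores
lowest = min(scores)
per_provider = Counter(provider for _, provider, _, _ in ROWS)
priced = [row for row in ROWS if row[3] is not None]

print(f"leader: {leader[0]} ({leader[1]}) at {leader[2]}")
print(f"median: {median}, min: {lowest}, models: {len(ROWS)}")
print(f"models with a listed price: {len(priced)}")
print(f"models per provider: {dict(per_provider)}")
```

With an even number of models, the median is the average of the two middle scores (20.0 and 17.5), giving 18.75, which the page displays rounded as 18.8.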