Benchmark · Code · Wide open

Cybench

Cybench evaluates AI models on real Capture-the-Flag (CTF) cybersecurity challenges, testing vulnerability analysis, exploitation, and security reasoning.

Updated 2026-02-04
Models tested: 20
Top score: 93.0 (Claude Opus 4.6)
Median: 18.8 (min 5.0)
Top-5 spread: σ 20.5 (wide open)

Best score over time · one chart, every benchmark

[Chart: Cybench · 12 models · frontier running max. Y-axis: score, 0 to 100 (higher is better). X-axis: release date, Jul 24 to Feb 26. Source: benchgecko.ai/benchmark/cybench]
Frontier on Cybench rose from 12.5 to 93.0 in 15 months · +80.5 points · latest leader Claude Opus 4.6 from Anthropic.
Pink dots = frontier records · 7 total
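
A "frontier running max" can be read as the best score seen so far, ordered by model release date; a frontier record is any model that raises that maximum. The sketch below is one minimal way to compute it, not the site's actual method. The function name `frontier_records`, the model names, the release dates, and the intermediate scores are all invented for illustration. Only the endpoint scores 12.5 and 93.0 come from the page.

```python
from datetime import date

def frontier_records(results):
    """Return the (release_date, model, score) entries that set a new best score.

    `results` is an iterable of (release_date, model_name, score) tuples.
    """
    records = []
    best = float("-inf")
    # Walk models in release order and keep only those that beat the running max.
    for released, model, score in sorted(results, key=lambda r: r[0]):
        if score > best:
            best = score
            records.append((released, model, score))
    return records

# Hypothetical data: names, dates, and middle scores are made up.
# Only the first and last scores (12.5 and 93.0) appear on the page.
results = [
    (date(2024, 7, 1), "model-a", 12.5),
    (date(2024, 12, 1), "model-b", 10.0),  # below the frontier, not a record
    (date(2025, 4, 1), "model-c", 40.0),
    (date(2025, 9, 1), "model-d", 35.0),   # below the frontier, not a record
    (date(2026, 2, 1), "model-e", 93.0),
]

records = frontier_records(results)
gain = records[-1][2] - records[0][2]  # last record minus first record
print(len(records), gain)
```

With this made-up input, three of the five models set records, and the gain between the first and last record reproduces the page's +80.5 points (93.0 minus 12.5).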

Same category · related evaluations