Cybench
Cybench evaluates AI models on real Capture-the-Flag (CTF) cybersecurity challenges, testing vulnerability analysis, exploitation, and security reasoning.
Full rankings
20 models tested · sorted by score
| # | Model | Score |
|---|---|---|
| 1 | Claude Opus 4.6 | 93.0 |
| 2 | Claude Opus 4.5 | 82.0 |
| 3 | Claude Sonnet 4.5 | 60.0 |
| 4 | Grok 4 | 43.0 |
| 5 | Claude Opus 4.1 | 42.0 |
| 6 | | 38.0 |
| 7 | | 35.0 |
| 8 | | 30.0 |
| 9 | | 22.5 |
| 10 | | 20.0 |
| 11 | | 17.5 |
| 12 | | 17.5 |
| 13 | | 12.5 |
| 14 | | 10.0 |
| 15 | | 10.0 |
| 16 | | 10.0 |
| 17 | | 7.5 |
| 18 | | 7.5 |
| 19 | | 7.5 |
| 20 | | 5.0 |
Correlated benchmarks
Pearson r · original research
Benchmarks that track with Cybench
Pearson correlation computed across the models scored on both benchmarks. The closer the value is to 1, the more strongly a model's score on one benchmark predicts its score on the other.
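As a rough illustration of the calculation described above, the sketch below restricts two score tables to the models present in both and computes Pearson r over those pairs. The model names and scores are hypothetical placeholders, not leaderboard data, and `pearson_r` is an illustrative helper rather than the site's actual implementation.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores (illustrative only): each benchmark covers a
# different set of models, so only the overlap can be compared.
bench_a = {"model-a": 10.0, "model-b": 20.0, "model-c": 30.0,
           "model-d": 40.0, "model-e": 55.0}
bench_b = {"model-a": 12.0, "model-b": 24.0, "model-c": 36.0,
           "model-d": 48.0, "model-f": 70.0}

# Keep only the models scored on both benchmarks.
shared = sorted(set(bench_a) & set(bench_b))
r = pearson_r([bench_a[m] for m in shared], [bench_b[m] for m in shared])
print(f"{len(shared)} shared models, r = {r:.2f}")
```

With only a handful of shared models, a single outlier can move r considerably, so the number of shared models is worth reporting alongside the coefficient.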
Frequently asked
About Cybench
What does Cybench measure?
Cybench evaluates AI models on real Capture-the-Flag (CTF) cybersecurity challenges, testing vulnerability analysis, exploitation, and security reasoning. So far, 20 models have been tested on it, with scores ranging from 5.0 to 93.0 out of 100.
Which model leads on Cybench?
Claude Opus 4.6 from Anthropic leads Cybench with a score of 93.0. The median score across 20 tested models is 18.8.
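The summary figures quoted here can be checked directly against the 20 scores in the rankings table. A minimal sketch, assuming those listed scores are the complete data set:

```python
from statistics import median

# Scores copied from the "Full rankings" table (20 models, highest first).
scores = [93.0, 82.0, 60.0, 43.0, 42.0, 38.0, 35.0, 30.0, 22.5, 20.0,
          17.5, 17.5, 12.5, 10.0, 10.0, 10.0, 7.5, 7.5, 7.5, 5.0]

top = max(scores)
low = min(scores)
# With an even number of values, the median is the mean of the two
# middle values: here the 10th (20.0) and 11th (17.5) scores.
mid = median(scores)

print(f"models={len(scores)} top={top} low={low} median={mid}")
```

The exact median is 18.75, which matches the 18.8 quoted above once rounded to one decimal place.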
Is Cybench saturated?
Not yet. The top score is 93.0 out of 100, which leaves the leading model 7 points of headroom, and the median across all 20 tested models is only 18.8, so most models remain far from the ceiling.
Does Cybench predict performance on other benchmarks?
Yes. Cybench scores have a Pearson correlation of 0.98 with GSO-Bench scores across the 9 models evaluated on both. Models that do well on Cybench tend to do well on GSO-Bench.
How often is Cybench data refreshed?
BenchGecko pulls updates daily. New model scores on Cybench appear in the next daily update after they are published by Epoch AI or the model provider.
- Category: Code
- Max score: 100
- Models: 20
- Updated: 2026-02-04
Top on Cybench
- Claude Opus 4.6 · 93.0
- Claude Opus 4.5 · 82.0
- Claude Sonnet 4.5 · 60.0
- Grok 4 · 43.0
- Claude Opus 4.1 · 42.0

More code benchmarks
Same category · related evaluations