SWE-Bench verified
The Frontier
Best score over time · one chart, every benchmark
Distribution
Where models cluster
Correlated benchmarks
Pearson r · original research
Benchmarks that track with SWE-Bench verified
Pearson correlation computed across the models scored on both benchmarks. Values closer to 1 mean scores on one benchmark more strongly predict scores on the other.
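As a rough illustration of how a correlation like this could be computed, the sketch below restricts two score tables to the models they share and then applies the standard Pearson formula. The function names (`pearson_r`, `correlate_shared`), the model identifiers, and the scores in the usage example are all hypothetical; this is not BenchGecko's actual pipeline.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / sqrt(var_x * var_y)


def correlate_shared(scores_a, scores_b):
    """Correlate two {model: score} dicts over the models present in both.

    Returns (r, number_of_shared_models).
    """
    shared = sorted(scores_a.keys() & scores_b.keys())
    xs = [scores_a[m] for m in shared]
    ys = [scores_b[m] for m in shared]
    return pearson_r(xs, ys), len(shared)


# Hypothetical scores, for illustration only (not real leaderboard data).
bench_a = {"model-a": 70.0, "model-b": 60.0, "model-c": 50.0, "model-d": 40.0}
bench_b = {"model-a": 65.0, "model-b": 58.0, "model-c": 44.0, "model-e": 30.0}

r, n_shared = correlate_shared(bench_a, bench_b)
print(f"r = {r:.2f} across {n_shared} shared models")
```

Only models scored on both benchmarks enter the calculation, so a coefficient based on a small shared set is less reliable than one based on a large set.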
Full rankings
23 models tested · sorted by score
| # | Model | Score |
|---|---|---|
| 1 | Claude Mythos Preview | 93.9 |
| 2 | | 78.7 |
| 3 | | 76.9 |
| 4 | | 76.7 |
| 5 | | 75.6 |
| 6 | | 75.4 |
| 7 | | 75.2 |
| 8 | | 74.8 |
| 9 | | 73.8 |
| 10 | | 73.8 |
| 11 | | 73.5 |
| 12 | | 73.3 |
| 13 | | 72.9 |
| 14 | | 72.1 |
| 15 | | 71.3 |
| 16 | | 70.7 |
| 17 | | 68.0 |
| 18 | | 64.7 |
| 19 | | 62.3 |
| 20 | | 61.0 |
| 21 | | 57.6 |
| 22 | | 48.5 |
| 23 | | 31.0 |
Frequently asked
Pulled from the SWE-Bench verified dataset · updated daily
What does SWE-Bench verified measure?
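SWE-Bench verified is a code benchmark in the BenchGecko catalog. It tests whether a model can resolve real GitHub issues by generating a patch that passes the repository's tests. 23 AI models have been tested on it, with scores ranging from 31.0 to 93.9 out of 100.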
Which model leads on SWE-Bench verified?
Claude Mythos Preview from Anthropic leads SWE-Bench verified with a score of 93.9. The median score across 23 tested models is 73.3.
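The summary figures quoted in these answers can be checked directly against the rankings table. The sketch below copies the 23 listed scores into a list (the variable names are illustrative) and recomputes the top score, lowest score, median, and remaining headroom.

```python
from statistics import median

# The 23 scores from the full rankings table, highest to lowest.
scores = [
    93.9, 78.7, 76.9, 76.7, 75.6, 75.4, 75.2, 74.8,
    73.8, 73.8, 73.5, 73.3, 72.9, 72.1, 71.3, 70.7,
    68.0, 64.7, 62.3, 61.0, 57.6, 48.5, 31.0,
]

top_score = max(scores)
lowest_score = min(scores)
# With an odd number of entries, the median is the middle (12th) value.
median_score = median(scores)
# Points remaining between the top score and the 100-point ceiling.
headroom = round(100 - top_score, 1)

print(f"models: {len(scores)}")
print(f"range: {lowest_score} to {top_score}")
print(f"median: {median_score}")
print(f"headroom: {headroom}")
```

Because the list has an odd number of entries, the median is simply the 12th-ranked score rather than an average of two values.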
Is SWE-Bench verified saturated?
No · the top score is 93.9 out of 100 (94%), which leaves 6.1 points of headroom. The second-ranked model scores 78.7, so most of the field still has substantial room to improve on SWE-Bench verified.
Does SWE-Bench verified predict performance on other benchmarks?
Yes · SWE-Bench verified scores have a Pearson correlation of 0.99 with SWE-Bench Verified (Bash Only) across 11 shared models, so models that do well on one tend to do well on the other.
How often is SWE-Bench verified data refreshed?
BenchGecko pulls updates daily. New model scores on SWE-Bench verified appear as soon as they are published by Epoch AI or the model provider.