
SWE-Bench verified

Updated 2026-04-07
Models tested: 23
Top score: 93.9 (Claude Mythos Preview)
Median: 73.3 (min 31.0)
Top-5 spread: σ 6.8 (wide open)
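
A minimal sketch of how summary figures like these could be derived from a list of scores. The scores below are hypothetical, not the leaderboard's data, and the page does not say whether the top-5 spread is a population or sample standard deviation; the population form is assumed here.

```python
from statistics import median, pstdev

def summarize(scores):
    """Return count, top, median, min, and the spread of the five best scores."""
    ranked = sorted(scores, reverse=True)
    return {
        "models": len(ranked),
        "top": ranked[0],
        "median": median(ranked),
        "min": ranked[-1],
        # Assumption: spread = population standard deviation of the top five.
        "top5_spread": pstdev(ranked[:5]),
    }

# Hypothetical scores, for illustration only.
example = [90.0, 80.0, 70.0, 60.0, 50.0, 40.0, 30.0]
summary = summarize(example)
```

A small top-5 spread would mean the leaders are tightly packed; a larger one means the top of the table is more spread out.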

Best score over time · one chart, every benchmark

[Chart: SWE-Bench verified · 22 models · frontier running max. Y-axis: score (0 to 100, higher is better). X-axis: release date (Nov 24 to Apr 26). Source: benchgecko.ai/benchmark/swe-bench-verified]
Frontier on SWE-Bench verified rose from 31.0 to 93.9 in 17 months · +62.9 points · latest leader Claude Mythos Preview from Anthropic.
Pink dots = frontier records · 9 total
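
The "frontier running max" idea can be sketched as follows: walk the results in release order and keep only the entries that beat every earlier score. This is an illustrative sketch, not BenchGecko's implementation; the function name, tuple layout, and sample entries are all hypothetical.

```python
from datetime import date

def frontier_records(results):
    """Return the entries that set a new best score, in release order."""
    best = float("-inf")
    records = []
    for released, model, score in sorted(results, key=lambda r: r[0]):
        if score > best:  # strictly better than every earlier release
            best = score
            records.append((released, model, score))
    return records

# Hypothetical entries: (release date, model name, score).
results = [
    (date(2025, 3, 1), "model-b", 55.0),
    (date(2024, 11, 1), "model-a", 31.0),
    (date(2025, 7, 1), "model-c", 50.0),   # below the running max, not a record
    (date(2025, 12, 1), "model-d", 80.0),
]
records = frontier_records(results)
gain = records[-1][2] - records[0][2]  # points gained from first to latest record
```

Each tuple in `records` corresponds to one frontier dot on a chart like this, and `gain` is the first-to-latest improvement.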

Where models cluster

Score distribution · models per score bucket (0 to 100) · median 73.3
0–10: 0
10–20: 0
20–30: 0
30–40: 1
40–50: 1
50–60: 1
60–70: 4
70–80: 15
80–90: 0
90–100: 1
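
A minimal sketch of how scores could be grouped into ten-point buckets like those in the distribution above. The bucket boundaries are an assumption (lower bound inclusive, with a perfect 100 folded into the last bucket), and the function name is hypothetical.

```python
from collections import Counter

def bucket_counts(scores, width=10, top=100):
    """Count scores per fixed-width bucket between 0 and `top`."""
    n_buckets = top // width
    # Assumption: a score equal to `top` is placed in the last bucket.
    counts = Counter(min(int(s // width), n_buckets - 1) for s in scores)
    return {
        f"{i * width}-{(i + 1) * width}": counts.get(i, 0)
        for i in range(n_buckets)
    }

# Hypothetical scores, for illustration only.
counts = bucket_counts([31.0, 73.3, 75.0, 93.9, 100.0])
```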

Pearson r · original research

Correlation analysis

Benchmarks that track with SWE-Bench verified

Pearson correlation (r) computed across the models scored on both benchmarks. The closer r is to 1, the more strongly a score on one benchmark predicts the score on the other.
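
As an illustration of the calculation, the sketch below computes Pearson r using only the models that appear in both score sets. The function name, the dict-of-scores layout, and the sample values are hypothetical; the page does not describe BenchGecko's actual pipeline.

```python
from math import sqrt

def pearson_shared(a, b):
    """Pearson r over the models present in both score dicts."""
    shared = sorted(set(a) & set(b))
    if len(shared) < 2:
        raise ValueError("need at least two shared models")
    xs = [a[m] for m in shared]
    ys = [b[m] for m in shared]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)

# Hypothetical scores; only m1, m2, and m3 are shared.
bench_a = {"m1": 40.0, "m2": 60.0, "m3": 80.0, "m4": 90.0}
bench_b = {"m1": 35.0, "m2": 55.0, "m3": 75.0, "m5": 10.0}
r = pearson_shared(bench_a, bench_b)
```

With few shared models, a high r is weaker evidence than the number alone suggests, so the shared-model count is worth reading alongside the coefficient.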

23 models tested · sorted by score

Pulled from the SWE-Bench verified dataset · updated daily

What does SWE-Bench verified measure?

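SWE-Bench verified is a code benchmark in the BenchGecko catalog. It is a human-validated subset of SWE-bench in which a model must resolve real GitHub issues in Python repositories, and the score is the percentage of issues resolved. 23 AI models have been tested on it, with scores ranging from 31.0 to 93.9 out of 100.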

Which model leads on SWE-Bench verified?

Claude Mythos Preview from Anthropic leads SWE-Bench verified with a score of 93.9. The median score across 23 tested models is 73.3.

Is SWE-Bench verified saturated?

No · the top score is 93.9 out of 100, which leaves 6.1 points of headroom, and the median is 73.3, so most tested models still sit well below the ceiling.

Does SWE-Bench verified predict performance on other benchmarks?

Yes · SWE-Bench verified scores have a Pearson correlation of 0.99 with SWE-Bench Verified (Bash Only) across 11 shared models. Models that do well on SWE-Bench verified tend to do well on SWE-Bench Verified (Bash Only).

How often is SWE-Bench verified data refreshed?

BenchGecko pulls updates daily. New model scores on SWE-Bench verified appear in the next daily update after they are published by Epoch AI or the model provider.
