Rank	Model	Provider	Score (out of 100)	Price
1	Claude Mythos Preview	Anthropic	93.9	n/a
2	Claude Opus 4.6	Anthropic	78.7	$5.00
3	GPT-5.4	OpenAI	76.9	$2.50
4	Claude Opus 4.5	Anthropic	76.7	$5.00
5	Gemini 3.1 Pro Preview	Google DeepMind	75.6	$2.00
6	Gemini 3 Flash Preview	Google DeepMind	75.4	$0.50
7	Claude Sonnet 4.6	Anthropic	75.2	$3.00
8	GPT-5.3-Codex	OpenAI	74.8	$1.75
9	GPT-5.2	OpenAI	73.8	$1.75
10	Kimi K2.5	moonshotai	73.8	$0.38
11	GPT-5	OpenAI	73.5	$1.25
12	Claude Opus 4.1	Anthropic	73.3	$15.00
13	Gemini 3 Pro	Google DeepMind	72.9	n/a
14	GLM 5	z-ai	72.1	$0.72
15	Claude Sonnet 4.5	Anthropic	71.3	$3.00
16	Claude Opus 4	Anthropic	70.7	$15.00
17	GPT-5.1	OpenAI	68.0	$1.25
18	GPT-5 Mini	OpenAI	64.7	$0.25
19	o3	OpenAI	62.3	$2.00
20	Claude 3.7 Sonnet	Anthropic	61.0	$3.00
21	Gemini 2.5 Pro	Google DeepMind	57.6	$1.25
22	GPT-4.1	OpenAI	48.5	$2.00
23	GPT-4o (2024-11-20)	OpenAI	31.0	$2.50

Frequently asked

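Pulled from the SWE-Bench Verified dataset · updated daily

What does SWE-Bench Verified measure?

SWE-Bench Verified is a code benchmark in the BenchGecko catalog. It measures how often a model can resolve real GitHub issues from open-source Python repositories, where a fix counts only if the repository's tests pass. BenchGecko lists 23 AI models tested on it, with scores ranging from 31.0 to 93.9 out of 100.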

Which model leads on SWE-Bench Verified?

Claude Mythos Preview from Anthropic leads SWE-Bench Verified with a score of 93.9, which is 15.2 points ahead of second-placed Claude Opus 4.6 (78.7). The median score across the 23 tested models is 73.3.
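The headline figures quoted in this FAQ (23 models, a 31.0 to 93.9 range, a 73.3 median) can be reproduced from the leaderboard table. The following is a minimal sketch, assuming the scores are copied by hand from the table; the names `SCORES` and `summarize` are illustrative and not part of any BenchGecko API.

```python
from statistics import median

# (model, score) pairs copied from the leaderboard table above.
SCORES = [
    ("Claude Mythos Preview", 93.9), ("Claude Opus 4.6", 78.7),
    ("GPT-5.4", 76.9), ("Claude Opus 4.5", 76.7),
    ("Gemini 3.1 Pro Preview", 75.6), ("Gemini 3 Flash Preview", 75.4),
    ("Claude Sonnet 4.6", 75.2), ("GPT-5.3-Codex", 74.8),
    ("GPT-5.2", 73.8), ("Kimi K2.5", 73.8),
    ("GPT-5", 73.5), ("Claude Opus 4.1", 73.3),
    ("Gemini 3 Pro", 72.9), ("GLM 5", 72.1),
    ("Claude Sonnet 4.5", 71.3), ("Claude Opus 4", 70.7),
    ("GPT-5.1", 68.0), ("GPT-5 Mini", 64.7),
    ("o3", 62.3), ("Claude 3.7 Sonnet", 61.0),
    ("Gemini 2.5 Pro", 57.6), ("GPT-4.1", 48.5),
    ("GPT-4o (2024-11-20)", 31.0),
]


def summarize(rows):
    """Return count, min, max, median and leading model for (model, score) rows."""
    values = [score for _, score in rows]
    leader_name, _ = max(rows, key=lambda row: row[1])
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "median": median(values),  # middle value of the sorted scores
        "leader": leader_name,
    }


if __name__ == "__main__":
    print(summarize(SCORES))
```

With 23 entries the median is simply the 12th score in sorted order, so no averaging of two middle values is involved.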

Is SWE-Bench Verified saturated?

No · the top score is 93.9 out of 100 (about 94%), which leaves 6.1 points of headroom, and the median model scores only 73.3. There is still room for improvement on SWE-Bench Verified.
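The percentage and remaining headroom follow directly from the top score. As a small sketch, assuming scores are reported on a 0 to 100 scale as stated in this FAQ (the helper name `headroom` is hypothetical):

```python
MAX_SCORE = 100.0  # scores are reported out of 100


def headroom(top_score, max_score=MAX_SCORE):
    """Return (percent of maximum reached, points still available)."""
    percent_reached = round(100 * top_score / max_score)
    # Round to one decimal to avoid floating-point noise such as 6.0999...
    points_remaining = round(max_score - top_score, 1)
    return percent_reached, points_remaining


if __name__ == "__main__":
    print(headroom(93.9))
```

The source does not define a numeric threshold for "saturated", so the sketch reports the headroom only and leaves that judgment to the reader.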

Does SWE-Bench Verified predict performance on other benchmarks?

Yes · SWE-Bench Verified scores have a correlation of 0.99 with SWE-Bench Verified (Bash Only) across the 11 models that appear on both. Models that do well on one tend to do well on the other, although this is a closely related variant of the same benchmark.
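A correlation figure like this is computed over the models that two benchmarks share. The sketch below assumes a Pearson correlation, which the source does not confirm (a rank correlation is equally plausible). The function name `pearson` and the sample values are placeholders for illustration, not real benchmark results.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from each mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Placeholder values only; NOT actual scores from either benchmark.
    benchmark_a = [10.0, 20.0, 30.0, 40.0]
    benchmark_b = [12.0, 19.0, 33.0, 41.0]
    print(round(pearson(benchmark_a, benchmark_b), 3))
```

With only 11 shared models, a coefficient computed this way is sensitive to a few outliers, so it is best read as a rough indicator.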

How often is SWE-Bench Verified data refreshed?

BenchGecko pulls updates daily. New model scores on SWE-Bench Verified appear in the next daily pull after they are published by Epoch AI or the model provider.