Benchmark · Code · Competitive

SWE-bench Verified

SWE-bench Verified · 500 human-validated tasks drawn from 12 real Python repositories (Django, Flask, scikit-learn, sympy, and others). Each task requires the model to produce a git patch that resolves a real GitHub issue and passes the tests associated with that task. The Verified subset removes ambiguous tasks from the original SWE-bench. Claude Mythos Preview leads at 93.9%, the first score above 90%, reached in 2026. Opus 4.6 scores 80.8%. The benchmark remains the most-cited evaluation of code-generation capability.

Updated 2026-04-07

Mythos 93.9%. Opus 4.6 80.8%. The frontier crossed 90% in 2026.

Scoring: Binary pass/fail per task. The model's git patch must apply cleanly to the repository and all associated test cases must pass. No partial credit. Final score = percentage of tasks passed out of 500.
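The scoring rule above can be sketched in a few lines. This is a minimal illustration, not the official harness: the function names `task_resolved` and `swe_bench_score` and the boolean inputs are assumptions made for the example.

```python
def task_resolved(patch_applied: bool, test_results: list[bool]) -> bool:
    """A task counts only if the patch applies and every associated test passes."""
    # Assumption: a task with no recorded test results is treated as not resolved.
    return patch_applied and bool(test_results) and all(test_results)


def swe_bench_score(resolved: list[bool], total_tasks: int = 500) -> float:
    """Percentage of tasks resolved out of total_tasks, with no partial credit."""
    if len(resolved) != total_tasks:
        raise ValueError(f"expected {total_tasks} task outcomes, got {len(resolved)}")
    return 100.0 * sum(resolved) / total_tasks


# Hypothetical run: 404 resolved tasks and 96 unresolved tasks.
outcomes = [True] * 404 + [False] * 96
print(round(swe_bench_score(outcomes), 1))  # 80.8
```

Under this rule, a patch that passes all but one associated test counts the same as a patch that fails to apply.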

Models tested: 24
Top score: 93.9 (Claude Mythos Preview)
Median: 73.3 (min 31.0)
Top-5 spread: σ 6.8 (wide open)
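A rough sketch of how summary figures like these could be derived from a list of scores. The page does not say whether σ is a population or sample standard deviation, so the population form (`statistics.pstdev`) is an assumption here, and the scores in the example are hypothetical rather than leaderboard data.

```python
import statistics


def summarize(scores: list[float]) -> dict[str, float]:
    """Summary statistics for a leaderboard, highest score first."""
    ranked = sorted(scores, reverse=True)
    return {
        "models": len(ranked),
        "top": ranked[0],
        "median": statistics.median(ranked),
        "min": ranked[-1],
        # Assumption: "top-5 spread" is the population standard deviation
        # of the five highest scores.
        "top5_spread": statistics.pstdev(ranked[:5]),
    }


# Hypothetical scores, for illustration only.
summary = summarize([90.0, 80.0, 70.0, 60.0, 50.0, 40.0])
print(summary["median"], round(summary["top5_spread"], 2))
```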

Best score over time · one chart, every benchmark

[Chart: SWE-bench Verified · 22 models · frontier running max. Y-axis: score (0 to 100). X-axis: release date (Nov 24 to Apr 26). Source: benchgecko.ai/benchmark/swe-bench-verified]
Frontier on SWE-bench Verified rose from 31.0 to 93.9 in 17 months · +62.9 points · latest leader Claude Mythos Preview from Anthropic.
Pink dots = frontier records · 9 total
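The "frontier running max" in the chart can be read as the best score seen so far when models are ordered by release date, with a frontier record logged each time a new model beats it. A minimal sketch under that reading follows; the function name `frontier_records` and the sample entries are hypothetical.

```python
from datetime import date


def frontier_records(results: list[tuple[date, str, float]]) -> list[tuple[date, str, float]]:
    """Return the entries that set a new best score, in release order."""
    records = []
    best = float("-inf")
    # Sort by release date only; ties keep their input order.
    for released, model, score in sorted(results, key=lambda r: r[0]):
        if score > best:  # strictly higher than every earlier score
            best = score
            records.append((released, model, score))
    return records


# Hypothetical entries, for illustration only.
sample = [
    (date(2025, 7, 1), "model-c", 45.0),
    (date(2024, 11, 1), "model-a", 30.0),
    (date(2025, 12, 1), "model-d", 70.0),
    (date(2025, 3, 1), "model-b", 50.0),
]
records = frontier_records(sample)
gain = records[-1][2] - records[0][2]
print([name for _, name, _ in records], gain)
```

Here `model-c` is skipped because it was released after a higher-scoring model, so it never sits on the frontier.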
Details
Category: Code
Max score: 100
Scoring: Binary pass/fail per task. The model's git patch must apply cleanly to the repository and all associated test cases must pass. No partial credit. Final score = percentage of tasks passed out of 500.
Models: 24
Updated: 2026-04-07
Gecko's Take

SWE-bench Verified crossing 90% is the milestone that changes software engineering economics. Models that score above 85% here can handle the majority of bug-fix PRs autonomously. The race now shifts to SWE-bench Pro and harder evaluations.

Same category · related evaluations