SWE-bench Verified
SWE-bench Verified · 500 human-validated tasks from 12 real Python repositories (Django, Flask, scikit-learn, sympy, and others). Each task requires the model to produce a git patch that resolves a real GitHub issue and passes the test suite. The verified subset eliminates ambiguous tasks from the original SWE-bench. Claude Mythos Preview leads at 93.9%, crossing 90% for the first time in 2026. Opus 4.6 scores 80.8%. The benchmark remains the most-cited evaluation for code-generation capability.
Mythos 93.9%. Opus 4.6 80.8%. The frontier crossed 90% in 2026.
Scoring: Binary pass/fail per task. The model's git patch must apply cleanly to the repository and all associated test cases must pass. No partial credit. Final score = percentage of tasks passed out of 500.
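The scoring rule reduces to a single ratio. A minimal sketch in Python, assuming per-task outcomes are already available as booleans; the function name `benchmark_score` and the input format are illustrative and not part of the official harness:

```python
from typing import Iterable


def benchmark_score(results: Iterable[bool], total_tasks: int = 500) -> float:
    """Return the percentage of tasks passed.

    `results` holds one boolean per attempted task (True means the patch
    applied cleanly and all associated tests passed). Tasks without a
    result count as failures, since there is no partial credit.
    """
    passed = sum(1 for ok in results if ok)
    if passed > total_tasks:
        raise ValueError("more passing results than tasks")
    return 100.0 * passed / total_tasks


# Hypothetical run: 404 of 500 tasks resolved.
example = [True] * 404 + [False] * 96
print(benchmark_score(example))  # 80.8
```

Because the denominator is fixed at 500, a single task is worth 0.2 percentage points.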
The Frontier
Best score over time · one chart, every benchmark
Full rankings
24 models tested · sorted by score · includes 3 verified scores
| # | Model | Score |
|---|---|---|
| 1 | Claude Mythos Preview | 93.9 |
| 2 | Claude Opus 4.7 | 87.6 |
| 3 | Claude Opus 4.6 | 78.7 |
| 4 | GPT-5.4 | 76.9 |
| 5 | Claude Opus 4.5 | 76.7 |
| 6 | | 75.6 |
| 7 | | 75.4 |
| 8 | | 75.2 |
| 9 | | 74.8 |
| 10 | | 73.8 |
| 11 | | 73.8 |
| 12 | | 73.5 |
| 13 | | 73.3 |
| 14 | | 72.9 |
| 15 | | 72.1 |
| 16 | | 71.3 |
| 17 | | 70.7 |
| 18 | | 68.0 |
| 19 | | 64.7 |
| 20 | | 62.3 |
| 21 | | 61.0 |
| 22 | | 57.6 |
| 23 | | 48.5 |
| 24 | | 31.0 |
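The summary figures for this table can be recomputed directly from the listed scores. The sketch below is illustrative only: the helper names `summarize` and `band_counts` and the 10-point band width are assumptions, not something the page defines.

```python
from statistics import median

# Scores copied from the rankings table above, in rank order.
scores = [
    93.9, 87.6, 78.7, 76.9, 76.7, 75.6, 75.4, 75.2,
    74.8, 73.8, 73.8, 73.5, 73.3, 72.9, 72.1, 71.3,
    70.7, 68.0, 64.7, 62.3, 61.0, 57.6, 48.5, 31.0,
]


def summarize(values):
    """Count, range, and median of a list of scores."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        # With an even count, the median is the midpoint of the two middle scores.
        "median": round(median(values), 1),
    }


def band_counts(values, width=10):
    """Count scores per band of `width` points, keyed by the band's lower bound."""
    counts = {}
    for v in values:
        lower = int(v // width) * width
        counts[lower] = counts.get(lower, 0) + 1
    return dict(sorted(counts.items()))


print(summarize(scores))
print(band_counts(scores))
```

With the 24 scores listed, the median falls between the 12th and 13th entries (73.5 and 73.3), and 15 of the 24 scores sit in the 70 to 79.9 band.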
Score distribution
Where models cluster
Correlated benchmarks
Pearson r · original research
Benchmarks that track with SWE-bench Verified
Pearson correlation across models scored on both benchmarks. Closer to 1 = strongly predictive.
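A cross-benchmark correlation of this kind can be sketched as follows. This is a plain-Python illustration under stated assumptions: the helpers `shared_scores` and `pearson_r` are hypothetical names, and the model names and scores in the example are made up for demonstration, not taken from this page.

```python
from math import sqrt


def shared_scores(bench_a: dict, bench_b: dict):
    """Return paired score lists for models that appear on both benchmarks."""
    models = sorted(set(bench_a) & set(bench_b))
    return [bench_a[m] for m in models], [bench_b[m] for m in models]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / sqrt(var_x * var_y)


# Hypothetical data: only model-a, model-b, and model-c are on both benchmarks.
bench_a = {"model-a": 80.0, "model-b": 70.0, "model-c": 60.0, "model-d": 50.0}
bench_b = {"model-a": 65.0, "model-b": 58.0, "model-c": 40.0, "model-e": 30.0}
xs, ys = shared_scores(bench_a, bench_b)
print(round(pearson_r(xs, ys), 3))
```

Only models scored on both benchmarks enter the calculation, so the coefficient can shift as the set of shared models grows.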
How it works
Evaluation methodology
SWE-bench Verified contains 500 human-validated tasks drawn from 12 real Python open-source repositories including Django, Flask, requests, scikit-learn, sympy, matplotlib, and others. Each task is a real GitHub issue with an associated test suite. The model receives the repository state, the issue description, and must produce a git patch that resolves the issue. Success is binary: the patch must apply cleanly and all associated tests must pass. The "verified" subset was human-curated to remove ambiguous or under-specified tasks from the original 2,294-task SWE-bench.
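The two success conditions described above (the patch applies cleanly, then all associated tests pass) can be sketched as a small check. This is a simplified illustration, not the official evaluation harness, which is more involved: the function names, the `test_cmd` argument, and the injectable `run` callable are assumptions made for this example. The only external commands used are `git apply --check` and `git apply`.

```python
import subprocess
from pathlib import Path
from typing import Callable, Sequence


def run_ok(cmd: Sequence[str], cwd: Path) -> bool:
    """Run a command in `cwd`; return True when it exits with status 0."""
    return subprocess.run(list(cmd), cwd=cwd, capture_output=True).returncode == 0


def evaluate_task(
    repo_dir: Path,
    patch_file: Path,
    test_cmd: Sequence[str],
    run: Callable[[Sequence[str], Path], bool] = run_ok,
) -> bool:
    """Binary pass/fail for one task, mirroring the rule in the text."""
    # Step 1: the patch must apply cleanly to the checked-out repository.
    if not run(["git", "apply", "--check", str(patch_file)], repo_dir):
        return False
    if not run(["git", "apply", str(patch_file)], repo_dir):
        return False
    # Step 2: every associated test must pass; the test command is task-specific.
    return run(list(test_cmd), repo_dir)
```

Passing `run` as a parameter keeps the pass/fail logic testable without a real repository; a caller would supply the repository path, the model's patch file, and the task's test command (for example `["pytest", "-q"]`, a hypothetical choice).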
Industry relevance
Why teams track this benchmark
SWE-bench Verified is the most widely cited benchmark for AI code generation. It tests end-to-end software engineering: understanding the bug report, navigating the codebase, writing the fix, and ensuring it passes tests. Crossing 90% (Mythos at 93.9%) suggests that a frontier model can resolve most issues of the kind this benchmark samples, real bugs in established Python repositories with test suites, without human intervention.
Practical takeaways
By role
Models above 85% on SWE-bench Verified can handle most automated bug-fix workflows. Build CI/CD integrations that auto-generate patches for failing tests.
The 90% milestone signals a market inflection for AI-assisted software development. Companies building on top-scoring models have a credible path to autonomous code maintenance.
The full dataset and evaluation harness are open-source on GitHub. 500 verified tasks across 12 repos. The most reproducible evaluation for code generation research.
Frequently asked
About SWE-bench Verified
What does SWE-bench Verified measure?
SWE-bench Verified measures whether a model can resolve real GitHub issues. It consists of 500 human-validated tasks from 12 real Python repositories (Django, Flask, scikit-learn, sympy, and others). Each task requires the model to produce a git patch that resolves the issue and passes the test suite. The verified subset eliminates ambiguous tasks from the original SWE-bench. Claude Mythos Preview leads at 93.9%, crossing 90% for the first time in 2026. Opus 4.6 scores 80.8%. The benchmark remains the most-cited evaluation for code-generation capability. 24 AI models have been tested on it. Scores range from 31.0 to 93.9 out of 100.
Which model leads on SWE-bench Verified?
Claude Mythos Preview from Anthropic leads SWE-bench Verified with a score of 93.9. The median score across the 24 tested models is 73.4.
Is SWE-bench Verified saturated?
No · the top score is 93.9 out of 100, which leaves 6.1 points of headroom. There is still meaningful room for improvement on SWE-bench Verified.
Does SWE-bench Verified predict performance on other benchmarks?
Yes · SWE-bench Verified scores have a Pearson correlation of 0.99 with SWE-bench Verified (Bash Only) across 11 shared models. Models that do well on SWE-bench Verified tend to do well on SWE-bench Verified (Bash Only).
How often is SWE-bench Verified data refreshed?
BenchGecko pulls updates daily. New model scores on SWE-bench Verified appear as soon as they are published by Epoch AI or the model provider.
- Category: Code
- Max score: 100
- Scoring: Binary pass/fail per task. The model's git patch must apply cleanly to the repository and all associated test cases must pass. No partial credit. Final score = percentage of tasks passed out of 500.
- Models: 24
- Updated: 2026-04-07
“SWE-bench Verified crossing 90% is the milestone that changes software engineering economics. Models that score above 85% here can handle the majority of bug-fix PRs autonomously. The race now shifts to SWE-bench Pro and harder evaluations.”
Top on SWE-bench Verified
- Claude Mythos Preview · 93.9
- Claude Opus 4.7 · 87.6
- Claude Opus 4.6 · 78.7
- GPT-5.4 · 76.9
- Claude Opus 4.5 · 76.7
More code benchmarks
Same category · related evaluations