SWE-bench
A benchmark where models attempt real GitHub issues · judged by whether their patch passes the project's test suite.
Basic
SWE-bench (Princeton, 2023) draws 2,294 task instances from merged pull requests across 12 popular Python repos. Each task gives the model the issue text and the codebase state before the fix. The model's patch is scored by running the project's own test suite. SWE-bench Verified is the human-validated 500-task subset most labs quote.
Deep
Each SWE-bench task is a real GitHub issue paired with the exact repo state before the merge. The model must output a patch that makes the hidden test suite pass without regressions. Popular variants: SWE-bench Verified (500 hand-checked tasks, the canonical leaderboard), SWE-bench Lite (300 easier tasks), Multimodal (screenshots), Multilingual (non-Python repos). The leaderboard separates agent-style solutions (tool use + multi-turn) from single-shot patch generation. Top of the board in 2026: Claude 4.5 Opus (high reasoning) ~77%, Claude Mythos Preview ~78%, GPT-5 ~71%, DeepSeek V3.2 ~64%.
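The per-task record described above can be sketched as a small data structure. The field names below mirror the published dataset schema (instance id, repo, base commit, problem statement, and the two test groups) but are an approximation, not the official loader; the sample values are illustrative:

```python
from dataclasses import dataclass


@dataclass
class SweBenchTask:
    """One SWE-bench instance: a real issue frozen at the pre-fix commit."""
    instance_id: str        # conventionally "owner__repo-issuenumber"
    repo: str               # GitHub "owner/name"
    base_commit: str        # repo state the model starts from
    problem_statement: str  # the issue text shown to the model
    fail_to_pass: list      # tests the patch must make pass
    pass_to_pass: list      # tests that must keep passing (no regressions)


# Illustrative instance (commit hash and test names are placeholders):
task = SweBenchTask(
    instance_id="django__django-11099",
    repo="django/django",
    base_commit="abc123",
    problem_statement="Validator allows trailing newline in usernames...",
    fail_to_pass=["test_validator_rejects_newline"],
    pass_to_pass=["test_validator_accepts_ascii"],
)
print(task.repo)  # django/django
```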
Expert
SWE-bench scoring: an instance passes if and only if the modified repo passes the full pytest suite defined in the PR, including regression tests. The original paper showed frontier models scoring under 2% · progress since then tracks improvements in tool use, long-context retrieval, and iterative repair rather than raw LLM capability. Agent scaffolding matters: the same LLM with a different agent scaffold produces 15-30 point score swings. Known critiques: test leakage (tests accessible to the model), train-set contamination (popular repos in training data), and task difficulty distribution. SWE-bench Verified + Multimodal reduce leakage concerns; Multilingual (Aug 2024) extends coverage.
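The pass/fail rule above can be written out directly. This is a minimal sketch of the resolution criterion, not the official harness (which runs each repo's suite in a pinned environment); `fail_to_pass` and `pass_to_pass` name the two groups of tests the harness checks after applying the model's patch:

```python
def is_resolved(results, fail_to_pass, pass_to_pass):
    """results maps test-id -> True (passed) / False (failed) after the
    model's patch is applied. An instance counts as resolved iff every
    previously-failing target test now passes AND every previously-passing
    test still passes (no regressions). A missing result counts as a fail."""
    return (all(results.get(t, False) for t in fail_to_pass)
            and all(results.get(t, False) for t in pass_to_pass))


# A patch that fixes the bug but breaks an unrelated test is NOT resolved:
after_patch = {"test_bugfix": True, "test_existing": False}
print(is_resolved(after_patch, ["test_bugfix"], ["test_existing"]))  # False
```

This is why agent scaffolds that iterate (run the tests, read the failures, repair) outperform single-shot patch generation: one regression anywhere zeroes out the instance.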
SWE-bench Verified became the de facto hiring signal for coding agents in 2025. Frontier labs now ship agent-specific SWE-bench tuning for every release.
Depending on why you're here
- 2,294 real GitHub issues across 12 Python repos · Verified subset of 500 is canonical
- Scoring: modified repo must pass the original PR test suite
- Agent scaffold produces 15-30 point swings on the same base model
- If you're shipping a coding agent, your SWE-bench score is your hiring reference
- Test your agent on SWE-bench Verified before building custom eval sets
- Claude + a proper agent scaffold hits 75%+ · the LLM alone does not
- SWE-bench leadership correlates with coding-agent ARR
- Cursor, Cognition, and Anthropic invest heavily in SWE-bench specialization
- A benchmark plateau signals commoditization approaching
- A test where AI tries to fix real bugs in real open-source projects
- Score = percentage of bugs it actually fixes correctly
- Top AI agents now fix 3 out of 4 · two years ago they fixed almost none
Often confused with
HumanEval is 164 hand-crafted Python problems with unit tests. SWE-bench is real GitHub issues in real repos. HumanEval has saturated; SWE-bench is where the action is.
Aider polyglot tests editing across multiple languages. SWE-bench is Python-centric (Verified subset at least).
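For concreteness, a HumanEval-style item is a self-contained stub plus hidden unit tests, a far cry from a repo-scale issue. The sketch below paraphrases the first problem in the public HumanEval set; the model sees only the stub and docstring and must complete the body:

```python
def has_close_elements(numbers, threshold):
    """Return True if any two numbers in the list are closer to each
    other than threshold."""
    # One idiomatic completion: check every unordered pair.
    return any(abs(a - b) < threshold
               for i, a in enumerate(numbers)
               for b in numbers[i + 1:])


# HumanEval-style check: hidden unit tests assert on behaviour.
assert has_close_elements([1.0, 2.0, 3.9], 0.3) is False
assert has_close_elements([1.0, 2.8, 3.0], 0.3) is True
```

Contrast with SWE-bench, where the "problem" is an issue thread, the "solution" is a repo-wide patch, and the "unit tests" are the project's own suite.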
SWE-bench is the single most-watched AI benchmark of 2026. Every coding agent release ships a SWE-bench number first.
Read the primary sources
- SWE-bench paper (Princeton, 2023) · arxiv.org
- Official leaderboard · www.swebench.com
- GitHub · princeton-nlp/SWE-bench · github.com