SWE-bench
A benchmark where models attempt real GitHub issues · judged by whether their patch passes the project's test suite.
Basic
SWE-bench (Princeton, 2023) draws 2,294 task instances from merged pull requests across 12 popular Python repos. Each task gives the model the issue text and the codebase state before the fix. The model's patch is scored by running the project's own test suite. SWE-bench Verified is the human-validated 500-task subset most labs quote.
Deep
Each SWE-bench task is a real GitHub issue paired with the exact repo state before the merge. The model must output a patch that makes the hidden test suite pass without regressions. Popular variants: SWE-bench Verified (500 hand-checked tasks, the canonical leaderboard), SWE-bench Lite (300 easier tasks), Multimodal (screenshots), Multilingual (non-Python repos). The leaderboard separates agent-style solutions (tool use + multi-turn) from single-shot patch generation. Top of the board in 2026: Claude 4.5 Opus (high reasoning) ~77%, Claude Mythos Preview ~78%, GPT-5 ~71%, DeepSeek V3.2 ~64%.
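The per-task record described above can be sketched as a small data structure. The field names below mirror the published dataset schema (instance id, repo, base commit, problem statement, and the two test groups) but are an approximation, not the official loader; the sample values are illustrative:

```python
from dataclasses import dataclass


@dataclass
class SweBenchTask:
    """One SWE-bench instance: a real issue frozen at the pre-fix commit."""
    instance_id: str        # conventionally "owner__repo-issuenumber"
    repo: str               # GitHub "owner/name"
    base_commit: str        # repo state the model starts from
    problem_statement: str  # the issue text shown to the model
    fail_to_pass: list      # tests the patch must make pass
    pass_to_pass: list      # tests that must keep passing (no regressions)


# Illustrative instance (commit hash and test names are placeholders):
task = SweBenchTask(
    instance_id="django__django-11099",
    repo="django/django",
    base_commit="abc123",
    problem_statement="Validator allows trailing newline in usernames...",
    fail_to_pass=["test_validator_rejects_newline"],
    pass_to_pass=["test_validator_accepts_ascii"],
)
print(task.repo)  # django/django
```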
Expert
SWE-bench scoring: an instance passes if and only if the modified repo passes the full pytest suite defined in the PR, including regression tests. The original paper showed frontier models scoring under 2% · progress since then tracks improvements in tool use, long-context retrieval, and iterative repair rather than raw LLM capability. Agent scaffolding matters: the same LLM with a different agent scaffold produces 15-30 point score swings. Known critiques: test leakage (tests accessible to the model), train-set contamination (popular repos in training data), and task difficulty distribution. SWE-bench Verified + Multimodal reduce leakage concerns; Multilingual (Aug 2024) extends coverage.
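The pass/fail rule above can be written out directly. This is a minimal sketch of the resolution criterion, not the official harness (which runs each repo's suite in a pinned environment); `fail_to_pass` and `pass_to_pass` name the two groups of tests the harness checks after applying the model's patch:

```python
def is_resolved(results, fail_to_pass, pass_to_pass):
    """results maps test-id -> True (passed) / False (failed) after the
    model's patch is applied. An instance counts as resolved iff every
    previously-failing target test now passes AND every previously-passing
    test still passes (no regressions). A missing result counts as a fail."""
    return (all(results.get(t, False) for t in fail_to_pass)
            and all(results.get(t, False) for t in pass_to_pass))


# A patch that fixes the bug but breaks an unrelated test is NOT resolved:
after_patch = {"test_bugfix": True, "test_existing": False}
print(is_resolved(after_patch, ["test_bugfix"], ["test_existing"]))  # False
```

This is why agent scaffolds that iterate (run the tests, read the failures, repair) outperform single-shot patch generation: one regression anywhere zeroes out the instance.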
SWE-bench Verified became the de facto hiring signal for coding agents in 2025. Frontier labs now ship agent-specific SWE-bench tuning for every release.
Depending on why you're here
- 2,294 real GitHub issues across 12 Python repos · Verified subset of 500 is canonical
- Scoring: modified repo must pass the original PR test suite
- Agent scaffold produces 15-30 point swings on the same base model
- If you're shipping a coding agent, your SWE-bench score is your hiring reference
- Test your agent on SWE-bench Verified before building custom eval sets
- Claude + a proper agent scaffold hits 75%+ · the LLM alone does not
- SWE-bench leadership correlates with coding-agent ARR
- Cursor, Cognition, and Anthropic invest heavily in SWE-bench specialization
- A benchmark plateau signals commoditization approaching
- A test where AI tries to fix real bugs in real open-source projects
- Score = percentage of bugs it actually fixes correctly
- Top AI agents now fix 3 out of 4 · two years ago they fixed almost none
Often confused with
HumanEval is 164 hand-crafted Python problems with unit tests. SWE-bench is real GitHub issues in real repos. HumanEval has saturated; SWE-bench is where the action is.
Aider polyglot tests editing across multiple languages. SWE-bench is Python-centric (Verified subset at least).
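For concreteness, a HumanEval-style item is a self-contained stub plus hidden unit tests, a far cry from a repo-scale issue. The sketch below paraphrases the first problem in the public HumanEval set; the model sees only the stub and docstring and must complete the body:

```python
def has_close_elements(numbers, threshold):
    """Return True if any two numbers in the list are closer to each
    other than threshold."""
    # One idiomatic completion: check every unordered pair.
    return any(abs(a - b) < threshold
               for i, a in enumerate(numbers)
               for b in numbers[i + 1:])


# HumanEval-style check: hidden unit tests assert on behaviour.
assert has_close_elements([1.0, 2.0, 3.9], 0.3) is False
assert has_close_elements([1.0, 2.8, 3.0], 0.3) is True
```

Contrast with SWE-bench, where the "problem" is an issue thread, the "solution" is a repo-wide patch, and the "unit tests" are the project's own suite.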
SWE-bench is the single most-watched AI benchmark of 2026. Every coding agent release ships a SWE-bench number first.
Read the primary sources
- SWE-bench paper (Princeton, 2023) · arxiv.org
- Official leaderboard · www.swebench.com
- GitHub · princeton-nlp/SWE-bench · github.com