Benchmark · Code · Settled

SWE-Bench Verified (Bash Only)

A curated subset of SWE-bench in which models fix real bugs in Python repositories using only bash commands, with no agent frameworks.

Updated 2025-12-10
Models tested: 19
Top score: 74.4 (Claude Opus 4.5)
Median: 58.4 (min 9.1)
Top-5 spread: σ 3.0 (Competitive)

Best score over time · one chart, every benchmark

[Chart: SWE-Bench Verified (Bash Only) · 19 models · frontier running max. Y-axis: score, 0 to 100 (higher is better). X-axis: release date, Nov 24 to Dec 25. Source: benchgecko.ai/benchmark/swe-bench-verified-bash-only]
Frontier on SWE-Bench Verified (Bash Only) rose from 21.6 to 74.4 in 12 months · +52.8 points · latest leader Claude Opus 4.5 from Anthropic.
Pink dots = frontier records · 6 total
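
The "frontier running max" in the chart can be read as the best score seen so far when models are ordered by release date, with a frontier record being any model that raises that maximum. Below is a minimal sketch of that reading, not the site's actual implementation. The function name `frontier_records`, the model names, the dates, and the intermediate scores are hypothetical placeholders; only the endpoint scores 21.6 and 74.4 come from the page.

```python
from datetime import date


def frontier_records(results):
    """Return the entries that set a new running maximum, in release order.

    `results` is an iterable of (release_date, model_name, score) tuples.
    """
    records = []
    best = float("-inf")
    for release_date, model, score in sorted(results, key=lambda r: r[0]):
        if score > best:  # strictly better than every earlier release
            best = score
            records.append((release_date, model, score))
    return records


# Hypothetical data for illustration only: names, dates, and the middle
# scores are made up; 21.6 and 74.4 are the endpoints quoted on the page.
results = [
    (date(2024, 11, 1), "model-a", 21.6),
    (date(2025, 2, 1), "model-b", 18.0),   # below the frontier, not a record
    (date(2025, 5, 1), "model-c", 40.0),
    (date(2025, 9, 1), "model-d", 35.5),   # below the frontier, not a record
    (date(2025, 12, 1), "model-e", 74.4),
]

records = frontier_records(results)

# Gain between the first and the latest frontier record, rounded to one decimal.
gain = round(records[-1][2] - records[0][2], 1)
print(len(records), gain)
```

With the real leaderboard data, the same loop would yield the 6 frontier records the chart marks, and the gain would match the +52.8 points quoted above.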

Same category · related evaluations