SWE-Bench Verified (Bash Only)
SWE-Bench Verified (Bash Only) is a curated subset of SWE-bench in which models fix real bugs in Python repositories using only bash commands, with no agent frameworks.
Models Tested: 32
Top Score: 74.4
Average Score: 49.4
Rankings
| # | Model | Score |
|---|---|---|
| 1 | | 74.4 |
| 2 | | 74.2 |
| 3 | | 71.8 |
| 4 | | 71.8 |
| 5 | | 70.6 |
| 6 | | 67.6 |
| 7 | | 66.0 |
| 8 | | 66.0 |
| 9 | | 65.0 |
| 10 | | 65.0 |
| 11 | | 64.9 |
| 12 | Kimi K2 Thinking (moonshotai) | 63.4 |
| 13 | | 59.8 |
| 14 | | 58.4 |
| 15 | | 55.4 |
| 16 | | 55.4 |
| 17 | GLM 4.5 (z-ai) | 54.2 |
| 18 | | 52.8 |
| 19 | | 52.8 |
| 20 | | 45.0 |
| 21 | Kimi K2 0711 (moonshotai) | 43.8 |
| 22 | | 39.6 |
| 23 | | 34.8 |
| 24 | | 28.7 |
| 25 | | 26.0 |
| 26 | | 26.0 |
| 27 | | 26.0 |
| 28 | | 26.0 |
| 29 | | 23.9 |
| 30 | | 21.6 |
| 31 | | 21.0 |
| 32 | | 9.1 |
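
The summary figures at the top of the page (models tested, top score, average score) can be recomputed from the Score values in the rankings. The sketch below is only a sanity check: it assumes the average is a plain arithmetic mean rounded to one decimal place, which the page does not state explicitly, and the variable names are illustrative.

```python
# Scores copied from the rankings table, in rank order.
scores = [
    74.4, 74.2, 71.8, 71.8, 70.6, 67.6, 66.0, 66.0,
    65.0, 65.0, 64.9, 63.4, 59.8, 58.4, 55.4, 55.4,
    54.2, 52.8, 52.8, 45.0, 43.8, 39.6, 34.8, 28.7,
    26.0, 26.0, 26.0, 26.0, 23.9, 21.6, 21.0, 9.1,
]

models_tested = len(scores)
top_score = max(scores)
# Assumption: the displayed average is the unweighted mean, rounded to 1 decimal.
average_score = round(sum(scores) / models_tested, 1)

print(f"Models Tested: {models_tested}")   # Models Tested: 32
print(f"Top Score: {top_score}")           # Top Score: 74.4
print(f"Average Score: {average_score}")   # Average Score: 49.4
```

Under that assumption the recomputed values match the figures shown on the page.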