Benchmark · Code · Competitive

GSO-Bench

GSO-Bench evaluates AI models on real-world open-source software engineering tasks, testing their ability to understand and resolve actual GitHub issues.

Updated 2026-02-04
Models tested: 18
Top score: 33.3 (Claude Opus 4.6)
Median: 6.9 (min 0.1)
Top-5 spread: σ 6.6 (wide open)

Best score over time · one chart, every benchmark

[Chart: GSO-Bench · 16 models · frontier running max. Y-axis: score (0 to 100, higher is better). X-axis: release date (Nov 24 to Feb 26). Source: benchgecko.ai/benchmark/gso-bench]
The frontier on GSO-Bench rose from 0.1 to 33.3 in 15 months (+33.2 points). The latest leader is Claude Opus 4.6 from Anthropic.
Pink dots = frontier records · 8 total

Same category · related evaluations