Benchmark · CodeSettled

Terminal Bench

Terminal-Bench 2.0 · evaluates AI agents on real terminal-based coding tasks · writing scripts, debugging, running tests, and managing projects entirely through command-line interaction. It tests both code quality and terminal fluency. Claude Opus 4.7 scores 69.4%, a strong showing for agentic terminal work.

Updated 2026-04-23


Scoring: Task completion rate based on filesystem state comparison. The model's terminal session produces a final state that is diff-checked against the expected output.
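To make the scoring concrete, here is a minimal sketch of a filesystem-state diff check of the kind described above. This is an illustration under assumptions, not the benchmark's actual harness; `dirs_match` is a hypothetical helper built on Python's standard-library `filecmp`.

```python
import filecmp
from pathlib import Path

def dirs_match(actual: Path, expected: Path) -> bool:
    """Recursively compare two directory trees.

    Returns True only if both trees contain the same files and
    every common file matches byte-for-byte.
    """
    cmp = filecmp.dircmp(actual, expected)
    # Any file present on only one side, or unreadable, is a mismatch.
    if cmp.left_only or cmp.right_only or cmp.funny_files:
        return False
    # Compare common files by content (shallow=False), not just os.stat.
    _, mismatch, errors = filecmp.cmpfiles(
        actual, expected, cmp.common_files, shallow=False
    )
    if mismatch or errors:
        return False
    # Recurse into shared subdirectories.
    return all(dirs_match(actual / d, expected / d) for d in cmp.common_dirs)
```

A harness could run the agent's terminal session in a sandbox, then call `dirs_match(sandbox_dir, expected_dir)` to decide task completion.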

Models tested
31
Top score
82.7 · GPT-5.5
Median
42.7 · min 11.5
Top-5 spread
σ 3.0 · Competitive

Best score over time · one chart, every benchmark

[Chart] Terminal Bench · 27 models · frontier running max · score (0–100) vs. release date, Jun 25 – Apr 26 · benchgecko.ai/benchmark/terminal-bench
Frontier on Terminal Bench rose from 17.1 to 82.7 in 10 months · +65.6 points · latest leader GPT-5.5 from OpenAI.
Pink dots = frontier records · 10 total
Details
Category
Code
Creator
Terminal Research
Max score
100
Modality
Code
Scoring
Task completion rate based on filesystem state comparison. The model's terminal session produces a final state that is diff-checked against the expected output.
Models
31
Updated
2026-04-23
Tests
Terminal fluency · Script writing · Debugging · Git operations · Multi-file projects
Does not test
Vision · Long context · Scientific reasoning · Safety
Gecko's Take

Terminal-Bench is the developer's benchmark. It tests exactly what AI coding assistants do in practice: writing code, debugging it, and running tests, all through the terminal. Mythos at 82.0% suggests terminal-based AI coding is approaching reliability.

Same category · related evaluations