Terminal Bench
Terminal-Bench 2.0 evaluates AI agents on real terminal-based coding tasks: writing scripts, debugging, running tests, and managing projects entirely through command-line interaction. It tests both code quality and terminal fluency. Claude Opus 4.7 scores 69.4%, demonstrating significant agentic terminal competence.
Scoring: Task completion rate based on filesystem state comparison. The model's terminal session produces a final state that is diff-checked against the expected output.
The Frontier
Best score over time · one chart, every benchmark
Full rankings
31 models tested · sorted by score · includes 6 verified scores
| # | Model | Score |
|---|---|---|
| 1 | GPT-5.5 | 82.7 |
| 2 | Claude Mythos Preview | 82.0 |
| 3 | Gemini 3.1 Pro Preview | 78.4 |
| 4 | GPT-5.3-Codex | 77.3 |
| 5 | GPT-5.4 | 75.1 |
| 6 | | 74.7 |
| 7 | | 69.4 |
| 8 | | 69.4 |
| 9 | | 68.5 |
| 10 | | 64.9 |
| 11 | | 64.3 |
| 12 | | 63.1 |
| 13 | | 52.4 |
| 14 | | 49.6 |
| 15 | | 47.6 |
| 16 | | 46.5 |
| 17 | | 43.2 |
| 18 | | 42.2 |
| 19 | | 39.6 |
| 20 | | 38.0 |
| 21 | | 35.7 |
| 22 | | 35.5 |
| 23 | | 34.8 |
| 24 | | 33.4 |
| 25 | | 32.6 |
| 26 | | 27.8 |
| 27 | | 27.2 |
| 28 | | 24.5 |
| 29 | | 18.7 |
| 30 | | 17.1 |
| 31 | | 11.5 |
Score distribution
Where models cluster
Correlated benchmarks
Pearson r · original research
Benchmarks that track with Terminal Bench
Pearson correlation across models scored on both benchmarks. Closer to 1 = strongly predictive.
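For reference, the statistic here is the standard Pearson r, computed over the models that appear on both leaderboards. A minimal sketch in Python of how such a value can be derived; the scores below are placeholders, not the real leaderboard data:

```python
# Pearson correlation between two benchmarks, computed over the models
# that appear on both leaderboards. Scores are illustrative placeholders.
from statistics import correlation  # Python 3.10+; Pearson by default

terminal_bench = {"model-a": 82.7, "model-b": 78.4, "model-c": 69.4, "model-d": 47.6}
cybench = {"model-a": 61.0, "model-b": 55.2, "model-c": 48.9, "model-d": 30.1}

# Intersect the two leaderboards to find shared models.
shared = sorted(terminal_bench.keys() & cybench.keys())
x = [terminal_bench[m] for m in shared]
y = [cybench[m] for m in shared]

r = correlation(x, y)  # closer to 1.0 = more strongly predictive
print(f"Pearson r over {len(shared)} shared models: {r:.2f}")
```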
How it works
Evaluation methodology
Terminal-Bench 2.0 evaluates AI agents on real terminal-based coding tasks that must be completed entirely through command-line interaction. Tasks include writing scripts, debugging existing code, running and interpreting test suites, managing git repositories, and orchestrating multi-file projects. Models interact with a real shell environment and are evaluated on whether the final state of the filesystem matches the expected output. Tasks cover Python, JavaScript, Bash, and system administration.
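To make the pass/fail check concrete, here is a minimal illustrative sketch of a filesystem-state verifier in Python. It is not the actual Terminal-Bench harness; `task_passes`, `workdir`, and `expected_dir` are hypothetical names, and real tasks may use richer checks than byte-identical file comparison.

```python
# Minimal sketch: does the agent's final filesystem state match the
# expected output? Illustrative only, not the Terminal-Bench harness.
import filecmp
from pathlib import Path

def task_passes(workdir: Path, expected_dir: Path) -> bool:
    """Return True if every expected file exists under the agent's
    working directory with byte-identical contents."""
    for expected in expected_dir.rglob("*"):
        if expected.is_dir():
            continue
        produced = workdir / expected.relative_to(expected_dir)
        if not produced.is_file():
            return False  # agent never created the file
        if not filecmp.cmp(expected, produced, shallow=False):
            return False  # contents diverge from the reference output
    return True

# The benchmark score is then a task completion rate, e.g.:
# completion_rate = 100 * sum(results) / len(results)
```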
Industry relevance
Why teams track this benchmark
Terminal-Bench directly measures the capability that powers AI coding assistants like Claude Code, Cursor, and Windsurf. A model's Terminal-Bench score is a strong signal of how well it will perform as an interactive coding partner, making it one of the most closely watched benchmarks among developers.
Practical takeaways
By role
- Product and engineering teams: Terminal-Bench scores directly predict AI coding assistant quality. If you are evaluating models for a coding product, this is your primary benchmark.
- Analysts: Terminal-Bench leaders capture the developer tools market. GPT-5.4 at 75.1% and Claude Mythos Preview at 82.0% are the current production-ready tier.
- Researchers: The multi-step nature of terminal tasks makes this benchmark ideal for studying planning and error recovery in coding agents.
Frequently asked
About Terminal Bench
What does Terminal Bench measure?
Terminal-Bench 2.0 evaluates AI agents on real terminal-based coding tasks: writing scripts, debugging, running tests, and managing projects entirely through command-line interaction. It tests both code quality and terminal fluency. Claude Opus 4.7 scores 69.4%, demonstrating significant agentic terminal competence. 31 AI models have been tested on it. Scores range from 11.5 to 82.7 out of 100.
Which model leads on Terminal Bench?
GPT-5.5 from OpenAI leads Terminal Bench with a score of 82.7. The median score across 31 tested models is 46.5.
Is Terminal Bench saturated?
No · the top score is 82.7 out of 100. There is still meaningful room for improvement on Terminal Bench.
Does Terminal Bench predict performance on other benchmarks?
Yes · Terminal Bench scores correlate 0.98 with Cybench, though only 5 models have been scored on both, so the sample is small. Models that do well on Terminal Bench tend to do well on Cybench.
How often is Terminal Bench data refreshed?
BenchGecko pulls updates daily. New model scores on Terminal Bench appear as soon as they are published by Epoch AI or the model provider.
- Category: Code
- Creator: Terminal Research
- Max score: 100
- Modality: Code
- Scoring: Task completion rate based on filesystem state comparison. The model's terminal session produces a final state that is diff-checked against the expected output.
- Models: 31
- Updated: 2026-04-23
“Terminal-Bench is the developer's benchmark. It tests exactly what AI coding assistants do: write code, debug it, run tests, all through the terminal. Mythos at 82.0% means terminal-based AI coding is approaching reliability.”
Top on Terminal Bench
- GPT-5.5 · 82.7
- Claude Mythos Preview · 82.0
- Gemini 3.1 Pro Preview · 78.4
- GPT-5.3-Codex · 77.3
- GPT-5.4 · 75.1