Terminal Bench
Terminal-Bench 2.0 evaluates AI agents on real terminal-based coding tasks: writing scripts, debugging, running tests, and managing projects entirely through command-line interaction. It tests both code quality and terminal fluency. Claude Opus 4.7 scores 69.4%, demonstrating significant agentic terminal competence.
Scoring: Task completion rate based on filesystem state comparison. The model's terminal session produces a final state that is diff-checked against the expected output.
The Frontier
Best score over time · one chart, every benchmark
Full rankings
31 models tested · sorted by score · includes 6 verified scores
| # | Model | Score |
|---|---|---|
| 1 | GPT-5.5 | 82.7 |
| 2 | Claude Mythos Preview | 82.0 |
| 3 | Gemini 3.1 Pro Preview | 78.4 |
| 4 | GPT-5.3-Codex | 77.3 |
| 5 | GPT-5.4 | 75.1 |
| 6 | | 74.7 |
| 7 | | 69.4 |
| 8 | | 69.4 |
| 9 | | 68.5 |
| 10 | | 64.9 |
| 11 | | 64.3 |
| 12 | | 63.1 |
| 13 | | 52.4 |
| 14 | | 49.6 |
| 15 | | 47.6 |
| 16 | | 46.5 |
| 17 | | 43.2 |
| 18 | | 42.2 |
| 19 | | 39.6 |
| 20 | | 38.0 |
| 21 | | 35.7 |
| 22 | | 35.5 |
| 23 | | 34.8 |
| 24 | | 33.4 |
| 25 | | 32.6 |
| 26 | | 27.8 |
| 27 | | 27.2 |
| 28 | | 24.5 |
| 29 | | 18.7 |
| 30 | | 17.1 |
| 31 | | 11.5 |
Score distribution
Where models cluster
Correlated benchmarks
Pearson r · original research
Benchmarks that track with Terminal Bench
Pearson correlation across models scored on both benchmarks. Closer to 1 = strongly predictive.
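For reference, the statistic here is the standard Pearson r, computed over the models that appear on both leaderboards. A minimal sketch in Python of how such a value can be derived; the scores below are placeholders, not the real leaderboard data:

```python
# Pearson correlation between two benchmarks, computed over the models
# that appear on both leaderboards. Scores are illustrative placeholders.
from statistics import correlation  # Python 3.10+; Pearson by default

terminal_bench = {"model-a": 82.7, "model-b": 78.4, "model-c": 69.4, "model-d": 47.6}
cybench = {"model-a": 61.0, "model-b": 55.2, "model-c": 48.9, "model-d": 30.1}

# Intersect the two leaderboards to find shared models.
shared = sorted(terminal_bench.keys() & cybench.keys())
x = [terminal_bench[m] for m in shared]
y = [cybench[m] for m in shared]

r = correlation(x, y)  # closer to 1.0 = more strongly predictive
print(f"Pearson r over {len(shared)} shared models: {r:.2f}")
```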
How it works
Evaluation methodology
Terminal-Bench 2.0 evaluates AI agents on real terminal-based coding tasks that must be completed entirely through command-line interaction. Tasks include writing scripts, debugging existing code, running and interpreting test suites, managing git repositories, and orchestrating multi-file projects. Models interact with a real shell environment and are evaluated on whether the final state of the filesystem matches the expected output. Tasks cover Python, JavaScript, Bash, and system administration.
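To make the pass/fail check concrete, here is a minimal illustrative sketch of a filesystem-state verifier in Python. It is not the actual Terminal-Bench harness; `task_passes`, `workdir`, and `expected_dir` are hypothetical names, and real tasks may use richer checks than byte-identical file comparison.

```python
# Minimal sketch: does the agent's final filesystem state match the
# expected output? Illustrative only, not the Terminal-Bench harness.
import filecmp
from pathlib import Path

def task_passes(workdir: Path, expected_dir: Path) -> bool:
    """Return True if every expected file exists under the agent's
    working directory with byte-identical contents."""
    for expected in expected_dir.rglob("*"):
        if expected.is_dir():
            continue
        produced = workdir / expected.relative_to(expected_dir)
        if not produced.is_file():
            return False  # agent never created the file
        if not filecmp.cmp(expected, produced, shallow=False):
            return False  # contents diverge from the reference output
    return True

# The benchmark score is then a task completion rate, e.g.:
# completion_rate = 100 * sum(results) / len(results)
```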
Industry relevance
Why teams track this benchmark
Terminal-Bench directly measures the capability that powers AI coding assistants like Claude Code, Cursor, and Windsurf. A model's Terminal-Bench score is a strong signal of how well it will perform as an interactive coding partner, making it one of the most closely watched benchmarks among developers.
Practical takeaways
By role
- Product and engineering teams: Terminal-Bench scores directly predict AI coding assistant quality. If you are evaluating models for a coding product, this is your primary benchmark.
- Analysts: Terminal-Bench leaders capture the developer tools market. GPT-5.4 at 75.1% and Claude Mythos Preview at 82.0% are the current production-ready tier.
- Researchers: The multi-step nature of terminal tasks makes this benchmark ideal for studying planning and error recovery in coding agents.
Frequently asked
About Terminal Bench
What does Terminal Bench measure?
Terminal-Bench 2.0 evaluates AI agents on real terminal-based coding tasks: writing scripts, debugging, running tests, and managing projects entirely through command-line interaction. It tests both code quality and terminal fluency. Claude Opus 4.7 scores 69.4%, demonstrating significant agentic terminal competence. 31 AI models have been tested on it. Scores range from 11.5 to 82.7 out of 100.
Which model leads on Terminal Bench?
GPT-5.5 from OpenAI leads Terminal Bench with a score of 82.7. The median score across 31 tested models is 46.5.
Is Terminal Bench saturated?
No · the top score is 82.7 out of 100. There is still meaningful room for improvement on Terminal Bench.
Does Terminal Bench predict performance on other benchmarks?
Yes · Terminal Bench scores correlate 0.98 with Cybench, though only 5 models have been scored on both, so the sample is small. Models that do well on Terminal Bench tend to do well on Cybench.
How often is Terminal Bench data refreshed?
BenchGecko pulls updates daily. New model scores on Terminal Bench appear as soon as they are published by Epoch AI or the model provider.
- Category: Code
- Creator: Terminal Research
- Max score: 100
- Modality: Code
- Scoring: Task completion rate based on filesystem state comparison. The model's terminal session produces a final state that is diff-checked against the expected output.
- Models: 31
- Updated: 2026-04-23
“Terminal-Bench is the developer's benchmark. It tests exactly what AI coding assistants do: write code, debug it, run tests, all through the terminal. Mythos at 82.0% means terminal-based AI coding is approaching reliability.”
Top on Terminal Bench
- GPT-5.5 · 82.7
- Claude Mythos Preview · 82.0
- Gemini 3.1 Pro Preview · 78.4
- GPT-5.3-Codex · 77.3
- GPT-5.4 · 75.1