OSWorld
OSWorld · tests AI agents on real-world computer tasks across operating systems, including web browsing, file management, and application use.
The Frontier
Best score over time · one chart, every benchmark
Full rankings
12 models tested · sorted by score · includes 4 verified scores
| # | Model | Score |
|---|---|---|
| 1 | Claude Mythos Preview | 79.6 |
| 2 | GPT-5.5 | 78.7 |
| 3 | Claude Opus 4.7 | 78.0 |
| 4 | GPT-5.4 | 75.0 |
| 5 | Claude Opus 4.6 | 72.7 |
| 6 | | 66.3 |
| 7 | | 63.3 |
| 8 | | 62.9 |
| 9 | | 43.9 |
| 10 | | 35.8 |
| 11 | | 23.0 |
| 12 | | 5.0 |
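The summary figures for a leaderboard like this one (range, median, headroom below the maximum) can be recomputed directly from the listed scores. The following is a minimal sketch, not the site's own code: the `summarize` helper and its field names are hypothetical, and the only data used is the 12 scores from the rankings table.

```python
from statistics import median

# Scores copied from the rankings table (12 entries, each out of 100).
scores = [79.6, 78.7, 78.0, 75.0, 72.7, 66.3, 63.3, 62.9, 43.9, 35.8, 23.0, 5.0]


def summarize(values, max_score=100.0):
    """Return count, range, median, and headroom below the maximum score.

    `summarize` is a hypothetical helper written for this illustration.
    """
    top = max(values)
    return {
        "count": len(values),
        "top": top,
        "bottom": min(values),
        # With an even number of entries, the median is the mean of the two middle scores.
        "median": round(median(values), 1),
        # Headroom: how far the best score is from the maximum possible score.
        "headroom": round(max_score - top, 1),
    }


summary = summarize(scores)
print(summary)
```

Rounding to one decimal place matches the precision of the published scores and avoids floating-point noise in the subtraction.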
Score distribution
Where models cluster
Correlated benchmarks
Pearson r · original research
Benchmarks that track with OSWorld
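Pearson correlation across models scored on both benchmarks. Values closer to 1 indicate a stronger positive linear relationship, meaning a score on one benchmark is more predictive of the score on the other.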
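As an illustration of how a correlation over shared models can be computed, here is a minimal sketch. It assumes each benchmark is a mapping from model name to score. The helper names (`shared_scores`, `pearson_r`) and the placeholder model names and scores are hypothetical and are not data from this page.

```python
from math import sqrt


def shared_scores(bench_a, bench_b):
    """Return paired score lists for the models present in both benchmarks."""
    shared = sorted(set(bench_a) & set(bench_b))
    return [bench_a[m] for m in shared], [bench_b[m] for m in shared]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return cov / sqrt(var_x * var_y)


# Hypothetical placeholder scores, for illustration only.
bench_a = {"model-a": 80.0, "model-b": 70.0, "model-c": 60.0, "model-d": 40.0}
bench_b = {"model-a": 65.0, "model-b": 55.0, "model-c": 50.0, "model-e": 30.0}

xs, ys = shared_scores(bench_a, bench_b)  # only model-a, model-b, model-c are shared
r = pearson_r(xs, ys)
print(f"shared models: {len(xs)}, Pearson r = {r:.2f}")
```

Only models scored on both benchmarks enter the calculation, so the number of shared models matters: with few shared models, a single score can move the coefficient noticeably.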
Frequently asked
About OSWorld
What does OSWorld measure?
OSWorld tests AI agents on real-world computer tasks across operating systems, including web browsing, file management, and application use. Twelve AI models have been tested on it. Scores range from 5.0 to 79.6 out of 100.
Which model leads on OSWorld?
Claude Mythos Preview from Anthropic leads OSWorld with a score of 79.6. The median score across the 12 listed models is 64.8.
Is OSWorld saturated?
No · the top score is 79.6 out of 100, which leaves about 20 points of headroom on OSWorld.
Does OSWorld predict performance on other benchmarks?
Yes · OSWorld scores have a Pearson correlation of 0.96 with Terminal Bench scores across the 5 models scored on both. Models that do well on OSWorld tend to do well on Terminal Bench.
How often is OSWorld data refreshed?
BenchGecko pulls updates daily. New model scores on OSWorld appear as soon as they are published by Epoch AI or the model provider.
- Category: Agent
- Max score: 100
- Models: 12
- Updated: 2026-04-23
Top on OSWorld
- Claude Mythos Preview · 79.6
- GPT-5.5 · 78.7
- Claude Opus 4.7 · 78.0
- GPT-5.4 · 75.0
- Claude Opus 4.6 · 72.7
More agent benchmarks
Same category · related evaluations