Which model leads on OSWorld?

Claude Mythos Preview from Anthropic leads OSWorld with a score of 79.6. The median score across 9 tested models is 62.9.

Is OSWorld saturated?

No · the top score is 79.6 out of 100, so roughly 20 points of headroom remain on OSWorld.

OSWorld scores correlate 0.96 with Terminal Bench across 5 shared models, so models that do well on OSWorld tend to do well on Terminal Bench.
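As a rough illustration of what a correlation figure like this measures, here is a minimal sketch of a Pearson correlation over paired scores. It is an assumption that the page uses Pearson rather than a rank correlation, the function name `pearson` is illustrative, and the Terminal Bench scores are not listed here, so the inputs below are toy values, not benchmark data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient for two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from each mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Toy inputs only (NOT benchmark scores): a perfectly linear relationship.
toy_a = [1.0, 2.0, 3.0]
toy_b = [2.0, 4.0, 6.0]
toy_r = pearson(toy_a, toy_b)
```

With only 5 shared models, a single outlier can move the coefficient noticeably, so a value such as 0.96 is best read as a strong tendency rather than a precise estimate.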

BenchGecko pulls updates daily. New model scores on OSWorld appear as soon as they are published by Epoch AI or the model provider.

Benchmark · Agent · Competitive

Name: OSWorld Benchmark
Creator: BenchGecko
License: https://creativecommons.org/licenses/by/4.0/

OSWorld · tests AI agents on real-world computer tasks across operating systems, including web browsing, file management, and application use.

Updated 2026-04-23

Models tested

Top score: 79.6 (Claude Mythos Preview)

Median: 62.9 (min 5.0)

Top-5 spread: σ 7.4 (wide open)
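The summary figures can be checked against the leaderboard on this page. The sketch below copies the twelve scores from the table and computes the median and the top-5 spread two ways: over all rows, and over the nine rows that carry no "verified" tag. The page does not state which rows feed the summary cards or how the spread is defined; treating σ as a population standard deviation and filtering on the tag are both assumptions, and `ROWS` and `summarize` are illustrative names.

```python
from statistics import median, pstdev

# (model, score, has_verified_tag), copied from the leaderboard on this page.
ROWS = [
    ("Claude Mythos Preview", 79.6, False),
    ("GPT-5.5", 78.7, False),
    ("Claude Opus 4.7", 78.0, True),
    ("GPT-5.4", 75.0, True),
    ("Claude Opus 4.6", 72.7, True),
    ("Claude Opus 4.5", 66.3, False),
    ("Kimi K2.5", 63.3, False),
    ("Claude Sonnet 4.5", 62.9, False),
    ("Claude Sonnet 4", 43.9, False),
    ("Claude 3.7 Sonnet", 35.8, False),
    ("o3", 23.0, False),
    ("Qwen2.5 72B Instruct", 5.0, False),
]


def summarize(scores):
    """Return count, top, median, min, and top-5 spread, rounded to 1 decimal."""
    ranked = sorted(scores, reverse=True)
    return {
        "n": len(ranked),
        "top": ranked[0],
        "median": round(median(ranked), 1),
        "min": ranked[-1],
        # Assumption: "spread" is the population standard deviation of the top 5.
        "top5_spread": round(pstdev(ranked[:5]), 1),
    }


all_rows = summarize([score for _, score, _ in ROWS])
untagged = summarize([score for _, score, tagged in ROWS if not tagged])

print("all 12 rows:    ", all_rows)
print("9 untagged rows:", untagged)
```

Under these assumptions the nine untagged rows give a median of 62.9 and a top-5 spread of 7.4, matching the cards, while all twelve rows give 64.8 and 2.6. That is consistent with the "9 tested models" wording in the FAQ, but it is an inference from the numbers rather than a documented rule.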

Best score over time · one chart, every benchmark

Frontier on OSWorld rose from 5.0 to 79.6 in 19 months · +74.6 points · latest leader Claude Mythos Preview from Anthropic.
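The arithmetic behind the frontier summary is small enough to check directly. The sketch below uses only the three figures quoted in that sentence; the variable names are illustrative, and the per-month figure is a simple average that does not imply progress was linear.

```python
first_frontier = 5.0    # earliest frontier score quoted above
latest_frontier = 79.6  # current frontier score quoted above
months = 19             # elapsed time quoted above

# Total gain in points, rounded to the page's one-decimal precision.
gain = round(latest_frontier - first_frontier, 1)

# Average gain per month over the whole period (not a growth model).
avg_gain_per_month = round(gain / months, 2)

# Points remaining before the 100-point ceiling.
headroom = round(100 - latest_frontier, 1)

print(f"gain: +{gain} points")
print(f"average: {avg_gain_per_month} points/month")
print(f"headroom: {headroom} points")
```

This reproduces the +74.6 point gain and gives an average of about 3.93 points per month, with 20.4 points left below the maximum score.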

Pink dots = frontier records · 6 total

12 models tested · sorted by score · includes 4 verified scores

#	Model	Score	Price	Source
1	Claude Mythos Preview · Anthropic	79.6	—	Anthropic Mythos System Card, Apr 2026
2	GPT-5.5 · OpenAI	78.7	$5.00	—
3	Claude Opus 4.7 · Anthropic (verified)	78.0	—	Anthropic Opus 4.7 Announcement, Apr 2026
4	GPT-5.4 · OpenAI (verified)	75.0	—	OpenAI GPT-5.4 Blog, Mar 2026
5	Claude Opus 4.6 · Anthropic (verified)	72.7	—	Anthropic Opus 4.6 System Card, Feb 2026
6	Claude Opus 4.5 · Anthropic	66.3	$5.00	—
7	Kimi K2.5 · moonshotai	63.3	$0.44	—
8	Claude Sonnet 4.5 · Anthropic	62.9	$3.00	—
9	Claude Sonnet 4 · Anthropic	43.9	$3.00	—
10	Claude 3.7 Sonnet · Anthropic	35.8	$3.00	—
11	o3 · OpenAI	23.0	$2.00	—
12	Qwen2.5 72B Instruct · Alibaba Qwen	5.0	$0.36	—