Benchmark · Agent · Competitive

OSWorld

OSWorld tests AI agents on real-world computer tasks across operating systems, including web browsing, file management, and application use.

Updated: 2026-04-23
Models tested: 12
Top score: 79.6 (Claude Mythos Preview)
Median: 62.9 (min 5.0)
Top-5 spread: σ 7.4 (wide open)

Best score over time · one chart, every benchmark

[Chart: OSWorld · 9 models · frontier running max. Y-axis: score, 0 to 100. X-axis: release date, Sep 24 to Apr 26. Source: benchgecko.ai/benchmark/osworld]
Frontier on OSWorld rose from 5.0 to 79.6 in 19 months · +74.6 points · latest leader Claude Mythos Preview from Anthropic.
Pink dots = frontier records · 6 total
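
The "frontier running max" in the chart can be read as: walk through models in release order and keep only the scores that beat every earlier one. Below is a minimal sketch of that idea, not the site's actual implementation. The function names `frontier_records` and `months_between` are hypothetical, and only the two endpoint scores (5.0 and 79.6) come from this page; the exact release dates are assumed from the axis labels, with the day of month set to 1 for illustration.

```python
from datetime import date


def frontier_records(releases):
    """Return the (release_date, score) pairs that set a new running max."""
    records = []
    best = float("-inf")
    # Sort by release date so the running max follows the time axis.
    for released, score in sorted(releases):
        if score > best:  # strictly better than everything released earlier
            best = score
            records.append((released, score))
    return records


def months_between(start, end):
    """Whole calendar months from start to end, ignoring the day of month."""
    return (end.year - start.year) * 12 + (end.month - start.month)


# Assumption: only the scores are taken from the page; the dates are
# approximated from the chart's first and last axis labels.
releases = [
    (date(2026, 4, 1), 79.6),
    (date(2024, 9, 1), 5.0),
]

records = frontier_records(releases)
gain = round(records[-1][1] - records[0][1], 1)
span = months_between(records[0][0], records[-1][0])
print(gain, span)  # prints "74.6 19"
```

With the full list of release dates and scores as input, the length of `records` would correspond to the number of frontier records marked on the chart.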
Details
Category: Agent
Max score: 100
Models: 12
Updated: 2026-04-23

Same category · related evaluations