APEX-Agents
APEX-Agents · evaluates AI agents on complex, multi-step tasks requiring planning, tool use, and autonomous decision-making in realistic environments.
The Frontier · best score over time (chart)
Distribution · where models cluster (chart)
Correlated benchmarks
Pearson r · original research
Benchmarks that track with APEX-Agents
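Pearson correlation, computed over the models that have a score on both benchmarks. A value closer to 1 means a model's score on one benchmark is more strongly predictive of its score on the other.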
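A minimal sketch of how such a figure could be computed, assuming each benchmark is available as a mapping from model name to score. The function names, model names, and scores below are hypothetical and for illustration only; they are not taken from the BenchGecko data or its code.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def correlation_on_shared_models(scores_a, scores_b):
    """Restrict both benchmarks to models scored on each, then correlate."""
    shared = sorted(set(scores_a) & set(scores_b))
    xs = [scores_a[m] for m in shared]
    ys = [scores_b[m] for m in shared]
    return pearson_r(xs, ys), len(shared)


# Hypothetical scores, for illustration only.
bench_a = {"model-1": 1.0, "model-2": 2.0, "model-3": 3.0, "model-4": 9.0}
bench_b = {"model-1": 2.0, "model-2": 4.0, "model-3": 6.0, "model-5": 0.0}

r, n_shared = correlation_on_shared_models(bench_a, bench_b)
print(f"r = {r:.2f} across {n_shared} shared models")
```

Models that appear on only one benchmark are dropped before the calculation, which is why the number of shared models can be much smaller than the number of models tested on either benchmark.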
Full rankings
17 models tested · sorted by score
| # | Model | Score |
|---|---|---|
| 1 | GPT-5.4 | 35.9 |
| 2 | | 34.3 |
| 3 | | 33.5 |
| 4 | | 31.7 |
| 5 | | 31.7 |
| 6 | | 24.0 |
| 7 | | 18.4 |
| 8 | | 18.4 |
| 9 | | 18.3 |
| 10 | | 17.5 |
| 11 | | 15.2 |
| 12 | | 14.4 |
| 13 | | 6.2 |
| 14 | | 4.7 |
| 15 | | 4.0 |
| 16 | | 3.1 |
| 17 | | 3.0 |
Frequently asked
Pulled from the APEX-Agents dataset · updated daily
What does APEX-Agents measure?
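APEX-Agents is an agent benchmark in the BenchGecko catalog. It evaluates AI agents on complex, multi-step tasks requiring planning, tool use, and autonomous decision-making in realistic environments. 17 AI models have been tested on it, with scores ranging from 3.0 to 35.9 out of 100.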
Which model leads on APEX-Agents?
GPT-5.4 from OpenAI leads APEX-Agents with a score of 35.9. The median score across 17 tested models is 18.3.
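The summary figures quoted here can be reproduced from the score column of the full rankings. The sketch below copies the 17 scores from that table; the variable names are illustrative.

```python
from statistics import median

# The 17 scores from the full rankings table, highest first.
scores = [
    35.9, 34.3, 33.5, 31.7, 31.7, 24.0, 18.4, 18.4, 18.3,
    17.5, 15.2, 14.4, 6.2, 4.7, 4.0, 3.1, 3.0,
]

summary = {
    "count": len(scores),
    "min": min(scores),
    "max": max(scores),
    # With an odd number of scores, the median is the middle (9th) value.
    "median": median(scores),
}
print(summary)
```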
Is APEX-Agents saturated?
No · the top score is 35.9 out of 100 (36%). There is still meaningful room for improvement on APEX-Agents.
Does APEX-Agents predict performance on other benchmarks?
Yes · APEX-Agents scores have a Pearson correlation of 0.97 with Artificial Analysis · Coding Index across 6 shared models. Models that do well on APEX-Agents tend to do well on Artificial Analysis · Coding Index, though 6 models is a small sample.
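To get a feel for how much a correlation measured on 6 models can be trusted, one common approach is a confidence interval based on the Fisher z-transformation. The sketch below is an illustration rather than part of the BenchGecko methodology: the function name is hypothetical, and the method assumes the paired scores are roughly bivariate normal.

```python
from math import atanh, sqrt, tanh


def fisher_confidence_interval(r, n, z_crit=1.959964):
    """Approximate confidence interval for a Pearson r via the Fisher z-transform.

    z_crit defaults to the two-sided 95% critical value of the standard normal.
    """
    if n <= 3:
        raise ValueError("need more than 3 paired observations")
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    z = atanh(r)                   # transform r to the z scale
    se = 1.0 / sqrt(n - 3)         # standard error on the z scale
    low = tanh(z - z_crit * se)    # transform the bounds back to the r scale
    high = tanh(z + z_crit * se)
    return low, high


# Values from the text: r = 0.97 across 6 shared models.
low, high = fisher_confidence_interval(0.97, 6)
print(f"95% interval for r: {low:.2f} to {high:.3f}")
```

Under these assumptions the lower bound comes out near 0.74, so the data are consistent with a strong relationship but do not pin it down precisely.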
How often is APEX-Agents data refreshed?
BenchGecko pulls updates daily. New model scores on APEX-Agents appear in the next daily update after they are published by Epoch AI or the model provider.