Benchmark · Agent

APEX-Agents

APEX-Agents evaluates AI agents on complex, multi-step tasks requiring planning, tool use, and autonomous decision-making in realistic environments.

Updated 2026-03-05
Models tested · 17
Top score · 35.9 (GPT-5.4)
Median · 18.3 (min 3.0)
Top-5 spread · σ 1.6 (settled)

Best score over time · one chart, every benchmark

[Chart: APEX-Agents · frontier running max · 16 models · x-axis: release date, Jul 25 to Mar 26 · y-axis: score, 0 to 100 · source: benchgecko.ai/benchmark/apex-agents]
Frontier on APEX-Agents rose from 15.2 to 35.9 in 8 months · +20.7 points · latest leader GPT-5.4 from OpenAI.
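
The frontier line is a running maximum: a release counts as a record only if its score beats every earlier release. A minimal sketch of that calculation, assuming each result is a (release date, model, score) tuple. Only the 15.2 and 35.9 endpoints come from this page; the model names, dates, and other scores are hypothetical placeholders, not BenchGecko data.

```python
from datetime import date

# Hypothetical rows: (release_date, model, score). Only the 15.2 and 35.9
# endpoints come from the page; every name, date, and other score is made up.
results = [
    (date(2025, 7, 10), "model-a", 15.2),
    (date(2025, 8, 20), "model-b", 12.4),
    (date(2025, 10, 5), "model-c", 21.0),
    (date(2025, 12, 1), "model-d", 19.7),
    (date(2026, 3, 1), "model-e", 35.9),
]


def frontier_records(rows):
    """Return the rows that set a new best score, in release order."""
    records = []
    best = float("-inf")
    for row in sorted(rows, key=lambda r: r[0]):
        if row[2] > best:  # strictly better than every earlier release
            best = row[2]
            records.append(row)
    return records


records = frontier_records(results)
# Frontier gain: latest record minus first record, rounded to one decimal.
gain = round(records[-1][2] - records[0][2], 1)
```

With these placeholder rows, `model-b` and `model-d` are skipped because they score below the best earlier release, and `gain` reproduces the +20.7 figure from the two endpoints.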
Pink dots = frontier records · 5 total

Where models cluster

Score distribution · models per score bucket (0 to 100) · median 18.3 · source: benchgecko.ai
0–10 · 5
10–20 · 6
20–30 · 1
30–40 · 5
40–50 through 90–100 · 0
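
A distribution like this one can be rebuilt from raw scores by assigning each score to a 10-point bucket. The sketch below assumes the lower edge of each bucket is inclusive (a score of exactly 10 falls in 10–20) and that a perfect 100 belongs in the last bucket; the page does not state either rule. The score list is hypothetical, not the real leaderboard.

```python
from collections import Counter


def bucket_label(score, width=10, top=100):
    """Map a score to its bucket label, e.g. 18.3 -> '10–20'."""
    # Clamp so a perfect score lands in the last bucket, not '100–110'.
    low = min(int(score // width) * width, top - width)
    return f"{low}–{low + width}"


# Hypothetical scores for illustration only.
scores = [4.0, 8.5, 12.0, 15.5, 19.0, 24.0, 31.0, 34.5]

# Count how many models fall into each bucket.
histogram = Counter(bucket_label(s) for s in scores)
```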

Pearson r · original research

17 models tested · sorted by score

Pulled from the APEX-Agents dataset · updated daily

What does APEX-Agents measure?

APEX-Agents is an agent benchmark in the BenchGecko catalog. 17 AI models have been tested on it. Scores range from 3.0 to 35.9 out of 100.

Which model leads on APEX-Agents?

GPT-5.4 from OpenAI leads APEX-Agents with a score of 35.9. The median score across 17 tested models is 18.3.

Is APEX-Agents saturated?

No · the top score is 35.9 out of 100 (36%). There is still meaningful room for improvement on APEX-Agents.
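
The 36% figure is the top score divided by the maximum possible score. A small sketch of that arithmetic; the function name `headroom` is made up for illustration, and no particular saturation threshold is assumed because the page does not define one.

```python
def headroom(top_score, max_score=100.0):
    """Return (fraction of the maximum reached, points still available)."""
    return top_score / max_score, max_score - top_score


reached, remaining = headroom(35.9)
print(f"{reached:.0%} reached, {remaining:.1f} points of headroom")
# prints: 36% reached, 64.1 points of headroom
```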

Does APEX-Agents predict performance on other benchmarks?

Yes · APEX-Agents scores correlate strongly (Pearson r = 0.97) with the Artificial Analysis · Coding Index across 6 shared models. Models that do well on APEX-Agents tend to do well on the Coding Index too.
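
A correlation of this kind can be computed from the paired scores of the models that appear on both benchmarks. The sketch below uses the standard Pearson formula; the two score lists are hypothetical stand-ins for the 6 shared models, so the resulting value is illustrative and will not match the 0.97 reported here.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical paired scores for 6 shared models (not real data).
apex_scores = [5.0, 10.0, 18.0, 22.0, 30.0, 36.0]
coding_index = [20.0, 28.0, 35.0, 44.0, 50.0, 60.0]

r = pearson_r(apex_scores, coding_index)
```

With only 6 paired points, a single outlier can move r noticeably, so it helps to inspect the pairs themselves and not rely on the coefficient alone.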

How often is APEX-Agents data refreshed?

BenchGecko pulls updates daily. New model scores on APEX-Agents appear in the next daily update after they are published by Epoch AI or the model provider.
