
APEX-Agents

APEX-Agents evaluates AI agents on complex, multi-step tasks that require planning, tool use, and autonomous decision-making in realistic environments.

Updated 2026-03-05

Models tested: 17

Top score: 35.9 (GPT-5.4)

Median: 18.3 (min 3.0)

Top-5 spread: σ 1.6 (settled)

The Frontier

Best score over time · one chart, every benchmark

Frontier on APEX-Agents rose from 15.2 to 35.9 in 8 months · +20.7 points · latest leader GPT-5.4 from OpenAI.

Pink dots = frontier records · 5 total
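To make the idea of a frontier record concrete, here is a minimal sketch. It assumes a record is any result that beats every earlier score; the page does not spell out its exact rule. The function name `frontier_records` and the sample data are hypothetical and are not the real APEX-Agents history.

```python
def frontier_records(results):
    """Return the (date, score) pairs that set a new best score, in date order."""
    records = []
    best = float("-inf")
    # ISO-formatted date strings sort chronologically as plain strings.
    for date, score in sorted(results):
        if score > best:  # strictly better than everything seen so far
            records.append((date, score))
            best = score
    return records


# Hypothetical (ISO date, score) pairs, for illustration only.
sample = [
    ("2025-01-10", 12.0),
    ("2025-02-01", 9.5),   # below the current best, not a record
    ("2025-03-15", 20.0),
    ("2025-04-02", 20.0),  # ties the best, not a record under this rule
    ("2025-05-20", 27.5),
]

records = frontier_records(sample)
gain = records[-1][1] - records[0][1]  # frontier improvement, first to last record
print(records, gain)
```

Applied to the real history, the same subtraction gives the figure quoted in the summary: 35.9 - 15.2 = 20.7 points.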

Distribution

Where models cluster

Correlated benchmarks

Pearson r · original research

Correlation analysis

Benchmarks that track with APEX-Agents

Pearson correlation computed across models scored on both benchmarks. Values closer to +1 mean that scores on the two benchmarks rise together more closely.

Benchmark	Category	Pearson r	Shared models
Artificial Analysis · Coding Index	Knowledge	+0.97	6
PostTrainBench	Knowledge	not shown	not shown
Artificial Analysis · Quality Index	Knowledge	+0.89	6
FrontierMath-2025-02-28-Private	Math	not shown	not shown
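For readers who want to see how a Pearson r like those above is obtained, the following is a minimal sketch of the standard formula. The helper name `pearson_r` and the two score lists are hypothetical; the page does not publish the per-model pairs behind its correlations, so these numbers only illustrate the calculation.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / sqrt(var_x * var_y)


# Hypothetical scores for models tested on both benchmarks (illustration only).
apex_scores = [10.0, 20.0, 30.0, 40.0]
other_scores = [15.0, 22.0, 41.0, 48.0]
print(round(pearson_r(apex_scores, other_scores), 2))
```

With only a handful of shared models, as in the 6-model overlaps listed here, a single added or removed model can move r noticeably.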

Full rankings

17 models tested · sorted by score

#	Model	Score	Price
1	GPT-5.4 · OpenAI	35.9	$2.50
2	GPT-5.2 · OpenAI	34.3	$1.75
3	Gemini 3.1 Pro Preview · Google DeepMind	33.5	$2.00
4	Claude Opus 4.6 · Anthropic	31.7	$5.00
5	GPT-5.3-Codex · OpenAI	31.7	$1.75
6	Gemini 3 Flash Preview · Google DeepMind	24.0	$0.50
7	Claude Opus 4.5 · Anthropic	18.4	$5.00
8	Gemini 3 Pro · Google DeepMind	18.4	—
9	GPT-5 · OpenAI	18.3	$1.25
10	GPT-5.1 · OpenAI	17.5	$1.25
11	Grok 4 · xAI	15.2	$3.00
12	Kimi K2.5 · moonshotai	14.4	$0.38
13	MiniMax M2.5 · minimax	6.2	$0.12
14	gpt-oss-120b · OpenAI	4.7	$0.04
15	Kimi K2 Thinking · moonshotai	4.0	$0.60
16	GLM 4.7 · z-ai	3.1	$0.39
17	GLM 4.6 · z-ai	3.0	$0.39
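The headline statistics can be recomputed from the rankings table. This sketch copies the 17 scores from the table and assumes that "Top-5 spread" is the population standard deviation of the five best scores. That is a guess about the page's method, though it does reproduce the reported σ of 1.6.

```python
from statistics import median, pstdev

# Scores copied from the rankings table, sorted high to low.
scores = [35.9, 34.3, 33.5, 31.7, 31.7, 24.0, 18.4, 18.4, 18.3,
          17.5, 15.2, 14.4, 6.2, 4.7, 4.0, 3.1, 3.0]

top_score = max(scores)
median_score = median(scores)  # middle (9th) value of the 17 scores
min_score = min(scores)

# Assumption: spread = population standard deviation of the top five scores.
top5 = sorted(scores, reverse=True)[:5]
top5_sigma = pstdev(top5)

print(len(scores), top_score, median_score, min_score, round(top5_sigma, 1))
# prints: 17 35.9 18.3 3.0 1.6
```

Using the sample standard deviation (`statistics.stdev`) instead would give about 1.8, which is why the population form is assumed here.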

Frequently asked

Pulled from the APEX-Agents dataset · updated daily

What does APEX-Agents measure?

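APEX-Agents is an agent benchmark in the BenchGecko catalog. It evaluates AI agents on complex, multi-step tasks that require planning, tool use, and autonomous decision-making in realistic environments. 17 AI models have been tested on it, with scores ranging from 3.0 to 35.9 out of 100.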

Which model leads on APEX-Agents?

GPT-5.4 from OpenAI leads APEX-Agents with a score of 35.9. The median score across 17 tested models is 18.3.

Is APEX-Agents saturated?

No · the top score is 35.9 out of 100, which leaves about 64 points of headroom on APEX-Agents.

Does APEX-Agents predict performance on other benchmarks?

Yes · APEX-Agents scores have a Pearson correlation of +0.97 with Artificial Analysis · Coding Index across 6 shared models. Models that do well on APEX-Agents tend to do well on Artificial Analysis · Coding Index, although 6 models is a small sample for estimating a correlation.

How often is APEX-Agents data refreshed?

BenchGecko pulls updates daily. New model scores on APEX-Agents appear as soon as they are published by Epoch AI or the model provider.