APEX-Agents
APEX-Agents evaluates AI agents on complex, multi-step tasks that require planning, tool use, and autonomous decision-making in realistic environments.
The Frontier
Best score over time · one chart, every benchmark
Full Rankings
17 models tested · sorted by score
| # | Model | Score |
|---|---|---|
| 1 | GPT-5.4 | 35.9 |
| 2 | | 34.3 |
| 3 | | 33.5 |
| 4 | | 31.7 |
| 5 | | 31.7 |
| 6 | | 24.0 |
| 7 | | 18.4 |
| 8 | | 18.4 |
| 9 | | 18.3 |
| 10 | | 17.5 |
| 11 | | 15.2 |
| 12 | | 14.4 |
| 13 | | 6.2 |
| 14 | | 4.7 |
| 15 | | 4.0 |
| 16 | | 3.1 |
| 17 | | 3.0 |
Score Distribution
Where models cluster
Correlated Benchmarks
Pearson r · independent research
Benchmarks that track with APEX-Agents
Pearson correlation across models scored on both benchmarks. Values closer to 1 mean the two benchmarks rank models more similarly.
Frequently Asked Questions
About APEX-Agents
What does APEX-Agents measure?
APEX-Agents evaluates AI agents on complex, multi-step tasks that require planning, tool use, and autonomous decision-making in realistic environments. So far, 17 AI models have been tested on it, with scores ranging from 3.0 to 35.9 out of 100.
Which model leads on APEX-Agents?
GPT-5.4 from OpenAI leads APEX-Agents with a score of 35.9. The median score across 17 tested models is 18.3.
Is APEX-Agents saturated?
No. The top score is 35.9 out of 100 (36%), so there is still substantial room for improvement on APEX-Agents.
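The summary figures quoted in these answers (top score, median, and the top score as a share of the maximum) can be reproduced from the ranking table. The following is a minimal sketch; the scores are copied from the table, while the function name `summarize` and the dictionary keys are illustrative choices, not part of any BenchGecko tooling.

```python
from statistics import median

# Scores copied from the ranking table, highest first.
scores = [35.9, 34.3, 33.5, 31.7, 31.7, 24.0, 18.4, 18.4, 18.3,
          17.5, 15.2, 14.4, 6.2, 4.7, 4.0, 3.1, 3.0]

MAX_SCORE = 100  # maximum possible score listed for the benchmark


def summarize(values, max_score=MAX_SCORE):
    """Return basic leaderboard statistics (hypothetical helper)."""
    top = max(values)
    return {
        "count": len(values),
        "top": top,
        "lowest": min(values),
        "median": median(values),
        # Top score as a whole-number percentage of the maximum.
        "top_pct_of_max": round(100 * top / max_score),
    }


summary = summarize(scores)
print(summary)
```

With an odd number of models (17), the median is simply the 9th score in the sorted list.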
Does APEX-Agents predict performance on other benchmarks?
Yes. APEX-Agents scores have a Pearson correlation of 0.97 with Artificial Analysis · Coding Index across 6 shared models. Models that do well on APEX-Agents tend to do well on Artificial Analysis · Coding Index.
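For readers who want to see how such a coefficient is obtained, here is a minimal sketch of Pearson's r for two lists of scores from the same models. The function name `pearson_r` and both score lists are hypothetical placeholders invented for illustration; they are not the actual APEX-Agents or Coding Index values, and this is not necessarily how BenchGecko computes its figure.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Made-up scores for six models on two benchmarks (NOT real data).
bench_a = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
bench_b = [12.0, 19.0, 33.0, 41.0, 48.0, 62.0]

r = pearson_r(bench_a, bench_b)
print(f"Pearson r = {r:.2f}")
```

Only models scored on both benchmarks enter the calculation. With as few as 6 shared models, a single outlier can shift r noticeably.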
How often is APEX-Agents data refreshed?
BenchGecko pulls updates daily. New model scores on APEX-Agents appear as soon as they are published by Epoch AI or the model provider.
- Category: Agent
- Max score: 100
- Models: 17
- Updated: 2026-03-05
More agent benchmarks
Same category · related evaluations