
ARC-AGI-2

ARC-AGI-2 · the second iteration of the Abstraction and Reasoning Corpus, testing novel pattern recognition and abstract reasoning without prior training data.

Updated 2026-03-05
Models tested · 50
Top score · 83.3 (GPT-5.4 Pro)
Median · 4.7 · min 0.1
Top-5 spread · σ 7.7 · wide open

Best score over time · one chart, every benchmark

[Chart · ARC-AGI-2 frontier running max · 45 models · score (0–100) vs. release date, Jul 2024–Mar 2026 · benchgecko.ai/benchmark/arc-agi-2]
Frontier on ARC-AGI-2 rose from 0.1 to 83.3 in 20 months · +83.2 points · latest leader GPT-5.4 Pro from OpenAI.
Pink dots = frontier records · 13 total
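The frontier line in the chart is a running maximum over release date: a model sets a frontier record when its score beats every earlier release. A minimal sketch of that computation (model names are drawn from the table below, but the release dates here are illustrative placeholders, not the site's data):

```python
# Frontier records: models that beat every previously released model.
# Dates are illustrative placeholders, not actual leaderboard data.
models = [
    ("GPT-4o-mini", "2024-07-18", 0.1),
    ("o3", "2025-04-16", 6.5),
    ("Grok 4", "2025-07-09", 16.0),
    ("GPT-5", "2025-08-07", 9.9),       # not a record: Grok 4 already scored higher
    ("Gemini 3 Pro", "2025-11-18", 31.1),
    ("GPT-5.4 Pro", "2026-02-10", 83.3),
]

def frontier_records(models):
    """Return the (name, date, score) entries that set a new best score over time."""
    best = float("-inf")
    records = []
    for name, date, score in sorted(models, key=lambda m: m[1]):  # ISO dates sort lexically
        if score > best:
            best = score
            records.append((name, date, score))
    return records

for name, date, score in frontier_records(models):
    print(f"{date}  {name}  {score}")
```

Each record in the output corresponds to one pink dot on the chart.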

Where models cluster

Score distribution (bucket · models):
0–10 · 35
10–20 · 5
20–30 · 0
30–40 · 3
40–50 · 0
50–60 · 2
60–70 · 2
70–80 · 2
80–90 · 1
90–100 · 0
Median · 4.7
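The distribution above is a straightforward bucketing of scores into ten-point bins. A sketch, using a small illustrative subset of scores rather than the full 50-model table:

```python
import statistics

# Illustrative subset of benchmark scores, not the full 50-model table.
scores = [83.3, 77.1, 74.0, 69.2, 60.4, 54.2, 37.6, 18.3, 9.9, 4.9, 0.1]

def bucket_counts(scores, width=10, top=100):
    """Count scores per ten-point bin: 0-10, 10-20, ..., 90-100."""
    counts = {}
    for lo in range(0, top, width):
        hi = lo + width
        # Half-open bins [lo, hi), except a perfect 100 lands in the last bin.
        counts[f"{lo}-{hi}"] = sum(
            lo <= s < hi or (s == top and hi == top) for s in scores
        )
    return counts

print(bucket_counts(scores))
print("median:", statistics.median(scores))
```

With half-open bins every score lands in exactly one bucket, so the counts always sum to the number of models tested.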

Pearson r · original research

Correlation analysis

Benchmarks that track with ARC-AGI-2

Pearson correlation computed across models scored on both benchmarks. Values closer to 1 indicate that ARC-AGI-2 scores more strongly predict scores on the other benchmark.
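The correlation values on this page follow the standard Pearson formula applied to the models scored on both benchmarks. A minimal sketch (the paired scores below are made up for illustration):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical scores for models evaluated on both benchmarks.
arc_scores   = [83.3, 74.0, 31.1, 9.9, 0.1]
other_scores = [71.0, 65.0, 40.0, 12.0, 2.0]
print(round(pearson_r(arc_scores, other_scores), 3))
```

A value of 1 means the two score lists move in perfect lockstep; 0 means no linear relationship; −1 means they move in perfect opposition.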

50 models tested · sorted by score

# · Model · Org · Score
1 · GPT-5.4 Pro · OpenAI · 83.3
2 · Gemini 3.1 Pro Preview · Google DeepMind · 77.1
3 · GPT-5.4 · OpenAI · 74.0
4 · Claude Opus 4.6 · Anthropic · 69.2
5 · Claude Sonnet 4.6 · Anthropic · 60.4
6 · GPT-5.2 Pro · OpenAI · 54.2
7 · GPT-5.2 · OpenAI · 52.9
8 · Claude Opus 4.5 · Anthropic · 37.6
9 · Gemini 3 Flash Preview · Google DeepMind · 33.6
10 · Gemini 3 Pro · Google DeepMind · 31.1
11 · GPT-5 Pro · OpenAI · 18.3
12 · GPT-5.1 · OpenAI · 17.6
13 · Grok 4 · xAI · 16.0
14 · Claude Sonnet 4.5 · Anthropic · 13.6
15 · Kimi K2.5 · moonshotai · 11.8
16 · GPT-5 · OpenAI · 9.9
17 · Claude Opus 4 · Anthropic · 8.6
18 · o3 · OpenAI · 6.5
19 · o4 Mini · OpenAI · 6.1
20 · Claude Sonnet 4 · Anthropic · 5.9
21 · Grok 4 Fast · xAI · 5.3
22 · Gemini 2.5 Pro · Google DeepMind · 4.9
23 · GLM 5 · z-ai · 4.9
24 · MiniMax M2.5 · minimax · 4.9
25 · o3 Pro · OpenAI · 4.9
26 · GPT-5 Mini · OpenAI · 4.4
27 · Claude Haiku 4.5 · Anthropic · 4.0
28 · DeepSeek V3.2 · DeepSeek · 4.0
29 · o3 Mini · OpenAI · 3.0
30 · GPT-5 Nano · OpenAI · 2.6
31 · Gemini 2.5 Flash · Google DeepMind · 2.5
32 · Gemini 2.0 Flash · Google DeepMind · 1.3
33 · R1 · DeepSeek · 1.3
34 · Qwen3 235B A22B Instruct 2507 · Alibaba Qwen · 1.3
35 · R1 0528 · DeepSeek · 1.1
36 · Claude 3.7 Sonnet · Anthropic · 0.9
37 · o1-mini · OpenAI · 0.8
38 · Gemini 1.5 Pro (Feb 2024) · Google DeepMind · 0.8
39 · GPT-4.5 · OpenAI · 0.8
40 · GPT-4.1 · OpenAI · 0.4
41 · Grok 3 Mini · xAI · 0.4
42 · GPT-4.1 Mini · OpenAI · 0.1
43 · GPT-4.1 Nano · OpenAI · 0.1
44 · GPT-4o (2024-11-20) · OpenAI · 0.1
45 · GPT-4o-mini · OpenAI · 0.1
46 · GPT-4o-mini (2024-07-18) · OpenAI · 0.1
47 · Grok 3 · xAI · 0.1
48 · Llama 4 Maverick · Meta · 0.1
49 · Llama 4 Scout · Meta · 0.1
50 · Magistral Small 1.1 · — · 0.1

Pulled from the ARC-AGI-2 dataset · updated daily

What does ARC-AGI-2 measure?

ARC-AGI-2 is a reasoning benchmark in the BenchGecko catalog. 50 AI models have been tested on it. Scores range from 0.1 to 83.3 out of 100.

Which model leads on ARC-AGI-2?

GPT-5.4 Pro from OpenAI leads ARC-AGI-2 with a score of 83.3. The median score across 50 tested models is 4.7.

Is ARC-AGI-2 saturated?

No · the top score is 83.3 out of 100 (83%). There is still meaningful room for improvement on ARC-AGI-2.

Does ARC-AGI-2 predict performance on other benchmarks?

Yes · ARC-AGI-2 scores correlate 0.94 with GSO-Bench across 16 shared models. Models that do well on ARC-AGI-2 tend to do well on GSO-Bench.

How often is ARC-AGI-2 data refreshed?

BenchGecko pulls updates daily. New model scores on ARC-AGI-2 appear as soon as they are published by Epoch AI or the model provider.

Same category · related evaluations