ARC AI2

AI2 Reasoning Challenge · tests grade-school-level science knowledge using multiple-choice questions that require reasoning beyond simple retrieval.

Updated 2025-04-15
Models tested · 35
Top score · 93.7 (DeepSeek V3)
Median · 47.9 (min 0.5)
Top-5 spread · σ 2.1 (competitive)

Best score over time · one chart, every benchmark

[Chart: ARC AI2 · 7 models · frontier running max · y-axis: score (0 to 100) · x-axis: release date (Apr 24 to Apr 25) · benchgecko.ai/benchmark/arc-ai2]
Only 7 models have been tested on ARC AI2 · not enough history to compute a frontier yet.
Pink dots = frontier records · 1 total
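As a rough illustration of what a "frontier running max" means, the sketch below walks results in release-date order and keeps only the entries that beat every earlier score. It assumes a frontier record is a score strictly higher than all scores released before it; BenchGecko's exact rule is not stated on this page. The function name `frontier_records` and the sample data are hypothetical placeholders, not real ARC AI2 results.

```python
from datetime import date

def frontier_records(results):
    """Return the (release_date, model, score) entries that set a new running max.

    Assumption: a record is a score strictly above every earlier score.
    """
    records = []
    best = float("-inf")
    # Process results in chronological order of release date.
    for release, model, score in sorted(results, key=lambda r: r[0]):
        if score > best:
            best = score
            records.append((release, model, score))
    return records

# Hypothetical placeholder data, NOT real benchmark scores.
results = [
    (date(2024, 7, 1), "model-b", 61.0),
    (date(2024, 4, 1), "model-a", 55.0),
    (date(2024, 10, 1), "model-c", 58.0),
    (date(2025, 1, 1), "model-d", 72.5),
]

records = frontier_records(results)
for release, model, score in records:
    print(release.isoformat(), model, score)
```

In this made-up sample, `model-c` is not a record because it scores below the earlier `model-b`.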

Where models cluster

Score distribution · models per score bucket (0 to 100) · median 47.9

0–10 · 2
10–20 · 3
20–30 · 6
30–40 · 4
40–50 · 3
50–60 · 1
60–70 · 2
70–80 · 4
80–90 · 7
90–100 · 3
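The bucket counts read off the distribution chart can be sanity-checked against the headline numbers. The sketch below, assuming ten equal-width buckets from 0 to 100 with the counts 2, 3, 6, 4, 3, 1, 2, 4, 7, 3, confirms that they add up to the 35 tested models and finds which bucket holds the median. The helper `median_bucket` is a hypothetical name introduced for this example.

```python
def median_bucket(counts, width=10):
    """Return the (low, high) bounds of the bucket containing the median.

    For an even total this picks the bucket of the lower middle value.
    """
    total = sum(counts)
    if total == 0:
        raise ValueError("counts must contain at least one model")
    # 1-based rank of the middle value in sorted order.
    target = (total + 1) // 2
    running = 0
    for i, count in enumerate(counts):
        running += count
        if running >= target:
            return i * width, (i + 1) * width

# Counts per bucket as read from the chart (0-10, 10-20, ..., 90-100).
counts = [2, 3, 6, 4, 3, 1, 2, 4, 7, 3]

total_models = sum(counts)
low, high = median_bucket(counts)
print(total_models, low, high)
```

The counts sum to 35 and the median falls in the 40 to 50 bucket, which agrees with the reported median of 47.9.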

Pearson r · original research

Correlation analysis

Benchmarks that track with ARC AI2

Pearson correlation across models scored on both benchmarks. Closer to 1 = strongly predictive.
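A minimal sketch of how such a correlation could be computed: restrict both benchmarks to the models scored on each, then apply the standard Pearson formula. The helper names `shared_scores` and `pearson_r` and the score dictionaries are hypothetical illustrations; this is not BenchGecko's actual pipeline or data.

```python
import math

def shared_scores(bench_a, bench_b):
    """Return paired score lists for models present in both benchmarks."""
    shared = sorted(bench_a.keys() & bench_b.keys())
    return [bench_a[m] for m in shared], [bench_b[m] for m in shared]

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical placeholder scores, NOT real benchmark results.
benchmark_a = {"model-a": 40.0, "model-b": 55.0, "model-c": 70.0, "model-d": 90.0}
benchmark_b = {"model-b": 1100.0, "model-c": 1180.0, "model-d": 1250.0, "model-e": 1300.0}

xs, ys = shared_scores(benchmark_a, benchmark_b)
r = pearson_r(xs, ys)
print(len(xs), round(r, 3))
```

With only a handful of shared models, a single additional score can move r noticeably, so the number of shared models is worth reading alongside the coefficient.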

35 models tested · sorted by score

Pulled from the ARC AI2 dataset · updated daily

What does ARC AI2 measure?

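ARC AI2 (the AI2 Reasoning Challenge) is a knowledge benchmark in the BenchGecko catalog. It tests grade-school-level science knowledge with multiple-choice questions that require reasoning beyond simple retrieval. 35 AI models have been tested on it, with scores ranging from 0.5 to 93.7 out of 100.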

Which model leads on ARC AI2?

DeepSeek V3 from DeepSeek leads ARC AI2 with a score of 93.7. The median score across 35 tested models is 47.9.

Is ARC AI2 saturated?

No · the top score is 93.7 out of 100 (94%). There is still meaningful room for improvement on ARC AI2.

Does ARC AI2 predict performance on other benchmarks?

Yes · ARC AI2 scores have a Pearson correlation of 0.90 with Chatbot Arena Elo · Overall, computed across 5 shared models. Models that do well on ARC AI2 tend to do well on Chatbot Arena Elo · Overall, though 5 shared models is a small sample.

How often is ARC AI2 data refreshed?

BenchGecko pulls updates daily. New model scores on ARC AI2 appear as soon as they are published by Epoch AI or the model provider.
