ARC AI2
AI2 Reasoning Challenge · tests grade-school-level science knowledge with multiple-choice questions that require reasoning beyond simple retrieval.
The Frontier
Best score over time · one chart, every benchmark
Distribution
Where models cluster
Correlated benchmarks
Pearson r · original research
Benchmarks that track with ARC AI2
Pearson correlation across models scored on both benchmarks. Closer to 1 = strongly predictive.
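As a rough illustration of the calculation described above, the sketch below computes Pearson r over only the models that have a score on both benchmarks. The function name `pearson_on_shared` and the model names and scores in the usage lines are hypothetical placeholders, not BenchGecko code or data.

```python
from math import sqrt

def pearson_on_shared(a: dict[str, float], b: dict[str, float]) -> tuple[float, int]:
    """Return (Pearson r, number of shared models) for two score tables."""
    # Only models scored on BOTH benchmarks contribute to the correlation.
    shared = sorted(a.keys() & b.keys())
    n = len(shared)
    if n < 2:
        raise ValueError("need at least two shared models")
    xs = [a[m] for m in shared]
    ys = [b[m] for m in shared]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("scores are constant on one benchmark; r is undefined")
    return cov / sqrt(var_x * var_y), n

# Hypothetical scores for illustration only.
bench_a = {"model-a": 90.0, "model-b": 70.0, "model-c": 40.0, "model-d": 55.0}
bench_b = {"model-a": 1300.0, "model-b": 1210.0, "model-c": 1100.0, "model-e": 1250.0}
r, n_shared = pearson_on_shared(bench_a, bench_b)
print(f"r = {r:.2f} across {n_shared} shared models")
```

Returning the shared-model count next to r makes it easier to judge how much weight a given correlation deserves.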
Full rankings
35 models tested · sorted by score
| # | Model | Score |
|---|---|---|
| 1 | | 93.7 |
| 2 | | 93.7 |
| 3 | | 92.7 |
| 4 | | 89.6 |
| 5 | | 88.8 |
| 6 | | 87.6 |
| 7 | | 83.2 |
| 8 | | 83.1 |
| 9 | | 81.7 |
| 10 | Stable Beluga 2 | 81.5 |
| 11 | | 79.9 |
| 12 | | 79.2 |
| 13 | | 77.1 |
| 14 | | 71.5 |
| 15 | | 67.9 |
| 16 | | 60.7 |
| 17 | | 57.1 |
| 18 | | 47.9 |
| 19 | | 47.1 |
| 20 | Nemotron-4 15B | 40.7 |
| 21 | INTELLECT-1 | 39.4 |
| 22 | | 36.9 |
| 23 | MPT-30B | 34.1 |
| 24 | Yi 6B | 33.7 |
| 25 | StarCoder 2 15B | 29.6 |
| 26 | | 26.9 |
| 27 | | 25.9 |
| 28 | | 22.9 |
| 29 | | 22.8 |
| 30 | XGen-7B | 21.6 |
| 31 | Dolly 2.0-12b | 19.5 |
| 32 | | 15.2 |
| 33 | Baichuan 2-7B | 10.0 |
| 34 | | 9.9 |
| 35 | | 0.5 |
Frequently asked
Pulled from the ARC AI2 dataset · updated daily
What does ARC AI2 measure?
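ARC AI2 (the AI2 Reasoning Challenge) is a knowledge benchmark in the BenchGecko catalog. It tests grade-school-level science with multiple-choice questions that require reasoning beyond simple retrieval. 35 AI models have been tested on it, with scores ranging from 0.5 to 93.7 out of 100.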
Which model leads on ARC AI2?
DeepSeek V3 from DeepSeek leads ARC AI2 with a score of 93.7. The median score across 35 tested models is 47.9.
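The summary figures quoted above can be reproduced from the scores in the Full rankings table. This is a minimal sketch using Python's standard library; the list is copied by hand from the table, and the variable names are illustrative only.

```python
from statistics import median

# Scores copied from the Full rankings table (35 entries, highest first).
scores = [
    93.7, 93.7, 92.7, 89.6, 88.8, 87.6, 83.2, 83.1, 81.7, 81.5,
    79.9, 79.2, 77.1, 71.5, 67.9, 60.7, 57.1, 47.9, 47.1, 40.7,
    39.4, 36.9, 34.1, 33.7, 29.6, 26.9, 25.9, 22.9, 22.8, 21.6,
    19.5, 15.2, 10.0, 9.9, 0.5,
]

summary = {
    "count": len(scores),
    "min": min(scores),
    "max": max(scores),
    # With an odd number of entries, the median is the middle (18th) score.
    "median": median(scores),
}
print(summary)
# {'count': 35, 'min': 0.5, 'max': 93.7, 'median': 47.9}
```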
Is ARC AI2 saturated?
No · the top score is 93.7 out of 100, which leaves 6.3 points of headroom at the top, and the median across tested models is only 47.9. Most models still have substantial room to improve on ARC AI2.
Does ARC AI2 predict performance on other benchmarks?
Yes, tentatively · ARC AI2 scores correlate 0.90 (Pearson r) with Chatbot Arena Elo · Overall, but across only 5 shared models, so the estimate is uncertain. Within that sample, models that do well on ARC AI2 tend to do well on Chatbot Arena Elo · Overall.
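To get a feel for how much a correlation of 0.90 over 5 models can tell us, one common approach is an approximate confidence interval via the Fisher z-transform. The sketch below is an illustration under the usual assumptions (roughly bivariate-normal scores, independent models), which may not hold for benchmark data. The helper name `fisher_ci` is hypothetical and not part of BenchGecko.

```python
from math import atanh, sqrt, tanh

def fisher_ci(r: float, n: int, z_crit: float = 1.96) -> tuple[float, float]:
    """Approximate confidence interval for a Pearson r from n paired samples.

    z_crit = 1.96 corresponds to a two-sided 95% interval.
    """
    if n <= 3:
        raise ValueError("the Fisher approximation needs n > 3")
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    z = atanh(r)              # Fisher z-transform of r
    se = 1.0 / sqrt(n - 3)    # approximate standard error of z
    return tanh(z - z_crit * se), tanh(z + z_crit * se)

low, high = fisher_ci(0.90, 5)
print(f"95% CI for r: {low:.2f} to {high:.2f}")
```

With only 5 shared models the interval is very wide, running from roughly 0.09 to 0.99, so the 0.90 figure is better read as suggestive than as a firm estimate.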
How often is ARC AI2 data refreshed?
BenchGecko pulls updates daily. New model scores on ARC AI2 appear in the next daily update after they are published by Epoch AI or the model provider.