ARC AI2
AI2 Reasoning Challenge · tests grade-school level science knowledge with multiple-choice questions requiring reasoning beyond simple retrieval.
The Frontier
Best score over time · one chart, every benchmark
Full rankings
35 models tested · sorted by score
| # | Model | Score |
|---|---|---|
| 1 | | 93.7 |
| 2 | | 93.7 |
| 3 | | 92.7 |
| 4 | | 89.6 |
| 5 | | 88.8 |
| 6 | | 87.6 |
| 7 | | 83.2 |
| 8 | | 83.1 |
| 9 | | 81.7 |
| 10 | Stable Beluga 2 | 81.5 |
| 11 | | 79.9 |
| 12 | | 79.2 |
| 13 | | 77.1 |
| 14 | | 71.5 |
| 15 | | 67.9 |
| 16 | | 60.7 |
| 17 | | 57.1 |
| 18 | | 47.9 |
| 19 | | 47.1 |
| 20 | Nemotron-4 15B | 40.7 |
| 21 | INTELLECT-1 | 39.4 |
| 22 | | 36.9 |
| 23 | MPT-30B | 34.1 |
| 24 | Yi 6B | 33.7 |
| 25 | StarCoder 2 15B | 29.6 |
| 26 | | 26.9 |
| 27 | | 25.9 |
| 28 | | 22.9 |
| 29 | | 22.8 |
| 30 | XGen-7B | 21.6 |
| 31 | Dolly 2.0-12b | 19.5 |
| 32 | | 15.2 |
| 33 | Baichuan 2-7B | 10.0 |
| 34 | | 9.9 |
| 35 | | 0.5 |
Score distribution
Where models cluster
Related benchmarks
Pearson r · original research
Benchmarks that track with ARC AI2
Pearson correlation across models scored on both benchmarks. Closer to 1 = strongly predictive.
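As a rough illustration of how such a correlation could be computed, the sketch below restricts two score tables to the models present in both and applies the standard Pearson formula. The function names, model names, and score values are hypothetical placeholders, not BenchGecko's code or data.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def benchmark_correlation(scores_a, scores_b):
    """Correlate two {model: score} tables over the models they share.

    Returns (r, number_of_shared_models).
    """
    shared = sorted(set(scores_a) & set(scores_b))
    xs = [scores_a[m] for m in shared]
    ys = [scores_b[m] for m in shared]
    return pearson_r(xs, ys), len(shared)


# Hypothetical scores, for illustration only (not real leaderboard data).
arc_scores = {"model_a": 90.0, "model_b": 70.0, "model_c": 50.0, "model_d": 30.0}
other_scores = {"model_a": 1300.0, "model_b": 1200.0, "model_c": 1100.0, "model_e": 1000.0}

r, n_shared = benchmark_correlation(arc_scores, other_scores)
print(f"r = {r:.2f} across {n_shared} shared models")
```

With only a handful of shared models, a coefficient computed this way is sensitive to any single score, so it is worth reading together with the shared-model count.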
Frequently asked questions
About ARC AI2
What does ARC AI2 measure?
AI2 Reasoning Challenge · tests grade-school level science knowledge with multiple-choice questions requiring reasoning beyond simple retrieval. 35 AI models have been tested on it. Scores range from 0.5 to 93.7 out of 100.
Which model leads on ARC AI2?
DeepSeek V3 from DeepSeek leads ARC AI2 with a score of 93.7. The median score across 35 tested models is 47.9.
Is ARC AI2 saturated?
No · the top score is 93.7 out of 100 (94%), which leaves 6.3 points of headroom on ARC AI2.
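The summary figures quoted in these answers can be rechecked from the ranking table. This is a minimal sketch, assuming the 35 scores are copied from the table above and that the maximum possible score is 100; the variable names are illustrative.

```python
from statistics import median

# Scores copied from the ranking table, in rank order (35 entries).
SCORES = [
    93.7, 93.7, 92.7, 89.6, 88.8, 87.6, 83.2, 83.1, 81.7, 81.5,
    79.9, 79.2, 77.1, 71.5, 67.9, 60.7, 57.1, 47.9, 47.1, 40.7,
    39.4, 36.9, 34.1, 33.7, 29.6, 26.9, 25.9, 22.9, 22.8, 21.6,
    19.5, 15.2, 10.0, 9.9, 0.5,
]
MAX_SCORE = 100  # maximum possible score, as listed in the page metadata

top_score = max(SCORES)
lowest_score = min(SCORES)
median_score = median(SCORES)  # odd count, so this is the 18th-ranked score
top_share_pct = round(top_score / MAX_SCORE * 100)  # top score as % of maximum

print(f"models tested: {len(SCORES)}")
print(f"range: {lowest_score} to {top_score}")
print(f"median: {median_score}")
print(f"top score is {top_share_pct}% of the maximum")
```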
Does ARC AI2 predict performance on other benchmarks?
Yes · ARC AI2 scores have a Pearson correlation of 0.90 with Chatbot Arena Elo · Overall, computed across the 5 models scored on both. Models that do well on ARC AI2 tend to do well on Chatbot Arena Elo · Overall.
How often is ARC AI2 data refreshed?
BenchGecko pulls updates daily. New model scores on ARC AI2 appear as soon as they are published by Epoch AI or the model provider.
- Category: Knowledge
- Maximum score: 100
- Models: 35
- Updated: 2025-04-15