ARC AI2
AI2 Reasoning Challenge · tests grade-school level science knowledge with multiple-choice questions requiring reasoning beyond simple retrieval.
The Frontier
Best score over time · one chart, every benchmark
Full ranking
35 models tested · sorted by score
| # | Model | Score |
|---|---|---|
| 1 | | 93.7 |
| 2 | | 93.7 |
| 3 | | 92.7 |
| 4 | | 89.6 |
| 5 | | 88.8 |
| 6 | | 87.6 |
| 7 | | 83.2 |
| 8 | | 83.1 |
| 9 | | 81.7 |
| 10 | Stable Beluga 2 | 81.5 |
| 11 | | 79.9 |
| 12 | | 79.2 |
| 13 | | 77.1 |
| 14 | | 71.5 |
| 15 | | 67.9 |
| 16 | | 60.7 |
| 17 | | 57.1 |
| 18 | | 47.9 |
| 19 | | 47.1 |
| 20 | Nemotron-4 15B | 40.7 |
| 21 | INTELLECT-1 | 39.4 |
| 22 | | 36.9 |
| 23 | MPT-30B | 34.1 |
| 24 | Yi 6B | 33.7 |
| 25 | StarCoder 2 15B | 29.6 |
| 26 | | 26.9 |
| 27 | | 25.9 |
| 28 | | 22.9 |
| 29 | | 22.8 |
| 30 | XGen-7B | 21.6 |
| 31 | Dolly 2.0-12b | 19.5 |
| 32 | | 15.2 |
| 33 | Baichuan 2-7B | 10.0 |
| 34 | | 9.9 |
| 35 | | 0.5 |
Score distribution
Where models cluster
Correlated benchmarks
Pearson r · original research
Benchmarks that track with ARC AI2
Pearson correlation across models scored on both benchmarks. Closer to 1 = strongly predictive.
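As a rough illustration of the statistic described above, the following sketch computes a Pearson correlation over two lists of per-model scores. The function name `pearson_r` and the sample numbers are assumptions made up for this example; they are not real benchmark data, and this is not necessarily how the site computes its figures.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores for five models on two benchmarks (illustrative only).
bench_a = [10.0, 20.0, 30.0, 40.0, 50.0]
bench_b = [1.0, 2.0, 3.0, 4.0, 5.0]
r = pearson_r(bench_a, bench_b)
print(f"r = {r:.2f}")
```

Only models scored on both benchmarks should go into the two lists, in the same order, so that each pair of values belongs to one model.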
Frequently asked questions
About ARC AI2
What does ARC AI2 measure?
AI2 Reasoning Challenge · tests grade-school level science knowledge with multiple-choice questions requiring reasoning beyond simple retrieval. 35 AI models have been tested on it. Scores range from 0.5 to 93.7 out of 100.
Which model leads on ARC AI2?
DeepSeek V3 from DeepSeek leads ARC AI2 with a score of 93.7. The median score across 35 tested models is 47.9.
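The summary figures quoted in these answers can be cross-checked against the ranking table. The sketch below assumes the 35 scores transcribed from that table and uses Python's standard `statistics` module; the variable names are illustrative.

```python
import statistics

# Scores transcribed from the ranking table, highest to lowest.
scores = [
    93.7, 93.7, 92.7, 89.6, 88.8, 87.6, 83.2, 83.1, 81.7, 81.5,
    79.9, 79.2, 77.1, 71.5, 67.9, 60.7, 57.1, 47.9, 47.1, 40.7,
    39.4, 36.9, 34.1, 33.7, 29.6, 26.9, 25.9, 22.9, 22.8, 21.6,
    19.5, 15.2, 10.0, 9.9, 0.5,
]

model_count = len(scores)
top_score = max(scores)
lowest_score = min(scores)
# With an odd number of entries, the median is the middle (18th) value.
median_score = statistics.median(scores)

print(model_count, lowest_score, median_score, top_score)
```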
Is ARC AI2 saturated?
No · the top score is 93.7 out of 100 (94%). There is still meaningful room for improvement on ARC AI2.
Does ARC AI2 predict performance on other benchmarks?
Yes · ARC AI2 scores correlate 0.90 with Chatbot Arena Elo · Overall across 5 shared models. Models that do well on ARC AI2 tend to do well on Chatbot Arena Elo · Overall.
How often is ARC AI2 data refreshed?
BenchGecko pulls updates daily. New model scores on ARC AI2 appear as soon as they are published by Epoch AI or the model provider.
- Category: Knowledge
- Maximum score: 100
- Models: 35
- Updated: 2025-04-15
Other knowledge benchmarks
Same category · related evaluations