Which model leads on ARC AI2?

DeepSeek V3 from DeepSeek leads ARC AI2 with a score of 93.7, tied with Llama 3.1 405B from Meta. The median score across 35 tested models is 47.9.

Is ARC AI2 saturated?

No · the top score is 93.7 out of 100 (94%), which leaves 6.3 points of headroom on ARC AI2.

Does ARC AI2 predict performance on other benchmarks?

Yes · ARC AI2 scores correlate 0.90 with Chatbot Arena Elo · Overall across 5 shared models. Models that do well on ARC AI2 tend to do well on Chatbot Arena Elo · Overall.

How often is ARC AI2 data refreshed?

BenchGecko pulls updates daily. New model scores on ARC AI2 appear at the next daily pull after they are published by Epoch AI or the model provider.

Benchmark · Knowledge · Settled

ARC AI2

Name: ARC AI2 Benchmark
Creator: BenchGecko
License: https://creativecommons.org/licenses/by/4.0/

AI2 Reasoning Challenge · tests grade-school-level science knowledge with multiple-choice questions that require reasoning beyond simple retrieval.

Updated 2025-04-15

Models tested

35

Top score

93.7

DeepSeek V3

Median

47.9

min 0.5

Top-5 spread

σ 2.1

Competitive
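The σ figure above can be reproduced from the five highest scores in the rankings. The sketch below assumes the spread is a population standard deviation (dividing by n, not n - 1); the page does not say which form it uses, and the variable names are illustrative only.

```python
from statistics import pstdev

# Five highest ARC AI2 scores, copied from the rankings on this page.
top5_scores = [93.7, 93.7, 92.7, 89.6, 88.8]

# Assumption: the spread is the population standard deviation (divide by n).
top5_spread = pstdev(top5_scores)

print(f"top-5 spread: {top5_spread:.1f}")
```

Under that assumption the result rounds to the 2.1 shown here; the sample form (dividing by n - 1) would give roughly 2.3 instead.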

The Frontier

Best score over time · one chart, every benchmark

Only 6 models have been tested on ARC AI2 · not enough history to compute a frontier yet.

Pink dots = frontier records · 1 total

Full rankings

35 models tested · sorted by score

#	Model	Score	Price
1	DeepSeek V3 · DeepSeek	93.7	$0.32
2	Llama 3.1 405B · Meta	93.7	—
3	Qwen2.5 72B Instruct · Alibaba Qwen	92.7	$0.36
4	DeepSeek-V2 (MoE-236B, May 2024) · DeepSeek	89.6	—
5	phi-3-medium 14B · Microsoft	88.8	—
6	phi-3-small 7.4B · Microsoft	87.6	—
7	GPT-3.5 Turbo (older v0613) · OpenAI	83.2	$1.00
8	Mixtral 8x7B Instruct · Mistral AI	83.1	$0.54
9	Claude Instant · Anthropic	81.7	—
10	Stable Beluga 2 · Unknown	81.5	—
11	phi-3-mini 3.8B · Microsoft	79.9	—
12	Qwen-14B · Alibaba Qwen	79.2	—
13	Llama 3 8B Instruct · Meta	77.1	$0.03
14	Mistral 7B V0.1 · Mistral AI	71.5	—
15	Phi 2 · Microsoft	67.9	—
16	Qwen2.5 Coder 32B Instruct · Alibaba Qwen	60.7	$0.66
17	Falcon-180B · TII	57.1	—
18	Qwen2.5 Coder 7B Instruct · Alibaba Qwen	47.9	$0.03
19	Llama 2-13B · Meta	47.1	—
20	Nemotron-4 15B · Unknown	40.7	—
21	INTELLECT-1 · Unknown	39.4	—
22	LLaMA-13B · Meta	36.9	—
23	MPT-30B · Unknown	34.1	—
24	Yi 6B · Unknown	33.7	—
25	StarCoder 2 15B · Unknown	29.6	—
26	Qwen2.5 Coder 1.5B Instruct · Alibaba	26.9	—
27	Phi-1.5 · Microsoft	25.9	—
28	DeepSeek Coder 33B · DeepSeek	22.9	—
29	Gemma 2B · Google DeepMind	22.8	—
30	XGen-7B · Unknown	21.6	—
31	Dolly 2.0-12b · Unknown	19.5	—
32	DeepSeek Coder 6.7B · DeepSeek	15.2	—
33	Baichuan 2-7B · Unknown	10.0	—
34	Cerebras-GPT-13B · Cerebras	9.9	—
35	DeepSeek Coder 1.3B · DeepSeek	0.5	—
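
The summary figures on this page (top score, median, minimum, headroom) can be rechecked against the rankings table. The following is a minimal sketch, assuming the median is the plain middle value of all 35 scores; the `rankings` list and variable names are illustrative and are not part of any BenchGecko API.

```python
from statistics import median

# (model, score) pairs copied from the rankings table, in rank order.
rankings = [
    ("DeepSeek V3", 93.7), ("Llama 3.1 405B", 93.7),
    ("Qwen2.5 72B Instruct", 92.7), ("DeepSeek-V2", 89.6),
    ("phi-3-medium 14B", 88.8), ("phi-3-small 7.4B", 87.6),
    ("GPT-3.5 Turbo", 83.2), ("Mixtral 8x7B Instruct", 83.1),
    ("Claude Instant", 81.7), ("Stable Beluga 2", 81.5),
    ("phi-3-mini 3.8B", 79.9), ("Qwen-14B", 79.2),
    ("Llama 3 8B Instruct", 77.1), ("Mistral 7B V0.1", 71.5),
    ("Phi 2", 67.9), ("Qwen2.5 Coder 32B Instruct", 60.7),
    ("Falcon-180B", 57.1), ("Qwen2.5 Coder 7B Instruct", 47.9),
    ("Llama 2-13B", 47.1), ("Nemotron-4 15B", 40.7),
    ("INTELLECT-1", 39.4), ("LLaMA-13B", 36.9),
    ("MPT-30B", 34.1), ("Yi 6B", 33.7),
    ("StarCoder 2 15B", 29.6), ("Qwen2.5 Coder 1.5B Instruct", 26.9),
    ("Phi-1.5", 25.9), ("DeepSeek Coder 33B", 22.9),
    ("Gemma 2B", 22.8), ("XGen-7B", 21.6),
    ("Dolly 2.0-12b", 19.5), ("DeepSeek Coder 6.7B", 15.2),
    ("Baichuan 2-7B", 10.0), ("Cerebras-GPT-13B", 9.9),
    ("DeepSeek Coder 1.3B", 0.5),
]

scores = [score for _, score in rankings]

top_score = max(scores)
# Collect every model at the top score so ties are not hidden.
leaders = [model for model, score in rankings if score == top_score]
median_score = median(scores)
min_score = min(scores)
# Headroom is the gap between the top score and the 100-point ceiling.
headroom = round(100 - top_score, 1)

print(f"models tested: {len(rankings)}")
print(f"leaders at {top_score}: {', '.join(leaders)}")
print(f"median: {median_score}, min: {min_score}, headroom: {headroom}")
```

Collecting all models at the top score, instead of taking only the first row, makes the tie between the two 93.7 entries visible.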