Which model leads on PIQA?

GPT-4o-mini from OpenAI leads PIQA with a score of 77.4. The median score across 25 tested models is 65.2.

The top score is 77.4 out of 100 (77%), so there is still meaningful room for improvement on PIQA.

Does PIQA predict performance on other benchmarks?

Yes · PIQA scores correlate 0.94 with Winogrande across 15 shared models. Models that do well on PIQA tend to do well on Winogrande.
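A cross-benchmark correlation like this could be computed roughly as follows. This is a minimal sketch, assuming a Pearson coefficient taken only over models scored on both benchmarks; the page does not say which coefficient it uses. The function name `shared_model_correlation` and the toy scores are hypothetical placeholders, not real benchmark results.

```python
from math import sqrt


def shared_model_correlation(a, b):
    """Return (Pearson r, number of shared models) for two score dicts."""
    # Only models that appear in both benchmarks contribute.
    shared = sorted(set(a) & set(b))
    if len(shared) < 2:
        raise ValueError("need at least two shared models")
    xs = [a[m] for m in shared]
    ys = [b[m] for m in shared]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y), len(shared)


# Toy placeholder scores (hypothetical, NOT real benchmark results).
bench_a = {"model_a": 50.0, "model_b": 60.0, "model_c": 70.0, "model_d": 80.0}
bench_b = {"model_a": 40.0, "model_b": 45.0, "model_c": 50.0, "model_e": 90.0}

r, n = shared_model_correlation(bench_a, bench_b)
print(f"r = {r:.2f} across {n} shared models")
```

Restricting the calculation to shared models is why the correlation is reported over 15 models rather than all 25: a model missing from either benchmark cannot contribute a pair.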

BenchGecko pulls updates daily. New model scores on PIQA appear as soon as they are published by Epoch AI or the model provider.

Benchmark · Knowledge · Competitive

Name: PIQA Benchmark
Creator: BenchGecko
License: https://creativecommons.org/licenses/by/4.0/

PIQA (Physical Interaction QA) tests intuitive physical reasoning by asking models to select the correct approach for everyday physical tasks.

Updated 2024-12-26

Models tested: 25
Top score: 77.4 (GPT-4o-mini)
Median: 65.2 (min 47.0)
Top-5 spread: σ 3.0 (Competitive)

Best score over time · one chart, every benchmark

Frontier on PIQA rose from 67.4 to 77.4 in 1 month · +10.0 points · latest leader GPT-4o-mini from OpenAI.

Pink dots = frontier records · 3 total

25 models tested · sorted by score

#	Model	Provider	Score	Price
1	GPT-4o-mini	OpenAI	77.4	$0.15
2	GPT-4o-mini (2024-07-18)	OpenAI	77.4	$0.15
3	Gemini 1.5 Flash (May 2024)	Google DeepMind	75.0	—
4	Llama 3.1 405B	Meta	71.8	—
5	Falcon-180B	TII	69.8	—
6	DeepSeek V3	DeepSeek	69.4	$0.32
7	DeepSeek-V2 (MoE-236B, May 2024)	DeepSeek	67.8	—
8	Gemma 2 9B	Google DeepMind	67.4	$0.03
9	Mixtral 8x7B Instruct	Mistral AI	67.2	$0.54
10	Mistral Nemo	Mistral AI	67.0	$0.02
11	Stable Beluga 2	Unknown	66.6	—
12	Mistral 7B V0.1	Mistral AI	66.0	—
13	Qwen2.5 72B Instruct	Alibaba Qwen	65.2	$0.36
14	Nemotron-4 15B	Unknown	64.8	—
15	MPT-30B	Unknown	63.8	—
16	Llama 3.1 8B Instruct	Meta	62.4	$0.02
17	Llama 2-13B	Meta	61.6	—
18	LLaMA-13B	Meta	60.2	—
19	Qwen-14B	Alibaba Qwen	59.8	—
20	Baichuan 2-7B	Unknown	56.2	—
21	Gemma 2B	Google DeepMind	54.6	—
22	Baichuan1-7B	Unknown	52.4	—
23	XGen-7B	Unknown	51.0	—
24	Dolly 2.0-12b	Unknown	50.8	—
25	Cerebras-GPT-13B	Cerebras	47.0	—
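
The summary figures on this page (top score, median, minimum, top-5 spread) can be reproduced from the table. This is a minimal sketch, assuming the top-5 spread is the population standard deviation of the five highest scores; the page does not state which estimator it uses, and the variable names are illustrative.

```python
from statistics import median, pstdev

# Scores copied from the leaderboard table, in rank order.
scores = [
    77.4, 77.4, 75.0, 71.8, 69.8, 69.4, 67.8, 67.4, 67.2, 67.0,
    66.6, 66.0, 65.2, 64.8, 63.8, 62.4, 61.6, 60.2, 59.8, 56.2,
    54.6, 52.4, 51.0, 50.8, 47.0,
]

top_score = max(scores)
median_score = median(scores)  # 13th of 25 values
min_score = min(scores)

# Assumption: "Top-5 spread" is the population sigma of the five best scores.
top5 = sorted(scores, reverse=True)[:5]
top5_spread = pstdev(top5)

print(len(scores), top_score, median_score, min_score, round(top5_spread, 1))
# prints: 25 77.4 65.2 47.0 3.0
```

Under this assumption the result rounds to the reported σ 3.0. The sample standard deviation of the same five scores is about 3.4, so the population form is the one that matches.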