PIQA
PIQA (Physical Interaction QA) tests intuitive physical reasoning by asking models to select the correct approach for everyday physical tasks.
Full rankings
25 models tested · sorted by score
| # | Model | Score |
|---|---|---|
| 1 | GPT-4o-mini | 77.4 |
| 2 | GPT-4o-mini (2024-07-18) | 77.4 |
| 3 | Gemini 1.5 Flash (May 2024) | 75.0 |
| 4 | Llama 3.1 405B | 71.8 |
| 5 | Falcon-180B | 69.8 |
| 6 | | 69.4 |
| 7 | | 67.8 |
| 8 | | 67.4 |
| 9 | | 67.2 |
| 10 | | 67.0 |
| 11 | Stable Beluga 2 | 66.6 |
| 12 | | 66.0 |
| 13 | | 65.2 |
| 14 | Nemotron-4 15B | 64.8 |
| 15 | MPT-30B | 63.8 |
| 16 | | 62.4 |
| 17 | | 61.6 |
| 18 | | 60.2 |
| 19 | | 59.8 |
| 20 | Baichuan 2-7B | 56.2 |
| 21 | | 54.6 |
| 22 | Baichuan1-7B | 52.4 |
| 23 | XGen-7B | 51.0 |
| 24 | Dolly 2.0-12b | 50.8 |
| 25 | | 47.0 |
Correlated benchmarks
Pearson r · original research
Benchmarks that track with PIQA
Pearson correlation across models scored on both benchmarks. Closer to 1 = strongly predictive.
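As a rough illustration of how a cross-benchmark correlation like this could be computed, the sketch below restricts two score tables to the models they share and applies the standard Pearson formula. The model names and scores are hypothetical placeholders, not values from this page, and the helper names `pearson_r` and `correlate_shared` are assumptions made for the example.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / sqrt(var_x * var_y)


def correlate_shared(scores_a, scores_b):
    """Correlate two {model: score} tables over the models present in both."""
    shared = sorted(set(scores_a) & set(scores_b))
    r = pearson_r([scores_a[m] for m in shared], [scores_b[m] for m in shared])
    return r, len(shared)


# Hypothetical scores, for illustration only.
piqa = {"model_a": 50.0, "model_b": 60.0, "model_c": 70.0, "model_d": 80.0}
other = {"model_b": 55.0, "model_c": 65.0, "model_d": 75.0, "model_e": 90.0}

r, n_shared = correlate_shared(piqa, other)
print(f"r = {r:.2f} across {n_shared} shared models")
```

Only models scored on both benchmarks enter the calculation, which is why the number of shared models matters when reading a correlation figure.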
Frequently asked
About PIQA
What does PIQA measure?
PIQA (Physical Interaction QA) tests intuitive physical reasoning by asking models to select the correct approach for everyday physical tasks. 25 AI models have been tested on it. Scores range from 47.0 to 77.4 out of 100.
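To make the task format concrete, here is a minimal sketch of scoring a two-choice benchmark of this kind. It assumes each item has a goal, two candidate solutions, and a label marking the correct one, and that the score is plain accuracy expressed as a percentage; this page does not state how its scores are computed. The items and the `always_first` chooser are made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Item:
    goal: str
    sol1: str
    sol2: str
    label: int  # 0 means sol1 is correct, 1 means sol2 is correct


def accuracy_percent(items: Sequence[Item], choose: Callable[[Item], int]) -> float:
    """Percentage of items where the chooser picks the labeled solution."""
    if not items:
        raise ValueError("no items to score")
    correct = sum(1 for item in items if choose(item) == item.label)
    return 100.0 * correct / len(items)


# Made-up items in the assumed format (not taken from the benchmark).
items = [
    Item("Keep a door from swinging shut", "Wedge a doorstop under it", "Oil the hinges", 0),
    Item("Cool a hot drink quickly", "Wrap the cup in a towel", "Add ice cubes", 1),
    Item("Loosen a stuck jar lid", "Run the lid under hot water", "Put the jar in the freezer", 0),
    Item("Dry wet shoes overnight", "Stuff them with newspaper", "Seal them in a plastic bag", 0),
]


def always_first(item: Item) -> int:
    """Trivial stand-in for a model: always picks the first solution."""
    return 0


score = accuracy_percent(items, always_first)
print(f"accuracy: {score:.1f}")
```

In a real evaluation, `choose` would wrap a model call that compares the two candidate solutions for the given goal.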
Which model leads on PIQA?
GPT-4o-mini from OpenAI leads PIQA with a score of 77.4; the dated snapshot GPT-4o-mini (2024-07-18) is listed at the same score. The median score across 25 tested models is 65.2.
Is PIQA saturated?
No. The top score is 77.4 out of 100, so there is still meaningful room for improvement on PIQA.
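The summary figures quoted in these answers can be checked directly against the 25 scores in the rankings table. The sketch below does that; the variable names are illustrative, and "headroom" here simply means the gap between the top score and the stated maximum of 100.

```python
from statistics import median

# The 25 scores from the rankings table, highest first.
scores = [
    77.4, 77.4, 75.0, 71.8, 69.8, 69.4, 67.8, 67.4, 67.2, 67.0,
    66.6, 66.0, 65.2, 64.8, 63.8, 62.4, 61.6, 60.2, 59.8, 56.2,
    54.6, 52.4, 51.0, 50.8, 47.0,
]
MAX_SCORE = 100

top = max(scores)
low = min(scores)
mid = median(scores)  # with 25 values this is the 13th-ranked score
headroom = round(MAX_SCORE - top, 1)  # points left before the maximum

print(f"models: {len(scores)}, range: {low} to {top}")
print(f"median: {mid}, headroom: {headroom}")
```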
Does PIQA predict performance on other benchmarks?
Yes. PIQA scores have a Pearson correlation of 0.94 with Winogrande across 15 shared models, so models that do well on PIQA tend to do well on Winogrande.
How often is PIQA data refreshed?
BenchGecko pulls updates daily. New model scores on PIQA appear as soon as they are published by Epoch AI or the model provider.
- Category: Knowledge
- Max score: 100
- Models: 25
- Updated: 2024-12-26
Top on PIQA
- GPT-4o-mini · 77.4
- GPT-4o-mini (2024-07-18) · 77.4
- Gemini 1.5 Flash (May 2024) · 75.0
- Llama 3.1 405B · 71.8
- Falcon-180B · 69.8

More knowledge benchmarks
Same category · related evaluations