PIQA
PIQA (Physical Interaction QA) tests intuitive physical reasoning by asking models to select the correct approach for everyday physical tasks.
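As a rough illustration of how a benchmark of this shape is typically scored, the sketch below assumes each item pairs a goal with two candidate solutions and that the reported score is percent accuracy. The page itself does not state the metric, so treat that as an assumption. The `Item` class, the two sample items, and the `always_first` baseline are hypothetical and are not drawn from the PIQA dataset.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass
class Item:
    """One two-choice question (hypothetical structure, for illustration only)."""
    goal: str
    solutions: Tuple[str, str]
    label: int  # index (0 or 1) of the correct solution


def accuracy(items: Sequence[Item],
             choose: Callable[[str, Tuple[str, str]], int]) -> float:
    """Return the percentage of items for which `choose` picks the labeled solution."""
    correct = sum(1 for it in items if choose(it.goal, it.solutions) == it.label)
    return 100.0 * correct / len(items)


# Made-up items, not taken from the real dataset.
ITEMS = [
    Item("Keep a door from slamming shut",
         ("Wedge a rubber doorstop under it",
          "Lean a sheet of paper against it"), 0),
    Item("Cool a hot drink quickly",
         ("Wrap the cup in a towel",
          "Add a few ice cubes"), 1),
]


def always_first(goal: str, solutions: Tuple[str, str]) -> int:
    """Trivial baseline that ignores the question and picks the first solution."""
    return 0


score = accuracy(ITEMS, always_first)
print(f"{score:.1f}")
```

With two options per item, a baseline that ignores the question lands near 50 on a balanced set, which is a useful reference point when reading the lower end of a ranking like this one.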
Models Tested: 36
Top Score: 77.4
Average Score: 63.2
Rankings
| # | Model | Organization | Score |
|---|---|---|---|
| 1 | | | 77.4 |
| 2 | | | 77.4 |
| 3 | | | 75.0 |
| 4 | | | 71.8 |
| 5 | Falcon-180B | TII | 69.8 |
| 6 | | | 69.4 |
| 7 | | | 67.8 |
| 8 | | | 67.4 |
| 9 | | | 67.2 |
| 10 | | | 67.0 |
| 11 | Stable Beluga 2 | unknown | 66.6 |
| 12 | | | 66.0 |
| 13 | Falcon-40B | TII | 66.0 |
| 14 | | | 65.6 |
| 15 | | | 65.6 |
| 16 | | | 65.2 |
| 17 | Nemotron-4 15B | unknown | 64.8 |
| 18 | | | 64.6 |
| 19 | | | 63.8 |
| 20 | MPT-30B | unknown | 63.8 |
| 21 | | | 62.4 |
| 22 | | | 62.4 |
| 23 | | | 61.6 |
| 24 | MPT-7B | unknown | 61.2 |
| 25 | Falcon-7B | TII | 60.6 |
| 26 | | | 60.2 |
| 27 | | | 59.8 |
| 28 | | | 59.6 |
| 29 | | | 57.6 |
| 30 | Baichuan2-13B | unknown | 56.2 |
| 31 | | | 55.8 |
| 32 | | | 54.6 |
| 33 | Baichuan1-7B | unknown | 52.4 |
| 34 | XGen-7B | unknown | 51.0 |
| 35 | Dolly 2.0-12b | unknown | 50.8 |
| 36 | | | 47.0 |
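The summary figures at the top of the page (36 models, top score 77.4, average 63.2) can be cross-checked against the Score column. The sketch below copies the 36 scores from the table in rank order and recomputes them. The variable names are illustrative, and rounding the average to one decimal place is an assumption about how the page formats it.

```python
from statistics import mean

# Scores copied from the rankings table, rank 1 through 36.
SCORES = [
    77.4, 77.4, 75.0, 71.8, 69.8, 69.4, 67.8, 67.4, 67.2, 67.0,
    66.6, 66.0, 66.0, 65.6, 65.6, 65.2, 64.8, 64.6, 63.8, 63.8,
    62.4, 62.4, 61.6, 61.2, 60.6, 60.2, 59.8, 59.6, 57.6, 56.2,
    55.8, 54.6, 52.4, 51.0, 50.8, 47.0,
]

models_tested = len(SCORES)
top_score = max(SCORES)
# Assumed formatting: the page appears to show one decimal place.
average_score = round(mean(SCORES), 1)

print(models_tested, top_score, average_score)
```

The recomputed values agree with the stated summary, which suggests the average is a plain unweighted mean over all listed models.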