HellaSwag
HellaSwag tests commonsense reasoning by asking models to predict the most plausible continuation of everyday scenarios.
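As a rough illustration of what "predict the most plausible continuation" means in practice, the sketch below treats each item as a context with several candidate endings, picks the ending with the highest model score, and reports accuracy as a percentage. The function names `pick_ending` and `accuracy`, and the numbers in `example_scores`, are hypothetical and chosen only for illustration; this page does not say how the listed scores were actually produced.

```python
from typing import Sequence


def pick_ending(scores: Sequence[float]) -> int:
    """Return the index of the highest-scoring candidate ending."""
    return max(range(len(scores)), key=lambda i: scores[i])


def accuracy(all_scores: Sequence[Sequence[float]], gold: Sequence[int]) -> float:
    """Percentage of items where the top-scoring ending is the labelled one."""
    correct = sum(pick_ending(s) == g for s, g in zip(all_scores, gold))
    return 100.0 * correct / len(gold)


# Hypothetical per-ending scores (for example, log-likelihoods from a model).
# Each inner list holds one score per candidate ending of a single item.
example_scores = [
    [-2.1, -0.7, -3.0, -1.9],
    [-1.2, -1.5, -0.4, -2.2],
    [-0.9, -1.1, -1.3, -0.8],
]
example_gold = [1, 2, 0]  # assumed index of the correct ending for each item

result = accuracy(example_scores, example_gold)
print(f"accuracy: {result:.1f}")
```

In this toy data the first two items are answered correctly and the third is not, so the reported accuracy is two out of three.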
Models Tested: 37
Top Score: 85.6
Average Score: 69.8
Rankings
| # | Model | Organization | Score |
|---|---|---|---|
| 1 | | | 85.6 |
| 2 | Falcon-180B | TII | 85.3 |
| 3 | | | 85.2 |
| 4 | | | 82.8 |
| 5 | | | 82.3 |
| 6 | | | 80.4 |
| 7 | Falcon-40B | TII | 80.4 |
| 8 | | | 79.7 |
| 9 | | | 78.9 |
| 10 | Stable Beluga 2 | unknown | 78.8 |
| 11 | | | 77.3 |
| 12 | Falcon 2 11B | TII | 77.2 |
| 13 | | | 77.1 |
| 14 | Nemotron-4 15B | unknown | 76.5 |
| 15 | | | 76.5 |
| 16 | | | 76.3 |
| 17 | | | 74.7 |
| 18 | | | 74.3 |
| 19 | | | 72.3 |
| 20 | Falcon-7B | TII | 70.8 |
| 21 | | | 69.6 |
| 22 | | | 69.3 |
| 23 | | | 69.1 |
| 24 | | | 68.9 |
| 25 | MPT-7B | unknown | 68.5 |
| 26 | | | 68.3 |
| 27 | Yi 6B | unknown | 65.9 |
| 28 | XGen-7B | unknown | 65.6 |
| 29 | INTELLECT-1 | unknown | 61.9 |
| 30 | | | 61.9 |
| 31 | Baichuan2-13B | unknown | 61.1 |
| 32 | Dolly 2.0-12b | unknown | 61.1 |
| 33 | Baichuan 2-7B | unknown | 57.3 |
| 34 | | | 49.1 |
| 35 | | | 45.9 |
| 36 | | | 38.1 |
| 37 | | | 30.1 |
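
The three summary figures at the top of the page can be checked against the rankings with a few lines of arithmetic. The sketch below copies the 37 scores from the table in rank order and recomputes the count, the maximum, and the mean rounded to one decimal place. The variable names are illustrative, and rounding to one decimal is an assumption about how the page formats its average.

```python
# Scores copied from the rankings table, in rank order (1 through 37).
scores = [
    85.6, 85.3, 85.2, 82.8, 82.3, 80.4, 80.4, 79.7, 78.9, 78.8,
    77.3, 77.2, 77.1, 76.5, 76.5, 76.3, 74.7, 74.3, 72.3, 70.8,
    69.6, 69.3, 69.1, 68.9, 68.5, 68.3, 65.9, 65.6, 61.9, 61.9,
    61.1, 61.1, 57.3, 49.1, 45.9, 38.1, 30.1,
]

models_tested = len(scores)
top_score = max(scores)
# Assumption: the page rounds the mean to one decimal place.
average_score = round(sum(scores) / len(scores), 1)

print(f"Models Tested: {models_tested}")
print(f"Top Score: {top_score}")
print(f"Average Score: {average_score}")
```

The recomputed values agree with the figures shown on the page: 37 models, a top score of 85.6, and an average of 69.8.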