HellaSwag
HellaSwag tests commonsense reasoning by asking models to predict the most plausible continuation of everyday scenarios.
The Frontier
[Chart: best score over time]
Full rankings
29 models tested · sorted by score
| # | Model | Score |
|---|---|---|
| 1 | GPT-4 Turbo | 93.7 |
| 2 | Llama 3.1 405B | 85.6 |
| 3 | Falcon-180B | 85.3 |
| 4 | DeepSeek V3 | 85.2 |
| 5 | DeepSeek-V2 (MoE-236B, May 2024) | 82.8 |
| 6 | | 82.3 |
| 7 | | 79.7 |
| 8 | Stable Beluga 2 | 78.8 |
| 9 | | 77.3 |
| 10 | | 77.2 |
| 11 | Nemotron-4 15B | 76.5 |
| 12 | | 76.5 |
| 13 | | 74.7 |
| 14 | | 74.3 |
| 15 | | 72.3 |
| 16 | | 69.3 |
| 17 | | 69.1 |
| 18 | | 68.9 |
| 19 | MPT-30B | 68.5 |
| 20 | Yi 6B | 65.9 |
| 21 | XGen-7B | 65.6 |
| 22 | INTELLECT-1 | 61.9 |
| 23 | | 61.9 |
| 24 | Dolly 2.0-12b | 61.1 |
| 25 | Baichuan 2-7B | 57.3 |
| 26 | | 49.1 |
| 27 | | 45.9 |
| 28 | | 38.1 |
| 29 | | 30.1 |
Score distribution
[Chart: where models cluster]
Correlated benchmarks
Pearson r · original research
Benchmarks that track with HellaSwag
Pearson correlation, computed across the models scored on both benchmarks. Values closer to 1 mean a benchmark's scores more strongly predict HellaSwag scores.
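As a rough illustration of how such a figure could be computed, the sketch below restricts two score tables to the models they share and then applies the standard Pearson formula. The function names (`pearson_r`, `correlate_benchmarks`) and the model names and scores are hypothetical placeholders, not BenchGecko's code or real leaderboard data.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / sqrt(var_x * var_y)

def correlate_benchmarks(scores_a, scores_b):
    """Correlate two {model: score} dicts over the models present in both."""
    shared = sorted(set(scores_a) & set(scores_b))
    r = pearson_r([scores_a[m] for m in shared],
                  [scores_b[m] for m in shared])
    return r, len(shared)

# Hypothetical placeholder scores, NOT real leaderboard data.
bench_a = {"model-a": 60.0, "model-b": 70.0, "model-c": 80.0, "model-d": 90.0}
bench_b = {"model-a": 65.0, "model-b": 72.0, "model-c": 85.0, "model-e": 50.0}

r, n_shared = correlate_benchmarks(bench_a, bench_b)
print(f"r = {r:.2f} across {n_shared} shared models")
```

Only models that appear in both tables contribute, which is why a correlation is reported alongside a shared-model count.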
Frequently asked
About HellaSwag
What does HellaSwag measure?
HellaSwag tests commonsense reasoning by asking models to predict the most plausible continuation of everyday scenarios. So far, 29 AI models have been tested on it, with scores ranging from 30.1 to 93.7 out of 100.
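HellaSwag is typically run as a four-way multiple-choice task: each item pairs a context with four candidate endings, and the score is the percentage of items where the model picks the correct one. The sketch below assumes the model assigns each ending a numeric plausibility score (for example a log-likelihood). The helper names and the numbers are hypothetical, and this page does not say how the listed scores were computed.

```python
def pick_ending(ending_scores):
    """Return the index of the highest-scoring candidate ending."""
    return max(range(len(ending_scores)), key=lambda i: ending_scores[i])

def accuracy_percent(items):
    """items: list of (ending_scores, correct_index) pairs."""
    correct = sum(1 for scores, label in items if pick_ending(scores) == label)
    return 100.0 * correct / len(items)

# Hypothetical per-ending scores (e.g. log-likelihoods), NOT real model output.
items = [
    ([-4.1, -2.3, -5.0, -3.8], 1),  # picks ending 1, correct
    ([-1.9, -2.7, -3.3, -2.2], 0),  # picks ending 0, correct
    ([-3.0, -2.9, -2.1, -4.4], 3),  # picks ending 2, wrong
    ([-5.2, -4.8, -4.9, -3.1], 3),  # picks ending 3, correct
]

score = accuracy_percent(items)
print(f"accuracy: {score:.1f} out of 100")
```

Evaluation harnesses differ in how they score endings (for instance, whether likelihoods are length-normalized), so scores from different sources are not always directly comparable.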
Which model leads on HellaSwag?
GPT-4 Turbo from OpenAI leads HellaSwag with a score of 93.7. The median score across 29 tested models is 72.3.
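The summary figures can be checked directly against the scores in the rankings table. A minimal sketch using Python's standard library:

```python
from statistics import median

# Scores copied from the rankings table, highest to lowest.
scores = [93.7, 85.6, 85.3, 85.2, 82.8, 82.3, 79.7, 78.8, 77.3, 77.2,
          76.5, 76.5, 74.7, 74.3, 72.3, 69.3, 69.1, 68.9, 68.5, 65.9,
          65.6, 61.9, 61.9, 61.1, 57.3, 49.1, 45.9, 38.1, 30.1]

n_models = len(scores)
top, bottom = max(scores), min(scores)
mid = median(scores)  # with 29 values, this is the 15th-ranked score

print(f"{n_models} models, range {bottom} to {top}, median {mid}")
```

With an odd number of models, the median is simply the score of the model ranked in the middle (rank 15 of 29).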
Is HellaSwag saturated?
No. The top score is 93.7 out of 100 (about 94%), so there is still meaningful room for improvement on HellaSwag.
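The arithmetic behind this answer is simple to reproduce. In the sketch below, `SATURATION_THRESHOLD` is a hypothetical cutoff chosen for illustration; this page does not state the threshold it uses.

```python
TOP_SCORE = 93.7   # top HellaSwag score reported on this page
MAX_SCORE = 100.0  # maximum possible score

# Hypothetical cutoff for calling a benchmark "saturated"; an assumption,
# not a threshold stated on this page.
SATURATION_THRESHOLD = 0.95

fraction_of_max = TOP_SCORE / MAX_SCORE
headroom = MAX_SCORE - TOP_SCORE
is_saturated = fraction_of_max >= SATURATION_THRESHOLD

print(f"{fraction_of_max:.0%} of max, {headroom:.1f} points of headroom, "
      f"saturated: {is_saturated}")
```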
Does HellaSwag predict performance on other benchmarks?
Yes. HellaSwag scores have a Pearson correlation of 0.93 with PIQA scores across 17 shared models, so models that do well on HellaSwag tend to do well on PIQA.
How often is HellaSwag data refreshed?
BenchGecko pulls updates daily. New model scores on HellaSwag appear as soon as they are published by Epoch AI or the model provider.
- Category: Reasoning
- Max score: 100
- Models: 29
- Updated: 2025-04-15
Top on HellaSwag
- GPT-4 Turbo · 93.7
- Llama 3.1 405B · 85.6
- Falcon-180B · 85.3
- DeepSeek V3 · 85.2
- DeepSeek-V2 (MoE-236B, May 2024) · 82.8
More reasoning benchmarks
Same category · related evaluations