HLE
HLE (Humanity's Last Exam): crowdsourced expert-level questions designed to be among the hardest possible challenges for AI systems across all domains.
Models Tested: 27
Top Score: 34.4
Average Score: 14.0
Rankings
| # | Model | Score |
|---|---|---|
| 1 | | 34.4 |
| 2 | | 31.1 |
| 3 | | 28.2 |
| 4 | | 24.2 |
| 5 | | 24.2 |
| 6 | | 21.6 |
| 7 | | 21.6 |
| 8 | | 21.4 |
| 9 | Kimi K2.5 (moonshotai) | 20.6 |
| 10 | | 19.8 |
| 11 | | 19.8 |
| 12 | | 17.7 |
| 13 | | 16.3 |
| 14 | | 15.4 |
| 15 | | 13.9 |
| 16 | | 9.4 |
| 17 | | 7.7 |
| 18 | | 7.1 |
| 19 | | 6.2 |
| 20 | GLM 4.5 (z-ai) | 3.7 |
| 21 | | 3.4 |
| 22 | | 3.4 |
| 23 | | 3.1 |
| 24 | | 1.9 |
| 25 | | 0.9 |
| 26 | | 0.7 |
| 27 | | 0.6 |
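
The summary figures (27 models tested, top score 34.4, average score 14.0) can be rechecked from the scores in the rankings table. The following is a minimal sketch in Python, assuming the average is a plain arithmetic mean rounded to one decimal place; the variable names are illustrative and not part of the leaderboard.

```python
from statistics import mean

# Scores copied from the rankings table, in rank order (1 through 27).
scores = [
    34.4, 31.1, 28.2, 24.2, 24.2, 21.6, 21.6, 21.4, 20.6,
    19.8, 19.8, 17.7, 16.3, 15.4, 13.9, 9.4, 7.7, 7.1,
    6.2, 3.7, 3.4, 3.4, 3.1, 1.9, 0.9, 0.7, 0.6,
]

# Assumption: the page's "Average Score" is an unweighted mean of all
# listed scores, rounded to one decimal place.
models_tested = len(scores)
top_score = max(scores)
average_score = round(mean(scores), 1)

print(f"Models Tested: {models_tested}")
print(f"Top Score: {top_score}")
print(f"Average Score: {average_score}")
```

Under that assumption, the recomputed values agree with the figures shown at the top of the page.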