BBH
BIG-Bench Hard: a curated subset of 23 challenging tasks from BIG-Bench where language models previously failed to outperform average humans.
Models Tested: 37
Top Score: 83.3
Average Score: 42.3
Rankings
| # | Model | Organization | Score |
|---|---|---|---|
| 1 | | | 83.3 |
| 2 | | | 77.2 |
| 3 | | | 75.2 |
| 4 | | | 73.1 |
| 5 | | | 72.1 |
| 6 | | | 71.7 |
| 7 | | | 66.8 |
| 8 | | | 66.8 |
| 9 | | | 66.8 |
| 10 | | | 66.8 |
| 11 | Yi-34B | unknown | 62.3 |
| 12 | | | 62.3 |
| 13 | Stable Beluga 2 | unknown | 59.1 |
| 14 | | | 53.2 |
| 15 | | | 45.9 |
| 16 | Nemotron-4 15B | unknown | 44.9 |
| 17 | | | 44.5 |
| 18 | | | 44.3 |
| 19 | | | 41.5 |
| 20 | | | 40.1 |
| 21 | | | 40.0 |
| 22 | | | 33.3 |
| 23 | Baichuan2-13B | unknown | 32.0 |
| 24 | Yi 6B | unknown | 29.6 |
| 25 | | | 26.7 |
| 26 | | | 25.5 |
| 27 | Baichuan 2-7B | unknown | 22.1 |
| 28 | | | 18.9 |
| 29 | MPT-30B | unknown | 17.3 |
| 30 | | | 17.2 |
| 31 | Falcon-40B | TII | 16.1 |
| 32 | MPT-7B | unknown | 14.1 |
| 33 | | | 13.6 |
| 34 | INTELLECT-1 | unknown | 13.1 |
| 35 | | | 11.3 |
| 36 | Baichuan1-7B | unknown | 10.0 |
| 37 | Falcon-7B | TII | 5.0 |
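
The summary figures (models tested, top score, average score) appear to follow directly from the scores in the rankings table. As a minimal sketch, assuming the average is a plain unweighted mean rounded to one decimal place (the page does not state how it is computed), they can be reproduced as follows. The variable names are illustrative only.

```python
from statistics import mean

# Scores copied from the rankings table, in rank order (37 entries).
scores = [
    83.3, 77.2, 75.2, 73.1, 72.1, 71.7, 66.8, 66.8, 66.8, 66.8,
    62.3, 62.3, 59.1, 53.2, 45.9, 44.9, 44.5, 44.3, 41.5, 40.1,
    40.0, 33.3, 32.0, 29.6, 26.7, 25.5, 22.1, 18.9, 17.3, 17.2,
    16.1, 14.1, 13.6, 13.1, 11.3, 10.0, 5.0,
]

models_tested = len(scores)
top_score = max(scores)
# Assumption: unweighted mean, rounded to one decimal place.
average_score = round(mean(scores), 1)

print(models_tested, top_score, average_score)  # 37 83.3 42.3
```

Under that assumption the computed values match the figures shown at the top of the page.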