BBH

BIG-Bench Hard: a curated subset of 23 challenging tasks from BIG-Bench where language models previously failed to outperform average humans.

Models Tested: 37
Top Score: 83.3
Average Score: 42.3