Tested on 14 benchmarks with a 42.5% average score. Top scores: TriviaQA (79.6%), LAMBADA (76.5%), HellaSwag (74.3%).
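The headline number is an unweighted (macro) average over the per-benchmark scores. A minimal sketch of that computation, using only the three top scores quoted above (the reported 42.5% is the same calculation over all 14 benchmarks, whose individual scores are not listed here):

```python
def macro_average(scores):
    """Unweighted mean over per-benchmark scores (percentages)."""
    scores = list(scores)
    return sum(scores) / len(scores)

# The three top scores reported above.
top_scores = {"TriviaQA": 79.6, "LAMBADA": 76.5, "HellaSwag": 74.3}
print(round(macro_average(top_scores.values()), 1))  # 76.8
```

A macro average weights every benchmark equally regardless of how many questions it contains; a per-question (micro) average over the same runs would generally give a different number.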
BIG-Bench Hard. 23 challenging tasks from BIG-Bench on which prior language models fell below average human performance.
GSM8K. Grade school math word problems: 8,500 problems testing multi-step arithmetic reasoning. A foundational math benchmark.
MATH. Competition-level mathematics from AMC, AIME, and olympiad problems. Level 5 is the hardest tier, requiring creative problem-solving.
TriviaQA. Trivia questions sourced from trivia enthusiasts and quiz websites. Tests breadth of general knowledge.
LAMBADA. Language-modeling benchmark testing the ability to predict the last word of passages that require long-range context understanding.
HellaSwag. Sentence completion requiring commonsense reasoning about physical and social situations. Tests real-world understanding.
- Type: text
- Context: N/A
- Released: Jan 2024
- License: Open Source
- Status: benchmark-only