Tested on 13 benchmarks with a 47.8% average score. Top scores: ARC (81.5%), HellaSwag (78.8%), LAMBADA (71.3%).
BIG-Bench Hard. 23 challenging tasks from BIG-Bench where prior language models fell below average human performance.
HuggingFace MuSR (Multistep Soft Reasoning). Tests multi-hop reasoning that requires chaining multiple facts together.
GSM8K. Grade school math word problems: 8,500 problems testing multi-step arithmetic reasoning. A foundational math benchmark.
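To illustrate what "multi-step arithmetic reasoning" means here, a minimal sketch of a GSM8K-style problem solved step by step (the problem itself is an invented example, not from the benchmark):

```python
# Hypothetical GSM8K-style word problem:
# "A baker makes 12 trays of 8 muffins each and sells 70 muffins.
#  How many muffins are left?"

def solve_sample_problem():
    muffins_baked = 12 * 8             # step 1: total muffins baked
    muffins_left = muffins_baked - 70  # step 2: subtract those sold
    return muffins_left

print(solve_sample_problem())  # 26
```

Grading is exact-match on the final numeric answer; credit depends on getting every intermediate step right, which is what makes these problems multi-step rather than single-lookup.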
HuggingFace evaluation of MATH Level 5 problems. Competition-level math requiring advanced multi-step reasoning.
AI2 Reasoning Challenge. Grade-school science questions requiring multi-step reasoning. Easy and Challenge sets test different difficulty levels.
HellaSwag. Sentence completion requiring commonsense reasoning about physical and social situations. Tests real-world understanding.
LAMBADA. Language modeling benchmark testing the ability to predict the last word of passages that require long-range context understanding.
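A sketch of how LAMBADA-style accuracy is scored: a prediction counts only if it exactly matches the passage's final word. The passages and predictions below are invented placeholders, not benchmark data:

```python
# Score last-word predictions against the true final word of each passage.
def last_word_accuracy(passages, predictions):
    correct = 0
    for passage, pred in zip(passages, predictions):
        target = passage.split()[-1]  # the held-out final word
        if pred == target:
            correct += 1
    return correct / len(passages)

passages = [
    "She opened the box and found her lost ring",
    "After hours of searching they finally gave up",
]
predictions = ["ring", "hope"]
print(last_word_accuracy(passages, predictions))  # 0.5
```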
- Type: text
- Context: N/A
- Released: Jan 2024
- License: Proprietary
- Status: benchmark-only