Tested on 20 benchmarks with a 34.9% average score. Top scores: Chatbot Arena Elo, Overall (970.9), TriviaQA (77.9%), LAMBADA (75.2%).
BIG-Bench Hard. 23 challenging tasks from BIG-Bench where prior language models fell below average human performance.
HuggingFace MuSR (Multi-Step Reasoning). Tests multi-hop reasoning that requires chaining multiple facts together.
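The multi-hop pattern MuSR-style items test can be sketched minimally: answering requires following one fact to an intermediate entity and then a second fact from there, rather than recalling a single fact. The entities and relations below are hypothetical illustrations, not MuSR data.

```python
# Toy fact store; all names and relations are invented for illustration.
facts = {
    "Ann": {"works_in": "Lab 3"},
    "Lab 3": {"located_in": "Building B"},
}

def two_hop(entity: str, rel1: str, rel2: str) -> str:
    """Follow rel1 from `entity` to an intermediate, then rel2 from there."""
    intermediate = facts[entity][rel1]
    return facts[intermediate][rel2]

# "Where is Ann's workplace located?" needs both facts chained together.
print(two_hop("Ann", "works_in", "located_in"))  # Building B
```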
GSM8K. Grade school math word problems: 8,500 problems testing multi-step arithmetic reasoning. A foundational math benchmark.
HuggingFace evaluation of MATH Level 5 problems. Competition math requiring advanced reasoning and proof construction.
TriviaQA. Trivia questions sourced from trivia enthusiasts and quiz websites. Tests breadth of general knowledge.
LAMBADA. Language modeling benchmark testing the ability to predict the last word of passages that require long-range context understanding.
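The LAMBADA metric reduces to exact-match accuracy on the final word. A minimal harness sketch, where `predict_last_word` stands in for any model (here a hypothetical callable from context to a predicted word), and the passage is a toy example rather than LAMBADA data:

```python
def split_last_word(passage: str) -> tuple[str, str]:
    """Split a passage into (context, target last word)."""
    context, _, target = passage.rstrip().rpartition(" ")
    return context, target

def lambada_accuracy(passages, predict_last_word) -> float:
    """Fraction of passages whose final word the model predicts exactly.

    `predict_last_word` is a hypothetical callable: context -> predicted word.
    """
    hits = 0
    for passage in passages:
        context, target = split_last_word(passage)
        hits += predict_last_word(context) == target
    return hits / len(passages)

# Toy check with an oracle that always answers "door".
passages = ["She reached for the handle and slowly opened the door"]
print(lambada_accuracy(passages, lambda ctx: "door"))  # 1.0
```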
Sentence completion requiring commonsense reasoning about physical and social situations. Tests real-world understanding.
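Sentence-completion benchmarks of this kind are typically scored by having the model rate each candidate ending and picking the highest-scoring one. A sketch under that assumption; `score` is a hypothetical callable (in practice, something like a length-normalized log-likelihood of the ending given the context), and the toy scorer below just prefers the shorter ending as a stand-in:

```python
def pick_ending(context: str, endings: list[str], score) -> int:
    """Return the index of the highest-scoring candidate ending.

    `score(context, ending)` is a hypothetical model-scoring callable.
    """
    return max(range(len(endings)), key=lambda i: score(context, endings[i]))

endings = [
    "puts the kettle on the stove",
    "throws the kettle out of the window",
]
# Toy scorer standing in for a model: shorter ending scores higher.
print(pick_ending("To make tea, she", endings, lambda c, e: -len(e)))  # 0
```

Accuracy is then the fraction of items where the chosen index matches the labeled ending.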
- Type: text
- Context: N/A
- Released: Jan 2024
- License: Open Source
- Status: benchmark-only