Tested on 10 benchmarks, with an average score of 24.3%. Top scores: GSM8K (57.7%), MMLU (52.1%), ARC AI2 (29.6%).
HuggingFace MuSR (Multistep Soft Reasoning). Tests multi-hop reasoning that requires chaining multiple facts together.
Grade-school math word problems. Roughly 8,500 problems testing multi-step arithmetic reasoning. A foundational math benchmark.
HuggingFace evaluation of MATH Level 5 problems. Competition math requiring advanced reasoning and proof construction.
Massive Multitask Language Understanding. 57 subjects from STEM, humanities, and social sciences. The most widely-cited knowledge benchmark.
AI2 Reasoning Challenge. Grade-school science questions requiring multi-step reasoning. Easy and Challenge sets test different difficulty levels.
Commonsense coreference resolution. Tests understanding of pronoun references in ambiguous sentences.
- Type: text
- Context: N/A
- Released: Jan 2024
- License: Open Source
- Status: benchmark-only