Evaluated on 6 benchmarks with a 28.3% average score. Top scores: MMLU (67.9%), Winogrande (50.2%), GPQA Diamond (20.8%).
Unusual and adversarial machine learning challenges. Tests robustness of reasoning about edge cases in ML systems.
Competition-level math from AMC, AIME, and olympiad problems. Level 5 is the hardest tier, requiring creative problem-solving.
Mock AIME (American Invitational Mathematics Exam) problems from OTIS. Tests mathematical competition performance.
Massive Multitask Language Understanding. 57 subjects spanning STEM, the humanities, and the social sciences. One of the most widely cited knowledge benchmarks.
Commonsense coreference resolution. Tests understanding of pronoun references in ambiguous sentences.
Graduate-level science questions written by PhD experts. The Diamond subset contains the highest-quality questions, those that experts answer correctly but skilled non-experts miss, testing deep understanding.
- Type: text
- Context: N/A
- Released: Jan 2024
- License: Proprietary
- Status: benchmark-only