Tested on 16 benchmarks with 53.3% average. Top scores: HELM — IFEval (95.1%), MATH level 5 (90.9%), HELM — MMLU-Pro (79.9%).
Multi-language code editing from Aider. Tests editing ability across Python, JavaScript, TypeScript, Java, C++, Go, Rust, and more.
Unusual and adversarial machine learning challenges. Tests robustness of reasoning about edge cases in ML systems.
Stanford HELM WildBench evaluation. Tests reasoning on challenging real-world tasks.
Abstraction and Reasoning Corpus. Tests fluid intelligence through novel visual pattern recognition puzzles. Often treated as a core measure of general intelligence.
ARC-AGI 2, harder sequel to ARC. More complex abstract reasoning patterns that test generalization ability beyond training data.
Competition-level math from AMC, AIME, and olympiad problems. Level 5 is the hardest tier, requiring creative problem-solving.
Mock AIME (American Invitational Mathematics Examination) problems from OTIS. Tests mathematical competition performance.
Stanford HELM evaluation of mathematical reasoning across diverse problem types.
- Type: text
- Context: N/A
- Released: Jan 2024
- License: Proprietary
- Status: benchmark-only