Tested on 25 benchmarks with a 42.3% average. Top scores: Chatbot Arena Elo — Overall (1371.4), HELM — IFEval (85.6%), Aider — Code Editing (84.2%).
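Note that the Arena figure is an Elo-style rating, not a percentage. As a point of reference only, below is a minimal Python sketch of the standard Elo expected-score and update rule; Chatbot Arena's actual leaderboard is fit statistically over many pairwise battles, so this is an illustrative assumption of how such a rating behaves, not the site's method.

```python
# Minimal sketch of the standard Elo update rule. This is NOT Chatbot Arena's
# exact scoring pipeline; it only illustrates how a rating like 1371.4 moves
# with head-to-head outcomes rather than being a percentage score.

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0) -> tuple[float, float]:
    """Update both ratings after one comparison; score_a is 1 (A wins), 0.5 (tie), or 0 (A loses)."""
    e_a = expected_score(r_a, r_b)
    r_a_new = r_a + k * (score_a - e_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return r_a_new, r_b_new

# Example: a 1371.4-rated model wins one comparison against a 1300-rated model.
print(elo_update(1371.4, 1300.0, score_a=1.0))
```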
Code editing benchmark from the Aider project. Measures ability to apply targeted code changes while maintaining correctness and style.
Multi-language code editing from Aider. Tests editing ability across Python, JavaScript, TypeScript, Java, C++, Go, Rust, and more.
Computer-aided design evaluation. Tests understanding of CAD concepts, 3D modeling, and engineering design principles.
Stanford HELM WildBench evaluation. Tests reasoning on challenging real-world tasks.
Deceptively simple questions that humans find easy but AI models often get wrong. Tests common sense and reasoning gaps.
Competition-level math from AMC, AIME, and olympiad problems. Level 5 is the hardest tier, requiring creative problem-solving.
Stanford HELM evaluation of mathematical reasoning across diverse problem types.
Mock AIME (American Invitational Mathematics Exam) problems from OTIS. Tests mathematical competition performance.
- Type: text
- Context: N/A
- Released: Jan 2024
- License: Proprietary
- Status: benchmark-only