Tested on 9 benchmarks with a 41.5% average score. Top scores: Chatbot Arena Elo — Overall (1387.7 Elo), MATH level 5 (81.7%), Aider — Code Editing (79.7%).
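For context on how a headline figure like this is usually aggregated, here is a minimal sketch: a plain mean over the percentage-scored benchmarks, with the Chatbot Arena Elo kept separate because it is a rating rather than a percentage. Apart from the two scores quoted above, the values are placeholders, not actual results.

```python
# Minimal sketch of the headline aggregation: a plain mean over the
# percentage-scored benchmarks. The Chatbot Arena Elo is a rating, not a
# percentage, so it is reported on its own rather than folded into the mean.
percent_scores = {
    "Aider - Code Editing": 79.7,  # quoted in the summary above
    "MATH level 5": 81.7,          # quoted in the summary above
    # ...the remaining percentage-scored benchmarks would be listed here
}
elo_ratings = {"Chatbot Arena Elo - Overall": 1387.7}

average = sum(percent_scores.values()) / len(percent_scores)
print(f"Average over {len(percent_scores)} benchmarks: {average:.1f}%")
print(f"Chatbot Arena Elo - Overall: {elo_ratings['Chatbot Arena Elo - Overall']:.1f}")
```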
Code editing benchmark from the Aider project. Measures ability to apply targeted code changes while maintaining correctness and style.
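As a rough illustration of the kind of edit this benchmark scores (the file name and snippets are hypothetical, and this is not Aider's actual edit format or harness), here is a small targeted change rendered as a unified diff:

```python
import difflib

# Hypothetical before/after versions of a small function; the edit changes
# only the targeted line while keeping the surrounding style intact.
before = "def greet(name):\n    print('Hello ' + name)\n"
after = "def greet(name):\n    print(f'Hello {name}')\n"

diff = difflib.unified_diff(
    before.splitlines(keepends=True),
    after.splitlines(keepends=True),
    fromfile="greet.py",  # hypothetical file name
    tofile="greet.py",
)
print("".join(diff))
```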
Unusual and adversarial machine learning challenges. Tests robustness of reasoning about edge cases in ML systems.
Capture-the-flag cybersecurity challenges. Tests vulnerability analysis, reverse engineering, cryptography, and exploitation skills.
Deceptively simple questions that humans find easy but AI models often get wrong. Tests common sense and exposes reasoning gaps.
Abstraction and Reasoning Corpus. Tests fluid intelligence through novel visual pattern-recognition puzzles. Intended as a core measure of general intelligence.
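For a concrete sense of the format, ARC tasks are published as JSON with "train" and "test" lists of input/output integer grids, where each integer encodes a color. A toy example in that structure (the transformation, swapping colors 1 and 2, is a made-up placeholder, not a real ARC puzzle):

```python
# Toy task in the ARC JSON structure: "train" and "test" pairs of integer
# grids, where each integer encodes a color. The rule here (swap colors
# 1 and 2) is a made-up placeholder, not an actual ARC puzzle.
toy_task = {
    "train": [
        {"input": [[1, 2], [2, 1]], "output": [[2, 1], [1, 2]]},
        {"input": [[1, 1], [2, 2]], "output": [[2, 2], [1, 1]]},
    ],
    "test": [
        {"input": [[2, 1], [1, 1]]},  # solver must produce [[1, 2], [2, 2]]
    ],
}
```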
Competition-level math from AMC, AIME, and olympiad problems. Level 5 is the hardest tier, requiring creative problem-solving.
Mock AIME (American Invitational Mathematics Exam) problems from OTIS. Tests mathematical competition performance.
- Type: text
- Context: N/A
- Released: Jan 2024
- License: Proprietary
- Status: benchmark-only