Tested on 7 benchmarks, with an average score of 34.8%. Top scores: OTIS Mock AIME 2024-2025 (85.0%), GPQA diamond (78.9%), SimpleQA Verified (26.0%).
Mock AIME (American Invitational Mathematics Examination) problems from OTIS. Tests performance on competition mathematics.
Original research-level math problems created by professional mathematicians. Problems are unpublished and cannot be memorized.
Hardest tier of FrontierMath. Problems sit at the frontier of human mathematical ability, and many are beyond what most mathematicians can solve.
Graduate-level science questions written by PhD-level domain experts. The Diamond subset keeps only questions that expert validators answered correctly and most non-expert validators answered incorrectly, testing deep understanding.
Simple factual questions with verified correct answers. Tests accuracy of basic knowledge retrieval. Low scores can indicate hallucination or missing knowledge.
Tactical chess puzzles testing pattern recognition and multi-move calculation. Measures strategic reasoning ability.
Agent performance evaluation testing multi-step tool use, planning, and execution in realistic environments.
- Type: text
- Context: N/A
- Released: Jan 2024
- License: Open Source
- Status: benchmark-only