Tested on 8 benchmarks with a 41.0% average. Top scores: Chatbot Arena Elo — Overall (1374.2), Lech Mazur Writing (72.9%), MATH Level 5 (67.2%).
Multi-language code-editing benchmark from Aider. Tests editing ability across Python, JavaScript, TypeScript, Java, C++, Go, Rust, and more.
Competition-level math from AMC, AIME, and olympiad problems. Level 5 is the hardest tier, requiring creative problem-solving.
Mock AIME (American Invitational Mathematics Exam) problems from OTIS. Tests mathematical competition performance.
Original research-level math problems created by professional mathematicians. Problems are unpublished and cannot be memorized.
Writing quality evaluation by Lech Mazur. Tests prose quality, coherence, and stylistic ability.
LiveBench fiction analysis. Tests literary comprehension and creative text understanding.
Graduate-level science questions written by PhD experts. The Diamond subset contains the hardest questions — those that experts answer correctly but skilled non-experts do not — testing deep understanding.
- Type: text
- Context: N/A
- Released: Jan 2024
- License: Open Source
- Status: benchmark-only