Tested on 8 benchmarks with 59.0% average. Top scores: OTIS Mock AIME 2024-2025 (88.9%), GPQA diamond (86.4%), SimpleQA Verified (66.3%).
Mock AIME (American Invitational Mathematics Examination) problems from OTIS. Tests performance on competition mathematics.
Original research-level math problems created by professional mathematicians. Problems are unpublished, so a model cannot have memorized them from training data.
Hardest tier of FrontierMath. Problems sit at the frontier of human mathematical ability, and many are beyond what most mathematicians can solve.
Graduate-level science questions written by PhD-level domain experts. The Diamond subset keeps only questions that experts answer correctly but most non-experts get wrong, testing deep understanding.
Simple factual questions with verified correct answers. Tests accuracy of basic knowledge retrieval. Low scores suggest a tendency to hallucinate.
Artificial Analysis Agentic Index. Composite score measuring agent capability across tool use and planning tasks.
Artificial Analysis Quality Index. Composite quality score combining multiple benchmark results into a single metric.
Artificial Analysis Coding Index. Composite coding quality score from multiple code benchmarks.
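The composite indices above each fold several benchmark results into one number. The exact weighting is not stated here, so the following is only a minimal sketch that assumes an unweighted mean; the function name `composite_index` and the scores in `example_scores` are hypothetical and are not taken from this page.

```python
from statistics import mean


def composite_index(scores: dict[str, float]) -> float:
    """Return the unweighted mean of per-benchmark scores (in percent).

    Assumption: equal weights. A real index may weight benchmarks differently.
    """
    if not scores:
        raise ValueError("need at least one benchmark score")
    return round(mean(scores.values()), 1)


# Hypothetical scores for illustration only.
example_scores = {"benchmark_a": 80.0, "benchmark_b": 60.0, "benchmark_c": 40.0}
print(composite_index(example_scores))
```

An equal-weight mean is the simplest choice; a weighted variant would multiply each score by its weight and divide by the sum of the weights.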
- Type: text
- Context: N/A
- Released: Jan 2024
- License: Proprietary
- Status: benchmark-only