Every AI Benchmark · Tracked
The most complete catalog of AI evaluations · every benchmark categorized, ranked across every model, and plotted against time on the frontier. Correlation analysis and live shortlists included.
The Frontier
Best score over time · one chart, every benchmark
Best for the job
Top 5 models for common tasks · rankings refresh whenever the data does · no editorial picks.
From benchmark to deployment
Follow benchmark winners into pricing, provider, model, and comparison views
By category
6 hand-curated buckets · every benchmark assigned
Reasoning
· 11 benchmarks · Multi-step logic, abstract pattern recognition, and adversarial inference tasks.
Graduate-Level Google-Proof QA (Diamond set) · expert-crafted questions in physics, biology, and chemistry that are difficult even for domain PhDs.
SimpleBench · tests fundamental reasoning capabilities with straightforward problems designed to expose gaps in basic logical and spatial thinking.
ARC-AGI-2 · the second iteration of the Abstraction and Reasoning Corpus, testing novel pattern recognition and abstract reasoning without prior training data.
ARC-AGI · the original Abstraction and Reasoning Corpus, testing whether AI can solve novel visual pattern recognition tasks without memorization.
WinoGrande · large-scale commonsense reasoning benchmark where models must resolve ambiguous pronouns in carefully constructed sentence pairs.
HellaSwag · tests commonsense reasoning by asking models to predict the most plausible continuation of everyday scenarios.
BIG-Bench Hard · a curated subset of 23 challenging tasks from BIG-Bench where language models previously failed to outperform average humans.
Chess Puzzles · tests strategic and tactical reasoning by having models solve chess puzzle positions, evaluating lookahead and pattern recognition abilities.
HLE (Humanity's Last Exam) · a reasoning benchmark designed to be the hardest public evaluation of AI. Questions span mathematics, physics, philosophy, and logic · curated to be at or beyond the frontier of human expert capability. Tested with and without tool augmentation. Claude Opus 4.7 scores 46.9% without tools and 54.7% with tools · making it one of the few benchmarks where the top score is below 60%.
Code
· 8 benchmarks · Software engineering tasks · real bug fixes, multi-language code editing, terminal proficiency.
WeirdML · tests models on unusual and adversarial machine learning tasks that require creative problem-solving beyond standard patterns.
Aider Polyglot · measures how well AI models can edit code across multiple programming languages using the Aider coding assistant framework.
Terminal-Bench 2.0 · evaluates AI agents on real terminal-based coding tasks · writing scripts, debugging, running tests, and managing projects entirely through command-line interaction. Tests both code quality and terminal fluency. Claude Opus 4.7 scores 69.4%, demonstrating significant agentic terminal competence.
SWE-bench Verified · 500 human-validated tasks from 12 real Python repositories (Django, Flask, scikit-learn, sympy, and others). Each task requires the model to produce a git patch that resolves a real GitHub issue and passes the test suite. The verified subset eliminates ambiguous tasks from the original SWE-bench. Claude Mythos Preview leads at 93.9%, crossing 90% for the first time in 2026. Opus 4.6 scores 80.8%. The benchmark remains the most-cited evaluation for code-generation capability. The apply-patch-then-test scoring loop is sketched after this list.
Cybench · evaluates AI on real Capture-The-Flag cybersecurity challenges, testing vulnerability analysis, exploitation, and security reasoning.
SWE-bench Verified (Bash Only) · a curated subset of SWE-bench where models fix real Python repository bugs using only bash commands, no agent frameworks.
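For readers unfamiliar with how SWE-bench-style tasks are scored, the core loop is: apply the model's git patch to a repository checkout, then run the issue's tests. A minimal sketch in Python · the paths, commands, and function name here are illustrative, not the official harness:

```python
import subprocess
import tempfile


def score_swebench_task(repo_dir: str, model_patch: str, test_cmd: list[str]) -> bool:
    """Apply a model-generated git patch, then run the issue's tests.

    A task counts as resolved only if the patch applies cleanly AND the
    tests pass. Illustrative sketch, not the official SWE-bench harness.
    """
    # Write the model's patch to a temp file so git can apply it.
    with tempfile.NamedTemporaryFile("w", suffix=".patch", delete=False) as f:
        f.write(model_patch)
        patch_path = f.name

    applied = subprocess.run(
        ["git", "apply", patch_path], cwd=repo_dir, capture_output=True
    )
    if applied.returncode != 0:
        return False  # patch did not apply -> task failed

    # Run the repository's test suite, e.g. ["pytest", "tests/test_issue.py"].
    tests = subprocess.run(test_cmd, cwd=repo_dir, capture_output=True)
    return tests.returncode == 0
```

The verified subset matters precisely because this loop is unforgiving: an ambiguous or flaky test suite would mark correct patches as failures.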
Math
· 5 benchmarks · Grade-school arithmetic through competition math and frontier research problems.
OTIS Mock AIME 2024-2025 · simulated American Invitational Mathematics Examination problems testing advanced problem-solving skills.
MATH Level 5 · the hardest tier of the MATH benchmark, featuring competition-level problems from AMC, AIME, and Olympiad-style mathematics.
FrontierMath (Feb 2025) · original research-level math problems created by mathematicians, testing capabilities at the boundary of current AI mathematical reasoning.
FrontierMath Tier 4 (Jul 2025) · the most challenging tier of frontier mathematics, containing problems that push the absolute limits of AI mathematical reasoning.
Multimodal
· 5 benchmarks · Vision, video, and cross-modal tasks combining text with other input types.
VPCT (Visual Pattern Completion Test) · tests visual reasoning and pattern recognition by having models complete visual sequences and transformations.
VideoMME · multimodal benchmark testing video understanding across diverse domains, requiring temporal reasoning and cross-frame comprehension.
ScienceQA · multimodal science questions spanning natural science, social science, and language science with diverse question formats and image context.
CharXiv Reasoning · tests a model's ability to understand and reason about charts, plots, and figures extracted from real arXiv scientific papers. The model must answer questions that require reading axes, comparing data series, identifying trends, and drawing conclusions from visual scientific data. This benchmark specifically measures the intersection of visual understanding and scientific reasoning · a critical capability for research assistants and document analysis. Performance varies dramatically with and without tool use, making it a key differentiator for multimodal and agentic AI systems.
CharXiv Reasoning (with tools) · the tool-augmented variant of CharXiv Reasoning. Models can use code execution, image processing, and other tools to analyze charts from arXiv papers. Claude Mythos Preview scores 93.2% with tools vs 86.1% without · demonstrating how tool use dramatically improves visual scientific reasoning. The gap between tool-augmented and bare performance is a key signal for agent capability.
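An aside on reading tool gaps: raw percentage-point differences understate progress near the ceiling, so one illustrative normalization (not BenchGecko's methodology) is the share of remaining errors that tooling eliminates:

```python
def headroom_recovered(with_tools: float, without_tools: float) -> float:
    """Fraction of the bare model's remaining error closed by tool use.

    Scores are percentages in [0, 100]. A value of 0.5 means tools fixed
    half of the mistakes the bare model made. Illustrative metric only.
    """
    if without_tools >= 100.0:
        return 0.0  # nothing left to recover
    return (with_tools - without_tools) / (100.0 - without_tools)


# CharXiv Reasoning figures quoted above for Claude Mythos Preview:
print(round(headroom_recovered(93.2, 86.1), 3))  # ~0.511
```

By this reading, tools close roughly half of the bare model's remaining errors on CharXiv Reasoning.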
Agent
· 4 benchmarks · Long-horizon tool use · multi-step autonomous tasks in realistic environments.
APEX-Agents · evaluates AI agents on complex, multi-step tasks requiring planning, tool use, and autonomous decision-making in realistic environments.
The Agent Company · tests AI agents on realistic corporate tasks like email management, code review, data analysis, and cross-tool workflows.
OSWorld · tests AI agents on real-world computer tasks across operating systems, including web browsing, file management, and application use.
GraphWalks BFS 256K-1M · a long-context reasoning benchmark created by OpenAI that tests whether a model can perform breadth-first search (BFS) traversal across massive graphs encoded in 256,000 to 1,024,000 tokens of context. The model receives a graph represented as an edge list and must follow parent-child relationships across the entire extended context window. This is not simple retrieval · it requires true relational reasoning over hundreds of thousands of tokens. The dataset includes 100 problems at 256K context and 100 at 1M context. Claude Mythos Preview leads at 80.0%, more than doubling Opus 4.6 (38.7%) and far exceeding GPT-5.4 (21.4%). The massive performance gap between models makes this one of the most discriminating benchmarks for real long-context capability in 2026.
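To make the task concrete, here is the reference computation a GraphWalks-style grader would perform, shrunk to a toy edge list · the input format is illustrative; the real benchmark embeds hundreds of thousands of edges in the prompt:

```python
from collections import deque


def bfs_order(edges: list[tuple[str, str]], root: str) -> list[str]:
    """Breadth-first traversal of a directed graph given as an edge list.

    GraphWalks-style tasks embed a (much larger) edge list in the context
    and ask the model to walk parent-child links; this is the reference
    answer a grader would compute.
    """
    children: dict[str, list[str]] = {}
    for parent, child in edges:
        children.setdefault(parent, []).append(child)

    seen, order, queue = {root}, [], deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        for nxt in children.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order


print(bfs_order([("a", "b"), ("a", "c"), ("b", "d")], "a"))  # ['a', 'b', 'c', 'd']
```

The computation itself is trivial; the benchmark's difficulty comes entirely from keeping the edge list coherent across a million tokens of context.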
Knowledge
· 95 benchmarks · Factual recall, reading comprehension, creative writing, and game-playing tasks.
Massive Multitask Language Understanding (MMLU) · 57 subjects spanning STEM, humanities, social sciences, and more. The standard benchmark for broad knowledge.
Artificial Analysis Coding Index · a composite score that aggregates performance across multiple coding benchmarks into a single index. Tracks code generation quality, debugging ability, multi-language competence, and real-world software engineering tasks. Used by Artificial Analysis to rank model coding capability in a normalized, comparable format. Useful for developers choosing between models for coding-heavy workloads (a toy aggregation formula is sketched after this list).
Artificial Analysis Agentic Index · a composite score measuring how well a model performs in agentic workflows · multi-step tool use, planning, error recovery, and autonomous task completion. Aggregates results from multiple agentic benchmarks including SWE-bench, tool-use tests, and planning evaluations. The canonical single-number metric for "how good is this model as an agent?"
Fiction.LiveBench · a continuously updated benchmark using recently published fiction to test reading comprehension and reasoning, preventing data contamination.
Lech Mazur Writing · evaluates creative writing ability, assessing prose quality, narrative coherence, and stylistic sophistication.
AI2 Reasoning Challenge · tests grade-school level science knowledge with multiple-choice questions requiring reasoning beyond simple retrieval.
SimpleQA Verified · short factual questions with verified answers, measuring factual accuracy and the tendency to hallucinate or provide incorrect information.
GeoBench · tests geographic knowledge and spatial reasoning across countries, landmarks, coordinates, and geopolitical understanding.
PIQA (Physical Interaction QA) · tests intuitive physical reasoning by asking models to select the correct approach for everyday physical tasks.
TriviaQA · reading comprehension benchmark with trivia questions, requiring models to find and reason over evidence from provided documents.
OpenBookQA · science questions that require combining a given core fact with broad common knowledge, mimicking an open-book exam setting.
DeepResearch Bench · evaluates AI on complex multi-step research tasks requiring information gathering, synthesis, and producing comprehensive analyses.
LAMBADA · measures the ability to predict the final word of a passage, requiring broad contextual understanding across long text spans.
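The two Artificial Analysis indices above aggregate heterogeneous benchmarks into one number. Their exact formula isn't reproduced here, but a common construction is min-max normalization followed by a weighted mean · a toy sketch with hypothetical inputs:

```python
def composite_index(scores: dict[str, float],
                    ranges: dict[str, tuple[float, float]],
                    weights: dict[str, float] | None = None) -> float:
    """Illustrative composite: min-max normalize each benchmark score to
    [0, 1] against its observed score range, then take a weighted mean
    on a 0-100 scale. Not the actual Artificial Analysis formula.
    """
    weights = weights or {name: 1.0 for name in scores}
    total = sum(weights[name] for name in scores)
    normalized = 0.0
    for name, raw in scores.items():
        lo, hi = ranges[name]
        normalized += weights[name] * (raw - lo) / (hi - lo)
    return 100 * normalized / total


# Hypothetical inputs: two coding benchmarks, both scored 0-100.
print(composite_index(
    {"swe_bench_verified": 80.8, "terminal_bench": 69.4},
    {"swe_bench_verified": (0, 100), "terminal_bench": (0, 100)},
))  # 75.1
```

Normalizing first is what makes scores on different scales (pass rates, Elo, accuracy) comparable inside one index.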
Frequently asked
Answers pulled from the dataset · updated daily
How many benchmarks does BenchGecko track?
BenchGecko tracks 128 AI benchmarks across 6 categories · 95 knowledge, 11 reasoning, 8 code, 5 multimodal, 5 math, 4 agent. 2,893 total model-benchmark scores from 303 models.
Which benchmark is the hardest right now?
GPQA has the lowest mean score of the 128 benchmarks we track · 6.7 averaged across the 67 models tested on it · making it the hardest benchmark by that measure.
Which benchmark has the tightest race at the top?
Chatbot Arena Elo · Coding has the lowest standard deviation across its top 5 models (σ=38.4) · the leaders sit closest together, and the top spot swings between releases.
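The tightness statistic is simple: the standard deviation of a benchmark's five best scores, where lower means a more tightly packed leaderboard. A minimal sketch with hypothetical Elo values:

```python
import statistics


def top5_sigma(scores: list[float]) -> float:
    """Population std-dev of a benchmark's five best scores.

    Lower sigma = the top models are packed closer together = a tighter
    race. Illustrative metric; BenchGecko's exact computation may differ.
    """
    top5 = sorted(scores, reverse=True)[:5]
    return statistics.pstdev(top5)


# Hypothetical Elo scores for the top 5 models on one benchmark:
print(round(top5_sigma([1502.8, 1490.1, 1488.0, 1471.3, 1430.5]), 1))  # ~25.1
```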
Which benchmark has the most model coverage?
Chatbot Arena Elo · Overall has scores for 113 models · the most widely reported benchmark we track. The leader is Claude Opus 4.6 (Fast) at 1502.8.
How often is this data updated?
BenchGecko pulls benchmark scores from Epoch AI and cross-references against model release dates. The dataset is refreshed daily. Latest model release tracked · 2026-04-30.
Where do the benchmark scores come from?
Scores are sourced from Epoch AI benchmarks (CC-BY licensed), SWE-bench public leaderboards, and provider-published evals. Each benchmark detail page links to the original source.
See also
Keep exploring the BenchGecko graph