Benchmarks · Evaluations
Live · 128 benchmarks · 2,893 scores · 303 models

Every AI Benchmark · Tracked

The most complete catalog of AI evaluations · every benchmark categorized, ranked across every model, and plotted against time on the frontier. Correlation analysis and live shortlists included.

Frontier tracked · Cross-model scored · Correlation analysis · Refreshed daily
Compare models

Best score over time · one chart, every benchmark

Chart: Chatbot Arena Elo · Overall · 98 models · frontier running max · score vs. release date, Jun 24–Apr 26 · benchgecko.ai/benchmark/arena-elo-overall
Frontier on Chatbot Arena Elo · Overall rose from 1358.2 to 1502.8 in 16 months · +144.6 points · latest leader Claude Opus 4.6 (Fast) from Anthropic.
Pink dots = frontier records · 8 total
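The frontier series is just a running maximum over scores ordered by release date. A minimal sketch of that computation · the (release_date, model, score) layout and the sample values are illustrative assumptions, not BenchGecko's actual schema or data.

```python
from datetime import date

# Illustrative (release_date, model, score) records, not real BenchGecko data.
scores = [
    (date(2024, 6, 1), "model-a", 1358.2),
    (date(2024, 12, 1), "model-b", 1340.0),  # below the running best, not a record
    (date(2025, 5, 1), "model-c", 1421.5),
    (date(2026, 4, 30), "model-d", 1502.8),
]

def frontier(points):
    """Keep only the points that set a new best score over time."""
    best = float("-inf")
    records = []
    for released, model, score in sorted(points):  # oldest release first
        if score > best:                           # new frontier record
            best = score
            records.append((released, model, score))
    return records

for released, model, score in frontier(scores):
    print(released.isoformat(), model, score)
```

Each point the loop keeps corresponds to one pink frontier-record dot on the chart.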

Top 5 models for common tasks · rankings refresh every time the data does, no editorial picks.

Follow benchmark winners into pricing, provider, model, and comparison views

6 hand-curated buckets · every benchmark assigned

Reasoning · 11 benchmarks

Multi-step logic, abstract pattern recognition, and adversarial inference tasks.

Code · 8 benchmarks

Software engineering tasks · real bug fixes, multi-language code editing, terminal proficiency.

Math · 5 benchmarks

Grade-school arithmetic through competition math and frontier research problems.

Multimodal · 5 benchmarks

Vision, video, and cross-modal tasks combining text with other input types.

Agents · 4 benchmarks

Long-horizon tool use · multi-step autonomous tasks in realistic environments.

Knowledge · 95 benchmarks

Factual recall, reading comprehension, creative writing, and game-playing tasks.

Answers pulled from the dataset · updated daily

How many benchmarks does BenchGecko track?

BenchGecko tracks 128 AI benchmarks across 6 categories · 95 knowledge, 8 code, 11 reasoning, 4 agent, 5 multimodal, 5 math. 2,893 total model-benchmark scores from 303 models.
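A quick sanity check on the breakdown above · the per-category counts are copied from the answer, and the sketch only confirms they sum to the advertised 128 benchmarks.

```python
# Per-category benchmark counts, copied from the answer above.
benchmarks_per_category = {
    "knowledge": 95,
    "code": 8,
    "reasoning": 11,
    "agent": 4,
    "multimodal": 5,
    "math": 5,
}

# 95 + 8 + 11 + 4 + 5 + 5 = 128
assert sum(benchmarks_per_category.values()) == 128
```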

Which benchmark is the hardest right now?

GPQA has the lowest average score at 6.7 · 67 models have been tested on it. This is the hardest of the 128 benchmarks we track based on mean score.
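The "hardest" ranking is a plain mean over every reported score on each benchmark. A minimal sketch of that aggregation, assuming flat (benchmark, model, score) records · field names and sample values are illustrative, not the real dataset.

```python
from collections import defaultdict

# Illustrative (benchmark, model, score) records.
scores = [
    ("gpqa", "model-a", 5.1),
    ("gpqa", "model-b", 8.3),
    ("mmlu", "model-a", 71.0),
    ("mmlu", "model-b", 64.5),
]

by_benchmark = defaultdict(list)
for benchmark, _model, score in scores:
    by_benchmark[benchmark].append(score)

# Hardest benchmark = lowest mean score across the models tested on it.
hardest = min(by_benchmark, key=lambda b: sum(by_benchmark[b]) / len(by_benchmark[b]))
print(hardest, sum(by_benchmark[hardest]) / len(by_benchmark[hardest]))
```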

Which benchmark has the most volatile race at the top?

Chatbot Arena Elo · Coding has the highest standard deviation within its top 5 models (σ=38.4) · the lead there is hotly contested and swings the most between releases.
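The spread statistic quoted above is the standard deviation of a benchmark's five best scores. A minimal sketch, assuming a population standard deviation and illustrative Elo-style values rather than BenchGecko's real leaderboard.

```python
from statistics import pstdev

def top5_spread(benchmark_scores):
    """Population std dev of the five best scores on one benchmark."""
    top5 = sorted(benchmark_scores, reverse=True)[:5]
    return pstdev(top5)

# Illustrative Elo-style scores for a single benchmark.
print(round(top5_spread([1502.8, 1488.1, 1476.0, 1463.3, 1451.9, 1398.2]), 1))
```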

Which benchmark has the most model coverage?

Chatbot Arena Elo · Overall has scores for 113 models · the most widely reported benchmark we track. The leader is Claude Opus 4.6 (Fast) at 1502.8.
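Coverage here is simply the count of distinct models with a reported score on each benchmark. A minimal sketch over assumed (benchmark, model) pairs.

```python
from collections import defaultdict

# Illustrative (benchmark, model) pairs with at least one reported score.
reported = [
    ("arena-elo-overall", "model-a"),
    ("arena-elo-overall", "model-b"),
    ("gpqa", "model-a"),
]

coverage = defaultdict(set)
for benchmark, model in reported:
    coverage[benchmark].add(model)  # distinct models per benchmark

widest = max(coverage, key=lambda b: len(coverage[b]))
print(widest, len(coverage[widest]))
```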

How often is this data updated?

BenchGecko pulls benchmark scores from Epoch AI and cross-references against model release dates. The dataset is refreshed daily. Latest model release tracked · 2026-04-30.
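The cross-reference step amounts to joining each score to its model's release date so the point can be placed on the frontier timeline. A minimal sketch · the field names and the join logic are assumptions, not BenchGecko's actual pipeline.

```python
# Hypothetical lookup of model release dates (ISO strings).
release_dates = {"model-a": "2026-04-30", "model-b": "2025-10-12"}

# Hypothetical raw score rows as pulled from an upstream source.
raw_scores = [
    {"model": "model-a", "benchmark": "example-bench", "score": 71.2},
    {"model": "model-b", "benchmark": "example-bench", "score": 64.0},
    {"model": "model-x", "benchmark": "example-bench", "score": 50.0},  # no known release date
]

# Attach a release_date to every row we can date; drop the rest.
dated = [
    {**row, "release_date": release_dates[row["model"]]}
    for row in raw_scores
    if row["model"] in release_dates
]
print(dated)
```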

Where do the benchmark scores come from?

Scores are sourced from Epoch AI benchmarks (CC-BY licensed), SWE-bench public leaderboards, and provider-published evals. Each benchmark detail page links to the original source.

Keep exploring the BenchGecko graph