Research note · Live dataset

Benchmark Saturation Watch

A live BenchGecko note tracking where public AI benchmarks are crowded enough to support comparison, where coverage is still thin, and where leaderboard claims need stronger evidence.

Dataset date: May 6, 2026 (BenchGecko generated data)
Tracked benchmarks: 128 (126 with source links)
Score cells: 2,570 clean benchmark score records
Crowded records: 47 (25 or more scored models)
Sparse records: 51 (1 to 9 scored models)
Finding 01

The index is useful, but coverage is uneven.

243 of 994 tracked models currently have at least one clean benchmark score. Only 100 models have 10 or more clean score records, so broad model rankings should still display coverage alongside position.

Finding 02

Crowded benchmarks are the first saturation candidates.

47 benchmark records now have 25 or more scored models. These are strong places to watch for score clustering, benchmark gaming, and whether newer frontier releases still separate from the field.

Finding 03

Sparse benchmarks still matter, but they need caution labels.

51 benchmark records have fewer than 10 scored models. They can be directional, especially for specialist tasks, but should not carry the same confidence as crowded evaluations.
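The two thresholds above define a simple coverage-tier rule. A minimal sketch of that rule, assuming a record is just a benchmark name paired with its scored-model count (the record structure and names here are illustrative, not BenchGecko's schema; the middle 10-to-24 tier is implied by the note's two stated cutoffs):

```python
# Classify benchmark records by coverage density, using the note's
# thresholds: crowded = 25 or more scored models, sparse = 1 to 9,
# with an implied middle tier of 10 to 24.

def coverage_tier(scored_models: int) -> str:
    """Map a benchmark's scored-model count to a saturation tier."""
    if scored_models >= 25:
        return "crowded"
    if scored_models >= 10:
        return "mid"
    if scored_models >= 1:
        return "sparse"
    return "empty"

# Hypothetical counts for illustration only.
records = {"BroadEval": 48, "NicheEval": 4, "MidEval": 14}
tiers = {name: coverage_tier(n) for name, n in records.items()}
print(tiers)  # {'BroadEval': 'crowded', 'NicheEval': 'sparse', 'MidEval': 'mid'}
```

The tier is a density label only; as the next section notes, it says nothing by itself about score clustering, source quality, or contamination risk.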

What this note means

How to read benchmark saturation without overstating it.

Benchmarks are not equally mature. A benchmark with many scored models can support stronger comparative claims than a benchmark with two or three visible results.

Saturation here means coverage density first. It does not automatically mean a benchmark is solved. The next layer is score clustering, source quality, contamination risk, and task relevance.

The practical rule is simple: show benchmark coverage beside model rank, use sparse records as directional evidence, and keep deployment choices tied to price, latency, context, provider reliability, and compute constraints.
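The first half of that rule can be sketched directly: rank models by score but print the record count next to each position, flagging thin evidence. The rows, scores, and the 10-record flag threshold below are illustrative assumptions, not BenchGecko output:

```python
# Hypothetical leaderboard rows: (model, mean score, clean score records).
# Showing the record count beside rank keeps thin evidence visible.

rows = [
    ("model-a", 81.2, 32),
    ("model-b", 79.5, 6),
    ("model-c", 76.1, 21),
]

ranked = sorted(rows, key=lambda r: r[1], reverse=True)
for rank, (model, score, n_records) in enumerate(ranked, start=1):
    flag = "directional only" if n_records < 10 else "comparable"
    print(f"{rank}. {model}  score={score:.1f}  records={n_records}  ({flag})")
```

Here model-b ranks second but carries only 6 clean records, so its position reads as directional evidence rather than a settled comparison.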

Next research paths

Benchmark coverage is one layer. Deployment choice needs the rest.

Use the benchmark index for evidence, model pages for score context, pricing pages for API cost, and compute pages for infrastructure pressure.