Research note · Live dataset

Benchmark Saturation Watch

A live BenchGecko note tracking where public AI benchmarks are crowded enough to support comparison, where coverage is still thin, and where leaderboard claims need stronger evidence.

Dataset date: May 6, 2026 (BenchGecko generated data)
Tracked benchmarks: 128 (126 with source links)
Score cells: 2,570 clean benchmark score records
Crowded records: 47 (25 or more scored models)
Sparse records: 51 (1 to 9 scored models)
Finding 01

The index is useful, but coverage is uneven.

243 of 994 tracked models currently have at least one clean benchmark score. Only 100 models have 10 or more clean score records, so broad model rankings should still display coverage alongside position.

Finding 02

Crowded benchmarks are the first saturation candidates.

47 benchmark records now have 25 or more scored models. These are strong places to watch for score clustering, benchmark gaming, and whether newer frontier releases still separate from the field.

Finding 03

Sparse benchmarks still matter, but they need caution labels.

51 benchmark records have fewer than 10 scored models. They can be directional, especially for specialist tasks, but should not carry the same confidence as crowded evaluations.
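The two thresholds above define a simple coverage-tier rule. A minimal sketch of that rule, assuming a record is just a benchmark name paired with its scored-model count (the record structure and names here are illustrative, not BenchGecko's schema; the middle 10-to-24 tier is implied by the note's two stated cutoffs):

```python
# Classify benchmark records by coverage density, using the note's
# thresholds: crowded = 25 or more scored models, sparse = 1 to 9,
# with an implied middle tier of 10 to 24.

def coverage_tier(scored_models: int) -> str:
    """Map a benchmark's scored-model count to a saturation tier."""
    if scored_models >= 25:
        return "crowded"
    if scored_models >= 10:
        return "mid"
    if scored_models >= 1:
        return "sparse"
    return "empty"

# Hypothetical counts for illustration only.
records = {"BroadEval": 48, "NicheEval": 4, "MidEval": 14}
tiers = {name: coverage_tier(n) for name, n in records.items()}
print(tiers)  # {'BroadEval': 'crowded', 'NicheEval': 'sparse', 'MidEval': 'mid'}
```

The tier is a density label only; as the next section notes, it says nothing by itself about score clustering, source quality, or contamination risk.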

What this note means

How to read benchmark saturation without overstating it.

Benchmarks are not equally mature. A benchmark with many scored models can support stronger comparative claims than a benchmark with two or three visible results.

Saturation here means coverage density first. It does not automatically mean a benchmark is solved. The next layer is score clustering, source quality, contamination risk, and task relevance.

The practical rule is simple: show benchmark coverage beside model rank, use sparse records as directional evidence, and keep deployment choices tied to price, latency, context, provider reliability, and compute constraints.
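The first half of that rule can be sketched directly: rank models by score but print the record count next to each position, flagging thin evidence. The rows, scores, and the 10-record flag threshold below are illustrative assumptions, not BenchGecko output:

```python
# Hypothetical leaderboard rows: (model, mean score, clean score records).
# Showing the record count beside rank keeps thin evidence visible.

rows = [
    ("model-a", 81.2, 32),
    ("model-b", 79.5, 6),
    ("model-c", 76.1, 21),
]

ranked = sorted(rows, key=lambda r: r[1], reverse=True)
for rank, (model, score, n_records) in enumerate(ranked, start=1):
    flag = "directional only" if n_records < 10 else "comparable"
    print(f"{rank}. {model}  score={score:.1f}  records={n_records}  ({flag})")
```

Here model-b ranks second but carries only 6 clean records, so its position reads as directional evidence rather than a settled comparison.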

Next research paths

Benchmark coverage is one layer. Deployment choice needs the rest.

Use the benchmark index for evidence, model pages for score context, pricing pages for API cost, and compute pages for infrastructure pressure.