
LiveBench

TL;DR

A benchmark that refreshes its tasks monthly so models can't memorize them · tests reasoning, coding, math, data, and instruction following.

Level 1

LiveBench was created in 2024 to combat benchmark contamination. Frontier models increasingly see benchmark problems during training, which inflates scores. LiveBench releases new tasks every month across six categories: reasoning, coding, math, data analysis, instruction following, and language. Only recent task releases count toward scores, ensuring models can't have seen them.

Level 2

LiveBench ships new tasks monthly, and the leaderboard shows both "recent" scores (tasks from the last three months) and "all-time" scores. The six categories are weighted equally: reasoning (logic puzzles, planning), coding (novel problems beyond HumanEval), mathematics (AMC-style competition problems), data analysis (table-based questions), instruction following (strict format compliance), and language (complex writing tasks). Frontier models in 2026: Claude Mythos Preview leads "recent" at ~73%, GPT-5 at ~71%, DeepSeek V3.2 at ~65%. The "all-time" leaderboard ages out contaminated scores as older tasks are deprecated.
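Equal category weighting means the overall score is a mean of per-category means, not a mean over all tasks pooled together (which would over-weight categories with more tasks). A minimal sketch with hypothetical scores:

```python
def overall_score(category_scores):
    """Equal-weight aggregate: average each category, then average the averages."""
    per_category = [sum(s) / len(s) for s in category_scores.values()]
    return sum(per_category) / len(per_category)

# Hypothetical per-task accuracies for three of the six categories.
scores = {
    "reasoning": [0.8, 0.6],
    "coding":    [0.5, 0.7],
    "math":      [0.9, 0.7],
}
print(round(overall_score(scores), 3))  # mean of 0.7, 0.6, 0.8
```

Under this scheme a category with many tasks counts no more than one with few, so a model can't pad its overall number by excelling only where tasks are plentiful.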

Level 3

LiveBench's contamination resistance rests on a strict release cadence and objective grading. Each monthly drop contains 30-60 new tasks per category, hand-crafted by the LiveBench team and held out for ~2 months before public evaluation. Grading is deterministic · no LLM-as-judge, only verifiable answers checked against ground truth. The "recent" score averages tasks from the last 90 days. This directly addresses Schaeffer et al.'s critique that static benchmarks lose signal once their problems enter pretraining data.
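Deterministic grading can be as simple as normalizing the model's answer string and comparing it to a verifiable reference. A sketch of that idea · the `grade` helper is hypothetical and LiveBench's real scoring is task-specific:

```python
def grade(model_answer: str, reference: str) -> bool:
    """Deterministic grading sketch: normalize case and whitespace, then
    exact-match against a verifiable reference. No LLM judge involved."""
    def norm(s: str) -> str:
        return " ".join(s.strip().lower().split())
    return norm(model_answer) == norm(reference)

print(grade("  42 ", "42"))        # matches after normalization
print(grade("forty-two", "42"))    # mismatch: no fuzzy credit
```

Because the grader is a pure function of the answer strings, two runs over the same outputs always produce the same score · there is no judge-model variance to argue about.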

Why this matters now

LiveBench is replacing MMLU-Pro as the default "reasoning" benchmark cited in 2026 model launches because its freshness is harder to game.

The takeaway for you
If you are a
Researcher
  • Monthly new-task drops · contamination-resistant by design
  • Six equal-weighted categories · recent + all-time leaderboards
  • Deterministic grading, no LLM-as-judge
If you are a
Builder
  • LiveBench "recent" scores are a better signal than static benchmarks
  • Check the coding subcategory if you care about real-world code generation
  • Track the specific subcategories that match your workload
If you are an
Investor
  • LiveBench's methodology signals the benchmark-era shift after MMLU saturation
  • LiveBench "recent" scores are harder to game than static benchmarks, so labs that lead there carry more signal
  • Monthly cadence creates a news cycle · good for content
If you are a
Curious · Normie
  • The AI test that keeps changing so AI can't cheat by memorizing
  • New questions drop every month
  • A more honest view of how smart a model is today
Gecko's take

LiveBench is winning the post-MMLU era because freshness beats scale. Every serious benchmark will need an anti-contamination story by 2027.
