LiveBench
A benchmark that refreshes its tasks monthly so models can't memorize them · tests reasoning, coding, math, data, and instruction following.
Basic
LiveBench was created in 2024 to combat benchmark contamination. Frontier models increasingly see benchmark problems during training, which inflates scores. LiveBench releases new tasks every month across six categories: reasoning, coding, math, data analysis, instruction following, and language. Only recent task releases count toward scores, ensuring models can't have seen them.
Deep
LiveBench ships monthly new-task releases. The leaderboard shows both "recent" scores (last 3 months) and "all-time" scores. Categories are weighted equally: reasoning (logic puzzles, planning), coding (novel problems beyond HumanEval), mathematics (AMC-style), data analysis (table-based questions), instruction following (strict format compliance), and language (complex writing). Frontier models in 2026: Claude Mythos Preview leads "recent" at ~73%, GPT-5 at ~71%, DeepSeek V3.2 at ~65%. The "all-time" leaderboard ages out contaminated scores as tasks are deprecated.
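To make the scoring mechanics concrete, here is a minimal sketch of computing equal-weighted "recent" and "all-time" averages from per-task results. The task records, dates, and field layout are hypothetical placeholders for illustration, not LiveBench's actual data format.

```python
from datetime import date, timedelta
from statistics import mean

# Hypothetical per-task results: (category, release_date, score in [0, 1]).
# Values are illustrative only, not real LiveBench data.
results = [
    ("reasoning", date(2026, 1, 5), 0.80),
    ("coding", date(2026, 1, 5), 0.65),
    ("math", date(2025, 6, 10), 0.72),
    ("data_analysis", date(2026, 2, 2), 0.70),
    ("instruction_following", date(2026, 2, 2), 0.88),
    ("language", date(2025, 9, 15), 0.61),
]

def overall_score(rows, since=None):
    """Equal-weighted average: mean the tasks within each category first,
    then mean the category averages."""
    if since is not None:
        rows = [r for r in rows if r[1] >= since]
    by_cat = {}
    for cat, _, score in rows:
        by_cat.setdefault(cat, []).append(score)
    return mean(mean(scores) for scores in by_cat.values())

today = date(2026, 3, 1)  # evaluation date (illustrative)
recent = overall_score(results, since=today - timedelta(days=90))
all_time = overall_score(results)
print(f"recent: {recent:.3f}, all-time: {all_time:.3f}")
```

Averaging within each category before averaging across categories keeps a category with many tasks from dominating the overall number, which is what equal weighting implies.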
Expert
LiveBench's contamination-resistance works through strict release cadence and objective grading. Each monthly drop contains 30-60 new tasks per category, hand-crafted by the LiveBench team and held out for ~2 months before public evaluation. Grading is deterministic · no LLM-as-judge, only verifiable answers. The "recent" score averages the last 90 days of tasks. This directly addresses Schaeffer et al.'s critique that static benchmarks lose signal once they're in pretraining data.
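As an illustration of what deterministic, verifiable grading looks like in practice, here is a minimal exact-match grader sketch. The normalization rules and example answers are assumptions made for this example, not LiveBench's actual grading code.

```python
def normalize(answer: str) -> str:
    """Canonicalize an answer so grading is a pure string/number comparison
    rather than a judgment call (illustrative rules only)."""
    s = answer.strip().lower().rstrip(".")
    try:
        return str(float(s))  # "42" and "42.0" compare equal
    except ValueError:
        return " ".join(s.split())  # collapse internal whitespace

def grade(model_answer: str, reference_answer: str) -> float:
    """Deterministic 0/1 grade: no LLM-as-judge, only comparison against a
    verifiable reference answer."""
    return 1.0 if normalize(model_answer) == normalize(reference_answer) else 0.0

assert grade("42.0", "42") == 1.0
assert grade(" Paris. ", "paris") == 1.0
assert grade("43", "42") == 0.0
```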
LiveBench is replacing MMLU-Pro as the default "reasoning" benchmark cited in 2026 model launches because its freshness is harder to game.
Depending on why you're here
- Monthly new-task drops · contamination-resistant by design
- Six equal-weighted categories · recent + all-time leaderboards
- Deterministic grading, no LLM-as-judge
- LiveBench recent scores are a better signal than static benchmarks
- Check the coding subcategory if you care about real-world code generation
- Track specific subcategories that match your workload
- LiveBench's methodology marks the benchmark-era shift after MMLU saturation
- Labs optimizing for LiveBench "recent" have far less room to game scores than labs targeting static benchmarks
- Monthly cadence creates a news cycle · good for content
- The AI test that keeps changing so AI can't cheat by memorizing
- New questions drop every month
- More honest view of how smart a model is today
LiveBench is winning the post-MMLU era because freshness beats scale. Every serious benchmark will need an anti-contamination story by 2027.
Read the primary sources
- LiveBench paper (2024) · arxiv.org
- Live leaderboard · livebench.ai