
LiveBench

TL;DR

A benchmark that refreshes its tasks monthly so models can't memorize them · tests reasoning, coding, math, data, and instruction following.

Level 1

LiveBench was created in 2024 to combat benchmark contamination. Frontier models increasingly see benchmark problems during training, which inflates scores. LiveBench releases new tasks every month across six categories: reasoning, coding, math, data analysis, instruction following, and language. Only recent task releases count toward scores, ensuring models can't have seen them.

Level 2

LiveBench ships new tasks monthly, and the leaderboard shows both "recent" scores (tasks from the last three months) and "all-time" scores. The six categories are weighted equally: reasoning (logic puzzles, planning), coding (novel problems beyond HumanEval), mathematics (AMC-style competition problems), data analysis (table-based questions), instruction following (strict format compliance), and language (complex writing tasks). Frontier models in 2026: Claude Mythos Preview leads "recent" at ~73%, GPT-5 at ~71%, DeepSeek V3.2 at ~65%. The "all-time" leaderboard ages out contaminated scores as older tasks are deprecated.
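Equal category weighting means the overall score is a mean of per-category means, not a mean over all tasks pooled together (which would over-weight categories with more tasks). A minimal sketch with hypothetical scores:

```python
def overall_score(category_scores):
    """Equal-weight aggregate: average each category, then average the averages."""
    per_category = [sum(s) / len(s) for s in category_scores.values()]
    return sum(per_category) / len(per_category)

# Hypothetical per-task accuracies for three of the six categories.
scores = {
    "reasoning": [0.8, 0.6],
    "coding":    [0.5, 0.7],
    "math":      [0.9, 0.7],
}
print(round(overall_score(scores), 3))  # mean of 0.7, 0.6, 0.8
```

Under this scheme a category with many tasks counts no more than one with few, so a model can't pad its overall number by excelling only where tasks are plentiful.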

Level 3

LiveBench's contamination resistance rests on a strict release cadence and objective grading. Each monthly drop contains 30-60 new tasks per category, hand-crafted by the LiveBench team and held out for ~2 months before public evaluation. Grading is deterministic · no LLM-as-judge, only verifiable answers checked against ground truth. The "recent" score averages tasks from the last 90 days. This directly addresses Schaeffer et al.'s critique that static benchmarks lose signal once their problems enter pretraining data.
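Deterministic grading can be as simple as normalizing the model's answer string and comparing it to a verifiable reference. A sketch of that idea · the `grade` helper is hypothetical and LiveBench's real scoring is task-specific:

```python
def grade(model_answer: str, reference: str) -> bool:
    """Deterministic grading sketch: normalize case and whitespace, then
    exact-match against a verifiable reference. No LLM judge involved."""
    def norm(s: str) -> str:
        return " ".join(s.strip().lower().split())
    return norm(model_answer) == norm(reference)

print(grade("  42 ", "42"))        # matches after normalization
print(grade("forty-two", "42"))    # mismatch: no fuzzy credit
```

Because the grader is a pure function of the answer strings, two runs over the same outputs always produce the same score · there is no judge-model variance to argue about.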

Why this matters now

LiveBench is replacing MMLU-Pro as the default "reasoning" benchmark cited in 2026 model launches because its freshness is harder to game.

The takeaway for you
If you are a
Researcher
  • Monthly new-task drops · contamination-resistant by design
  • Six equal-weighted categories · recent + all-time leaderboards
  • Deterministic grading, no LLM-as-judge
If you are a
Builder
  • LiveBench "recent" scores are a better signal than static benchmarks
  • Check the coding subcategory if you care about real-world code generation
  • Track the specific subcategories that match your workload
If you are an
Investor
  • LiveBench's methodology signals the benchmark-era shift after MMLU saturation
  • LiveBench "recent" scores are harder to game than static benchmarks, so labs that lead there carry more signal
  • Monthly cadence creates a news cycle · good for content
If you are a
Curious · Normie
  • The AI test that keeps changing so AI can't cheat by memorizing
  • New questions drop every month
  • A more honest view of how smart a model is today
Gecko's take

LiveBench is winning the post-MMLU era because freshness beats scale. Every serious benchmark will need an anti-contamination story by 2027.
