Chatbot Arena
A blind pairwise AI model comparison where humans vote · outputs are anonymous and Elo ratings produce the leaderboard.
Basic
Chatbot Arena (now LM Arena) is a public platform where anyone can type a prompt and receive responses from two anonymous AI models side by side. The user picks a winner. Millions of votes feed an Elo-style rating system. The resulting leaderboard is widely cited as the consumer-preference benchmark · the gap between scores reflects how often humans prefer one model over another on real queries.
Deep
Chatbot Arena (LMSYS, UC Berkeley) has collected 15M+ votes since 2023. The rating is Bradley-Terry Elo updated in real time. Models are presented anonymously as "Model A" vs "Model B"; users choose A, B, tie, or both bad. Category filters (coding, reasoning, hard prompts, math) slice the leaderboard by query type. The overall leaderboard is dominated by general-purpose chat quality; specialized categories rank differently. Methodological concerns: voter self-selection bias, short queries dominating, style-over-substance preferences. Despite critiques, Arena is the closest thing to a real-world preference measure in 2026.
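The per-vote mechanics can be sketched with a classic Elo-style update. This is a minimal illustration, not LMSYS's actual code: the K-factor, starting ratings, and model names are invented for the example.

```python
def expected_score(r_a: float, r_b: float) -> float:
    # Probability that A beats B under the logistic Elo model.
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, outcome: float, k: float = 4.0):
    # outcome: 1.0 = A wins, 0.0 = B wins, 0.5 = tie.
    # The update is zero-sum: A's gain is B's loss.
    e_a = expected_score(r_a, r_b)
    delta = k * (outcome - e_a)
    return r_a + delta, r_b - delta

# Hypothetical vote: both models start at 1000, "model_a" wins.
ratings = {"model_a": 1000.0, "model_b": 1000.0}
ratings["model_a"], ratings["model_b"] = elo_update(
    ratings["model_a"], ratings["model_b"], outcome=1.0)
```

A small K (here 4) keeps individual votes from swinging ratings much, which matters when millions of noisy votes arrive in a stream.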
Expert
The Bradley-Terry model estimates each model's strength θ such that P(A beats B) = 1 / (1 + e^(θ_B - θ_A)). Online updates use stochastic gradient on cross-entropy loss of the win-loss matrix. Confidence intervals via bootstrap resampling · typically ±5-15 Elo at 90% CI. Category leaderboards use the same model but subset prompts by classifier-tagged category. Known biases: verbosity bias (longer responses preferred), markdown/formatting bias, and language-distribution bias. Arena Hard (a curated subset) attempts to control for these with harder prompts.
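The Bradley-Terry fit and bootstrap intervals described above can be sketched end to end on synthetic votes. Everything here is an assumption for illustration (three hypothetical models, the learning rate, step count, and resample count) — it shows the technique, not the Arena pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate pairwise votes among 3 hypothetical models with known strengths.
true_theta = np.array([1.0, 0.0, -1.0])
pairs = rng.integers(0, 3, size=(5000, 2))
pairs = pairs[pairs[:, 0] != pairs[:, 1]]          # drop self-matches
p_first = 1 / (1 + np.exp(true_theta[pairs[:, 1]] - true_theta[pairs[:, 0]]))
first_won = rng.random(len(pairs)) < p_first
winners = np.where(first_won, pairs[:, 0], pairs[:, 1])
losers = np.where(first_won, pairs[:, 1], pairs[:, 0])

def fit_bt(winners, losers, n_models=3, lr=0.5, steps=500):
    """Gradient ascent on the Bradley-Terry log-likelihood
    (equivalently, gradient descent on cross-entropy)."""
    theta = np.zeros(n_models)
    for _ in range(steps):
        p = 1 / (1 + np.exp(theta[losers] - theta[winners]))
        grad = np.zeros(n_models)
        np.add.at(grad, winners, 1 - p)    # d logL / d theta_winner
        np.add.at(grad, losers, -(1 - p))  # d logL / d theta_loser
        theta += lr * grad / len(winners)
        theta -= theta.mean()              # strengths only identified up to a shift
    return theta

theta_hat = fit_bt(winners, losers)

# Bootstrap CIs by resampling votes with replacement and refitting.
boot = []
for _ in range(50):
    idx = rng.integers(0, len(winners), len(winners))
    boot.append(fit_bt(winners[idx], losers[idx]))
boot = np.array(boot)
ci_low, ci_high = np.percentile(boot, [5, 95], axis=0)
```

Scaling `theta` by 400/ln(10) converts these natural-parameter strengths onto the familiar Elo point scale; the mean-centering step handles the model's shift invariance (only differences in strength are identified).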
Arena is the only benchmark that measures real user preference at scale · every frontier launch now publishes an Arena score alongside benchmark results.
Depending on why you're here
- Bradley-Terry Elo on 15M+ human votes · real-time updates
- Category leaderboards: coding, reasoning, math, hard prompts
- Known biases: verbosity, formatting, language distribution
- Use Arena Elo as a proxy for user satisfaction
- Check category leaderboards matching your use case
- Arena Hard is a tougher subset worth checking separately
- Arena ranking drives API revenue · top-3 Arena models capture most demand
- Arena leadership shift (GPT → Claude → Gemini) telegraphs market share shifts
- Watch the Arena rating delta between releases as a speed-of-progress signal
- Humans vote for their favorite AI answer · anonymously
- Like a blind taste test for AI models
- Top models on the leaderboard are the ones users actually prefer
Often confused with
MT-Bench uses LLM-as-judge on 80 curated multi-turn prompts; Arena uses human voters on anonymous live chats. Arena operates at far larger scale, while MT-Bench is more controlled.
Arena is the only benchmark journalists and execs actually read. Every frontier lab now optimizes for Arena Elo as carefully as for MMLU.
Read the primary sources
- Chatbot Arena paper (2024) · arxiv.org
- Live arena · lmarena.ai