

Chatbot Arena · LMSYS
TL;DR

A blind pairwise AI model comparison where humans vote · outputs are anonymous and Elo ratings produce the leaderboard.

Level 1

Chatbot Arena (now LM Arena) is a public platform where anyone can type a prompt and receive responses from two anonymous AI models side-by-side. The user picks a winner. Millions of votes feed an Elo-style rating system. The resulting leaderboard is widely cited as the consumer-preference benchmark · the gap between scores reflects how often humans prefer one model over another on real queries.

Level 2

Chatbot Arena (LMSYS, UC Berkeley) has collected 15M+ votes since 2023. Ratings come from a Bradley-Terry model (an Elo-style pairwise system) updated in real time as votes arrive. Models are presented anonymously as "Model A" vs "Model B"; users choose A, B, tie, or both bad. Category filters (coding, reasoning, hard prompts, math) slice the leaderboard by query type. The overall leaderboard is dominated by general-purpose chat quality; specialized categories rank differently. Methodological concerns: voter self-selection bias, short queries dominating, and style-over-substance preferences. Despite the critiques, Arena is the closest thing to a real-world preference measure in 2026.

Level 3

The Bradley-Terry model estimates each model's strength θ such that P(A beats B) = 1 / (1 + e^(θ_B - θ_A)). Online updates use stochastic gradient on cross-entropy loss of the win-loss matrix. Confidence intervals via bootstrap resampling · typically ±5-15 Elo at 90% CI. Category leaderboards use the same model but subset prompts by classifier-tagged category. Known biases: verbosity bias (longer responses preferred), markdown/formatting bias, and language-distribution bias. Arena Hard (a curated subset) attempts to control for these with harder prompts.
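The estimation loop above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not LMSYS's production code: the model names, the learning rate, and the simulated 80% win rate are all invented for the demo. It applies the win-probability formula from the text and a stochastic-gradient step on the per-vote cross-entropy loss.

```python
import math
import random

def win_prob(theta_a, theta_b):
    """Bradley-Terry: P(A beats B) = 1 / (1 + e^(theta_b - theta_a))."""
    return 1.0 / (1.0 + math.exp(theta_b - theta_a))

def sgd_update(theta, a, b, a_won, lr=0.02):
    """One stochastic-gradient step on the cross-entropy loss of a single vote.

    The gradient of -log-likelihood w.r.t. theta_a is -(outcome - p),
    so A's strength rises when it wins more often than predicted.
    """
    p = win_prob(theta[a], theta[b])
    grad = (1.0 if a_won else 0.0) - p
    theta[a] += lr * grad
    theta[b] -= lr * grad

# Toy run: simulate votes where "strong" beats "weak" 80% of the time.
random.seed(0)
theta = {"strong": 0.0, "weak": 0.0}
for _ in range(20_000):
    sgd_update(theta, "strong", "weak", a_won=random.random() < 0.8)

# The fitted strengths should recover a win probability near 0.8.
print(round(win_prob(theta["strong"], theta["weak"]), 2))
```

Arena's published scores rescale these strengths to an Elo-like range, and the confidence intervals mentioned above come from refitting on bootstrap resamples of the vote log rather than from this online update.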

Why this matters now

Arena is the only benchmark that measures real user preference at scale · every frontier launch now publishes an Arena score alongside benchmark results.

The takeaway for you
If you are a Researcher
  • Bradley-Terry Elo on 15M+ human votes · real-time updates
  • Category leaderboards: coding, reasoning, math, hard prompts
  • Known biases: verbosity, formatting, language distribution
If you are a Builder
  • Use Arena Elo as a proxy for user satisfaction
  • Check the category leaderboards matching your use case
  • Arena Hard is a tougher subset worth checking separately
If you are an Investor
  • Arena ranking drives API revenue · top-3 Arena models capture most demand
  • Arena leadership shifts (GPT → Claude → Gemini) telegraph market-share shifts
  • Watch the Arena rating delta between releases as a speed-of-progress signal
If you are a Curious Normie
  • Humans vote for their favorite AI answer · anonymously
  • Like a blind taste test for AI models
  • Top models on the leaderboard are the ones users actually prefer
Don't mix them up
Chatbot Arena vs MT-Bench

MT-Bench uses LLM-as-judge on 80 curated multi-turn prompts. Arena uses human voters on anonymous live chats. Arena is bigger scale, MT-Bench is more controlled.

Gecko's take

Arena is the only benchmark journalists and execs actually read. Every frontier lab now optimizes for Arena Elo as carefully as for MMLU.

Originally LMSYS (a UC Berkeley spin-out). Rebranded to LM Arena in 2024; now operated by LMArena.ai.