TriviaQA
TriviaQA · reading comprehension benchmark with trivia questions, requiring models to find and reason over evidence from provided documents.
The Frontier
Best score over time · one chart, every benchmark
Full rankings
20 models tested · sorted by score
| # | Model | Score |
|---|---|---|
| 1 | Claude 2 | 87.5 |
| 2 | GPT-3.5 Turbo (older v0613) | 85.8 |
| 3 | GPT-4 Turbo | 84.8 |
| 4 | DeepSeek V3 | 82.9 |
| 5 | Llama 3.1 405B | 82.7 |
| 6 | — | 82.2 |
| 7 | — | 80.0 |
| 8 | — | 79.9 |
| 9 | — | 79.6 |
| 10 | — | 78.9 |
| 11 | — | 77.9 |
| 12 | — | 75.2 |
| 13 | — | 73.9 |
| 14 | MPT-30B | 73.6 |
| 15 | — | 71.9 |
| 16 | — | 67.7 |
| 17 | — | 64.0 |
| 18 | — | 58.1 |
| 19 | — | 53.2 |
| 20 | — | 45.2 |
Score distribution
Where models cluster
Correlated benchmarks
Pearson r · original research
Benchmarks that track with TriviaQA
Pearson correlation across models scored on both benchmarks. Closer to 1 = strongly predictive.
Frequently asked
About TriviaQA
What does TriviaQA measure?
TriviaQA is a reading comprehension benchmark built from trivia questions; models must find and reason over evidence in provided documents. 20 AI models have been tested on it, with scores ranging from 45.2 to 87.5 out of 100.
Which model leads on TriviaQA?
Claude 2 from Anthropic leads TriviaQA with a score of 87.5. The median score across 20 tested models is 78.4.
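With an even number of models, the median is the mean of the two middle sorted scores: (78.9 + 77.9) / 2 = 78.4. A quick check against the scores in the rankings table:

```python
import statistics

# The 20 TriviaQA scores from the rankings table, highest first
scores = [87.5, 85.8, 84.8, 82.9, 82.7, 82.2, 80.0, 79.9, 79.6, 78.9,
          77.9, 75.2, 73.9, 73.6, 71.9, 67.7, 64.0, 58.1, 53.2, 45.2]

# Even count: statistics.median averages the 10th and 11th values
print(round(statistics.median(scores), 1))  # → 78.4
```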
Is TriviaQA saturated?
No · the top score is 87.5 out of 100 (87.5%). There is still meaningful room for improvement on TriviaQA.
Does TriviaQA predict performance on other benchmarks?
Yes · TriviaQA scores correlate 0.90 with HellaSwag across 16 shared models. Models that do well on TriviaQA tend to do well on HellaSwag.
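Pearson r can be computed directly from the paired score lists. A minimal sketch (the score pairs below are illustrative, not the actual 16 shared models):

```python
import statistics

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical score pairs for models evaluated on both benchmarks
triviaqa = [87.5, 84.8, 82.7, 73.6, 45.2]
hellaswag = [95.2, 93.1, 90.4, 82.0, 61.5]
print(round(pearson_r(triviaqa, hellaswag), 2))
```

A value near 1 means rank order on one benchmark strongly predicts rank order on the other; near 0 means the benchmarks measure largely independent capabilities.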
How often is TriviaQA data refreshed?
BenchGecko pulls updates daily. New model scores on TriviaQA appear as soon as they are published by Epoch AI or the model provider.
- Category: Knowledge
- Max score: 100
- Models: 20
- Updated: 2024-12-26
Top on TriviaQA
- Claude 2 · 87.5
- GPT-3.5 Turbo (older v0613) · 85.8
- GPT-4 Turbo · 84.8
- DeepSeek V3 · 82.9
- Llama 3.1 405B · 82.7
More knowledge benchmarks
Same category · related evaluations