Which model leads on OpenBookQA?

phi-3-mini 3.8B from Microsoft leads OpenBookQA with a score of 84.0, tied with phi-3-small 7.4B. The median score across the 19 tested models is 52.3.

Is OpenBookQA saturated?

No · the top score is 84.0 out of 100 (84%). There is still meaningful room for improvement on OpenBookQA.

OpenBookQA scores correlate 0.86 with ANLI across the 9 models tested on both benchmarks. Models that do well on OpenBookQA tend to do well on ANLI.
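A correlation figure like the one above can be computed from paired scores for the models two benchmarks share. The page does not say which correlation measure is used, so this minimal sketch assumes the Pearson coefficient. The function name `pearson` and the sample values are hypothetical placeholders, not real benchmark scores.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)

# Hypothetical placeholder values, NOT real benchmark scores.
benchmark_a = [10.0, 20.0, 30.0, 40.0]
benchmark_b = [15.0, 25.0, 35.0, 45.0]
r = pearson(benchmark_a, benchmark_b)
print(round(r, 2))
```

With only 9 shared models, a coefficient of this kind is sensitive to individual outliers, so it is best read as a rough indication.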

BenchGecko pulls updates daily. New model scores on OpenBookQA appear as soon as they are published by Epoch AI or the model provider.

Benchmark · Knowledge · Settled

Name: OpenBookQA Benchmark
Creator: BenchGecko
License: https://creativecommons.org/licenses/by/4.0/

OpenBookQA · science questions that require combining a given core fact with broad common knowledge, mimicking an open-book exam setting.

Updated 2024-07-16

Models tested: 19

Top score: 84.0 (phi-3-mini 3.8B)

Median: 52.3 (min 14.4)

Top-5 spread: σ 1.3 (Settled)
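The summary figures above can be reproduced from the scores in the leaderboard table on this page. The sketch below assumes that "Top-5 spread" means the population standard deviation of the five highest scores. The page does not define it, but that reading yields 1.3, whereas the sample standard deviation would round to 1.4.

```python
from statistics import median, pstdev

# Scores copied from the leaderboard table on this page, highest first.
scores = [84.0, 84.0, 83.2, 81.3, 81.1, 76.8, 73.1, 71.5, 64.8, 52.3,
          42.7, 41.9, 36.0, 32.3, 30.1, 20.3, 18.9, 16.3, 14.4]

top_score = max(scores)
median_score = median(scores)   # middle value of the 19 scores
min_score = min(scores)

# Assumption: "Top-5 spread" is the population standard deviation
# of the five best scores.
top5_spread = pstdev(sorted(scores, reverse=True)[:5])

print(len(scores), top_score, median_score, min_score, round(top5_spread, 1))
# prints: 19 84.0 52.3 14.4 1.3
```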

Best score over time · not enough history to compute a frontier yet (0 frontier records).

19 models tested · sorted by score

#	Model	Score	Price
1	phi-3-mini 3.8B· Microsoft	84.0	—
2	phi-3-small 7.4B· Microsoft	84.0	—
3	phi-3-medium 14B· Microsoft	83.2	—
4	GPT-3.5 Turbo (older v0613)· OpenAI	81.3	$1.00
5	Mixtral 8x7B Instruct· Mistral AI	81.1	$0.54
6	Llama 3 8B Instruct· Meta	76.8	$0.03
7	Mistral 7B V0.1· Mistral AI	73.1	—
8	Gemma 2B· Google DeepMind	71.5	—
9	Phi 2· Microsoft	64.8	—
10	Falcon-180B· TII	52.3	—
11	Llama 2-13B· Meta	42.7	—
12	LLaMA-13B· Meta	41.9	—
13	MPT-30B· Unknown	36.0	—
14	Llama 3.1 405B· Meta	32.3	—
15	Llama 3 70B Instruct· Meta	30.1	$0.51
16	XGen-7B· Unknown	20.3	—
17	Dolly 2.0-12b· Unknown	18.9	—
18	Phi-1.5· Microsoft	16.3	—
19	Cerebras-GPT-13B· Cerebras	14.4	—
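For readers who want to work with the table programmatically, the rows can be split into fields. This is a minimal sketch, assuming the rows are tab-separated as they appear here, that "·" separates the model name from its creator, and that "—" means no price is listed. The helper name `parse_row` is hypothetical, and the page does not state the unit of the Price column.

```python
def parse_row(line):
    """Split one tab-separated leaderboard row into a dict."""
    rank, model_cell, score, price = line.split("\t")
    # The model cell looks like "Model name· Creator".
    model, _, creator = model_cell.partition("·")
    price = price.strip()
    return {
        "rank": int(rank),
        "model": model.strip(),
        "creator": creator.strip(),
        "score": float(score),
        # Assumption: "—" marks a missing price; otherwise strip the "$".
        "price": None if price == "—" else float(price.lstrip("$")),
    }

# Two rows copied from the table above.
rows = [
    "4\tGPT-3.5 Turbo (older v0613)· OpenAI\t81.3\t$1.00",
    "10\tFalcon-180B· TII\t52.3\t—",
]
parsed = [parse_row(r) for r in rows]
print(parsed[0]["model"], parsed[0]["price"], parsed[1]["price"])
```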