OpenBookQA
OpenBookQA · science questions that require combining a given core fact with broad common knowledge, mimicking an open-book exam setting.
Full rankings
19 models tested · sorted by score
| # | Model | Score |
|---|---|---|
| 1 | phi-3-mini 3.8B | 84.0 |
| 2 | phi-3-small 7.4B | 84.0 |
| 3 | phi-3-medium 14B | 83.2 |
| 4 | GPT-3.5 Turbo (older v0613) | 81.3 |
| 5 | Mixtral 8x7B Instruct | 81.1 |
| 6 | | 76.8 |
| 7 | | 73.1 |
| 8 | | 71.5 |
| 9 | | 64.8 |
| 10 | | 52.3 |
| 11 | | 42.7 |
| 12 | | 41.9 |
| 13 | MPT-30B | 36.0 |
| 14 | | 32.3 |
| 15 | | 30.1 |
| 16 | XGen-7B | 20.3 |
| 17 | Dolly 2.0-12b | 18.9 |
| 18 | | 16.3 |
| 19 | | 14.4 |
Correlated benchmarks
Pearson r · original research
Benchmarks that track with OpenBookQA
Pearson correlation across models scored on both benchmarks. Closer to 1 = strongly predictive.
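As a rough illustration of how such a figure can be computed, the sketch below restricts two score tables to the models present in both and takes the Pearson correlation over that shared set. The function names and the example scores are hypothetical placeholders, not BenchGecko's code or data.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def shared_model_correlation(scores_a, scores_b):
    """Correlate two benchmarks using only models scored on both.

    Returns (r, number_of_shared_models).
    """
    shared = sorted(set(scores_a) & set(scores_b))
    r = pearson_r([scores_a[m] for m in shared],
                  [scores_b[m] for m in shared])
    return r, len(shared)


# Hypothetical scores, for illustration only (not real benchmark results).
bench_a = {"model-a": 80.0, "model-b": 60.0, "model-c": 40.0, "model-d": 20.0}
bench_b = {"model-a": 70.0, "model-b": 55.0, "model-c": 45.0, "model-e": 10.0}

r, n_shared = shared_model_correlation(bench_a, bench_b)
print(f"r = {r:.2f} across {n_shared} shared models")
```

Models that appear on only one benchmark (here `model-d` and `model-e`) are dropped before the correlation is taken, which is why the number of shared models is reported alongside r.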
Frequently asked
About OpenBookQA
What does OpenBookQA measure?
OpenBookQA measures science question answering that requires combining a given core fact with broad common knowledge, mimicking an open-book exam setting. It has been run on 19 AI models, with scores ranging from 14.4 to 84.0 out of 100.
Which model leads on OpenBookQA?
phi-3-mini 3.8B from Microsoft leads OpenBookQA with a score of 84.0, tied with phi-3-small 7.4B at the same score. The median score across the 19 tested models is 52.3.
Is OpenBookQA saturated?
No · the top score is 84.0 out of 100 (84%). There is still meaningful room for improvement on OpenBookQA.
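The summary figures quoted in these answers (range, median, and top score as a share of the maximum) can be recomputed from the 19 scores in the rankings table. A minimal sketch, with variable names chosen here for illustration:

```python
from statistics import median

# The 19 scores listed in the full rankings table, highest first.
scores = [
    84.0, 84.0, 83.2, 81.3, 81.1, 76.8, 73.1, 71.5, 64.8, 52.3,
    42.7, 41.9, 36.0, 32.3, 30.1, 20.3, 18.9, 16.3, 14.4,
]
MAX_SCORE = 100  # maximum possible score, as listed on the page

summary = {
    "models": len(scores),
    "min": min(scores),
    "max": max(scores),
    "median": median(scores),  # middle (10th) value of the 19 scores
    "top_pct": 100 * max(scores) / MAX_SCORE,
}
print(summary)
```

With an odd number of models, the median is simply the 10th-ranked score, which matches the 52.3 reported above.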
Does OpenBookQA predict performance on other benchmarks?
Yes · OpenBookQA scores have a Pearson correlation of 0.86 with ANLI scores across the 9 models scored on both benchmarks. Models that do well on OpenBookQA tend to do well on ANLI.
How often is OpenBookQA data refreshed?
BenchGecko pulls updates daily. New model scores on OpenBookQA appear as soon as they are published by Epoch AI or the model provider.
- Category: Knowledge
- Max score: 100
- Models: 19
- Updated: 2024-07-16
Top on OpenBookQA
- phi-3-mini 3.8B · 84.0
- phi-3-small 7.4B · 84.0
- phi-3-medium 14B · 83.2
- GPT-3.5 Turbo (older v0613) · 81.3
- Mixtral 8x7B Instruct · 81.1
More knowledge benchmarks
Same category · related evaluations