Which model leads on ScienceQA?

GPT-4o (2024-05-13) from OpenAI leads ScienceQA with a score of 84.7, tied with GPT-4o (2024-11-20). The median score across the 5 tested models is 62.7.

Is ScienceQA saturated?

No · the top score is 84.7 out of 100, leaving about 15 points of headroom on ScienceQA.

Does ScienceQA predict performance on other benchmarks?

Yes · ScienceQA scores correlate 0.99 with MMLU across 5 shared models. Models that do well on ScienceQA tend to do well on MMLU.

How often is ScienceQA data refreshed?

BenchGecko pulls updates daily. New model scores on ScienceQA appear as soon as they are published by Epoch AI or the model provider.

Benchmark · Multimodal · Wide open

ScienceQA

Name: ScienceQA Benchmark
Creator: BenchGecko
License: https://creativecommons.org/licenses/by/4.0/

ScienceQA · multimodal science questions spanning natural science, social science, and language science with diverse question formats and image context.

Updated 2024-11-20

Models tested

5

Top score

84.7

GPT-4o (2024-05-13)

Median

62.7

min 24.4

Top-5 spread

σ 23.9

wide open

The Frontier

Best score over time · one chart, every benchmark

Not enough score history on ScienceQA to compute a frontier yet.

Frontier records · 0 total

Full rankings

5 models tested · sorted by score

#	Model	Score	Price
1	GPT-4o (2024-05-13) · OpenAI	84.7	$5.00
2	GPT-4o (2024-11-20) · OpenAI	84.7	$2.50
3	Claude 3 Haiku · Anthropic	62.7	$0.25
4	Llama 2-13B · Meta	41.0	—
5	LLaMA-13B · Meta	24.4	—
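
The summary figures on this page (top score, median, min, spread) can be recomputed from the rankings table. The sketch below is illustrative only: the `summarize` helper and its field names are hypothetical, and it assumes σ is the population standard deviation of the five scores, which is the reading that matches the 23.9 shown above.

```python
from statistics import median, pstdev

# Scores copied from the rankings table above.
scores = {
    "GPT-4o (2024-05-13)": 84.7,
    "GPT-4o (2024-11-20)": 84.7,
    "Claude 3 Haiku": 62.7,
    "Llama 2-13B": 41.0,
    "LLaMA-13B": 24.4,
}


def summarize(scores, max_score=100.0):
    """Recompute the stat-card values from a {model: score} mapping.

    Hypothetical helper; assumes a 0-100 scale and population std dev.
    """
    values = list(scores.values())
    top = max(values)
    return {
        "top": top,
        # Every model that shares the top score (ties are kept).
        "leaders": sorted(name for name, s in scores.items() if s == top),
        "median": median(values),
        "min": min(values),
        # Population standard deviation, rounded to one decimal.
        "spread": round(pstdev(values), 1),
        # Points remaining before the benchmark's maximum score.
        "headroom": round(max_score - top, 1),
    }


summary = summarize(scores)
print(summary)
```

Listing every model that shares the top score makes the tie between the two GPT-4o snapshots explicit, and the headroom figure is the number the saturation answer depends on.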