ScienceQA
ScienceQA · multimodal science questions spanning natural science, social science, and language science with diverse question formats and image context.
The Frontier
Best score over time · one chart, every benchmark
Full rankings
5 models tested · sorted by score
| # | Model | Score |
|---|---|---|
| 1 | GPT-4o (2024-05-13) | 84.7 |
| 2 | GPT-4o (2024-11-20) | 84.7 |
| 3 | Claude 3 Haiku | 62.7 |
| 4 | Llama 2-13B | 41.0 |
| 5 | LLaMA-13B | 24.4 |
Score distribution
Where models cluster
Correlated benchmarks
Pearson r · original research
Benchmarks that track with ScienceQA
Pearson correlation across models scored on both benchmarks. Closer to 1 = strongly predictive.
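As a rough illustration of how such a coefficient is computed, here is a minimal sketch of Pearson's r in plain Python. The function name `pearson_r` is hypothetical and not part of any BenchGecko tooling. Because this page does not list per-model scores for the other benchmark, the second series below is only a linear rescaling of the ScienceQA scores, used as a sanity check and not as real benchmark data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and of squared deviations from each mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one series is constant")
    return cov / math.sqrt(var_x * var_y)


# ScienceQA scores listed on this page, one per model.
scienceqa = [84.7, 84.7, 62.7, 41.0, 24.4]

# Sanity-check series (NOT real benchmark scores): linear transforms of the above.
rescaled = [0.5 * s + 10 for s in scienceqa]   # perfectly correlated
inverted = [100 - s for s in scienceqa]        # perfectly anti-correlated

print(pearson_r(scienceqa, rescaled))
print(pearson_r(scienceqa, inverted))
```

The two calls should print values very close to 1 and -1 respectively. To compare against a real benchmark, replace the second list with that benchmark's scores for the same models in the same order.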
Frequently asked
About ScienceQA
What does ScienceQA measure?
ScienceQA measures multimodal science question answering: its questions span natural science, social science, and language science, come in diverse formats, and include image context. 5 AI models have been tested on it. Scores range from 24.4 to 84.7 out of 100.
Which model leads on ScienceQA?
GPT-4o (2024-05-13) and GPT-4o (2024-11-20), both from OpenAI, share the top ScienceQA score of 84.7. The median score across 5 tested models is 62.7.
Is ScienceQA saturated?
No · the top score is 84.7 out of 100 (85%). There is still meaningful room for improvement on ScienceQA.
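The summary figures quoted in these answers (range, median, and the top score as a share of the maximum) can be reproduced from the five scores listed on this page. The following is a minimal sketch; the `summarize` helper is a hypothetical name introduced here for illustration.

```python
from statistics import median

# Scores copied from the model list on this page.
scores = {
    "GPT-4o (2024-05-13)": 84.7,
    "GPT-4o (2024-11-20)": 84.7,
    "Claude 3 Haiku": 62.7,
    "Llama 2-13B": 41.0,
    "LLaMA-13B": 24.4,
}
MAX_SCORE = 100  # "Max score" as stated on this page


def summarize(scores, max_score=MAX_SCORE):
    """Return top score, tied leaders, lowest score, median, and % of max."""
    values = sorted(scores.values())
    top = values[-1]
    return {
        "top": top,
        "leaders": sorted(name for name, s in scores.items() if s == top),
        "low": values[0],
        "median": median(values),
        "pct_of_max": round(100 * top / max_score),
    }


summary = summarize(scores)
print(summary)
```

Listing `leaders` separately makes ties visible: two models share the top score here.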
Does ScienceQA predict performance on other benchmarks?
Yes · ScienceQA scores have a Pearson correlation of 0.99 with MMLU across 5 shared models. Models that do well on ScienceQA tend to do well on MMLU.
How often is ScienceQA data refreshed?
BenchGecko pulls updates daily. New model scores on ScienceQA appear as soon as they are published by Epoch AI or the model provider.
- Category: Multimodal
- Max score: 100
- Models: 5
- Updated: 2024-11-20
Top on ScienceQA
- GPT-4o (2024-05-13) · 84.7
- GPT-4o (2024-11-20) · 84.7
- Claude 3 Haiku · 62.7
- Llama 2-13B · 41.0
- LLaMA-13B · 24.4
More multimodal benchmarks
Same category · related evaluations