CharXiv Reasoning
CharXiv Reasoning tests a model's ability to understand and reason about charts, plots, and figures extracted from real arXiv scientific papers. The model must answer questions that require reading axes, comparing data series, identifying trends, and drawing conclusions from visual scientific data. The benchmark measures the intersection of visual understanding and scientific reasoning, a critical capability for research assistants and document analysis. Scores differ substantially between runs with and without tool use, making it a key differentiator for multimodal and agentic AI systems.
Tool augmentation adds 7+ points. Mythos scores 93.2% with tools vs 86.1% without, a 7.1-point gain.
Scoring: Accuracy on chart reasoning questions. Each question has a verified correct answer. Evaluated in two modes: bare (no tools) and tool-augmented (code execution + image processing).
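The two-mode scoring described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual harness: the function names `accuracy` and `tool_gain` and the five graded results are hypothetical, and it assumes each answer has already been graded as correct or incorrect. Only the 93.2 and 86.1 figures come from this page.

```python
def accuracy(graded):
    """Return the percentage of correct answers in a list of booleans."""
    if not graded:
        raise ValueError("no graded answers")
    return 100.0 * sum(graded) / len(graded)


def tool_gain(bare, with_tools):
    """Point difference between tool-augmented and bare accuracy."""
    return round(accuracy(with_tools) - accuracy(bare), 1)


# Hypothetical grading results for the same five questions in each mode.
bare_results = [True, False, True, True, False]
tool_results = [True, True, True, True, False]

bare_score = accuracy(bare_results)            # 60.0
tool_score = accuracy(tool_results)            # 80.0
gain = tool_gain(bare_results, tool_results)   # 20.0

# The same subtraction applied to the scores reported on this page.
reported_gain = round(93.2 - 86.1, 1)          # 7.1
```

Comparing the two modes on the same question set keeps the delta attributable to tool access rather than to a change in questions.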
Full rankings
3 models tested · sorted by score · includes 3 verified scores
| # | Model | Score |
|---|---|---|
| 1 | Claude Mythos Preview | 86.1 |
| 2 | | 82.1 |
| 3 | | 69.1 |
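The summary figures for this leaderboard can be recomputed from the three scores in the table. The snippet below is a small sketch using only Python's standard library; the `summary` dictionary and its keys are illustrative names, not part of any published tooling.

```python
from statistics import median

# Scores copied from the rankings table above.
scores = [86.1, 82.1, 69.1]

summary = {
    "models": len(scores),
    "top": max(scores),
    "median": median(scores),
    "low": min(scores),
    # Points remaining before the 100-point maximum is reached.
    "headroom": round(100 - max(scores), 1),
}
```

With three entries the median is simply the middle score, 82.1, and the top score leaves 13.9 points of headroom.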
How it works
Evaluation methodology
CharXiv Reasoning presents models with charts, plots, and figures extracted from real arXiv papers, paired with questions that require both visual parsing and scientific reasoning. Questions range from reading axis values and comparing bar heights to interpreting trends, computing derived statistics, and drawing scientific conclusions from visualized data. The evaluation tests chart types including bar charts, line plots, scatter plots, heatmaps, and multi-panel figures from published research papers.
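The evaluation flow described above, where each chart is paired with a question and a verified answer, might look roughly like the loop below. Everything here is an assumption made for illustration: the `ChartQuestion` fields, the `model` callable signature, the stub model, and the file names are hypothetical, and exact match after normalization stands in for whatever grading procedure the benchmark actually uses, which this page does not specify.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class ChartQuestion:
    image_path: str  # path to the chart image (hypothetical field)
    question: str    # question about the chart
    answer: str      # verified reference answer


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(text.strip().lower().split())


def evaluate(model: Callable[[str, str], str],
             questions: Iterable[ChartQuestion]) -> float:
    """Return accuracy (0-100) of `model` over the given questions."""
    total = 0
    correct = 0
    for q in questions:
        prediction = model(q.image_path, q.question)
        # Assumed grading rule: exact match after normalization.
        if normalize(prediction) == normalize(q.answer):
            correct += 1
        total += 1
    if total == 0:
        raise ValueError("no questions to evaluate")
    return 100.0 * correct / total


def stub_model(image_path: str, question: str) -> str:
    """Stand-in for a real multimodal model; ignores the image entirely."""
    return "Method A" if "highest" in question else "unknown"


questions = [
    ChartQuestion("fig1.png", "Which method has the highest accuracy?", "method a"),
    ChartQuestion("fig2.png", "What is the trend after epoch 10?", "decreasing"),
]
score = evaluate(stub_model, questions)  # 50.0
```

Passing the model in as a plain callable means the same loop can score a bare model and a tool-augmented one without any other change.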
Industry relevance
Why teams track this benchmark
Scientific papers are overwhelmingly visual. Any AI research assistant that cannot interpret figures is fundamentally limited. CharXiv Reasoning measures the exact intersection of vision and scientific depth that production research tools require.
Practical takeaways
By role
If you are building a research assistant that processes scientific papers, test it against CharXiv Reasoning. The gain from tool augmentation (7+ points) suggests that giving your model code execution access pays off.
Mythos at 93.2% with tools signals near-human performance on scientific figure analysis. Companies building on this capability have a shrinking quality gap to close.
The with-tools vs without-tools delta (7+ points across models) is a clean signal for studying how tool use amplifies multimodal reasoning.
Frequently asked
About CharXiv Reasoning
What does CharXiv Reasoning measure?
CharXiv Reasoning tests a model's ability to understand and reason about charts, plots, and figures extracted from real arXiv scientific papers. The model must answer questions that require reading axes, comparing data series, identifying trends, and drawing conclusions from visual scientific data. The benchmark measures the intersection of visual understanding and scientific reasoning, a critical capability for research assistants and document analysis. Scores differ substantially between runs with and without tool use, making it a key differentiator for multimodal and agentic AI systems. 3 AI models have been tested on it. Scores range from 69.1 to 86.1 out of 100.
Which model leads on CharXiv Reasoning?
Claude Mythos Preview from Anthropic leads CharXiv Reasoning with a score of 86.1. The median score across the 3 tested models is 82.1.
Is CharXiv Reasoning saturated?
No. The top score is 86.1 out of 100 (86%), so there is still meaningful room for improvement on CharXiv Reasoning.
What makes CharXiv Reasoning distinctive?
CharXiv Reasoning is a multimodal benchmark with limited overlap with the rest of the catalog: it measures capabilities that are not well covered by other benchmarks we track.
How often is CharXiv Reasoning data refreshed?
BenchGecko pulls updates daily. New model scores on CharXiv Reasoning appear as soon as they are published by Epoch AI or the model provider.
- Category: Multimodal
- Creator: Princeton NLP
- Max score: 100
- Modality: Multimodal
- Scoring: Accuracy on chart reasoning questions. Each question has a verified correct answer. Evaluated in two modes: bare (no tools) and tool-augmented (code execution + image processing).
- Models: 3
- Updated: 2026-04-07
“CharXiv Reasoning is where multimodal meets scientific depth. If your use case involves reading research papers with figures, this is the benchmark to watch.”