Benchmark · Multimodal · Settled

CharXiv Reasoning

CharXiv Reasoning tests a model's ability to understand and reason about charts, plots, and figures extracted from real arXiv scientific papers. The model must answer questions that require reading axes, comparing data series, identifying trends, and drawing conclusions from visual scientific data. The benchmark measures the intersection of visual understanding and scientific reasoning, a critical capability for research assistants and document analysis. Scores differ substantially with and without tool use, which makes it a key differentiator for multimodal and agentic AI systems.

Updated 2026-04-07

Tool augmentation adds more than 7 points: Claude Mythos Preview scores 93.2% with tools vs 86.1% without.

Scoring: Accuracy on chart reasoning questions. Each question has a verified correct answer. Evaluated in two modes: bare (no tools) and tool-augmented (code execution + image processing).
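As a minimal sketch of the scoring described above, the snippet below computes accuracy in each mode and the gap between them. The `Result` record, its field names, and the four sample rows are hypothetical illustrations, not the benchmark's actual data format or harness; only the 93.2 and 86.1 figures come from this page.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class Result:
    """One graded question (hypothetical record layout)."""
    question_id: str
    correct_bare: bool    # answer matched the verified answer without tools
    correct_tools: bool   # answer matched the verified answer with tools


def accuracy(flags: Iterable[bool]) -> float:
    """Percentage of questions answered correctly."""
    flags = list(flags)
    if not flags:
        raise ValueError("no graded questions")
    return 100.0 * sum(flags) / len(flags)


def tool_uplift(results: List[Result]) -> Tuple[float, float, float]:
    """Return (bare accuracy, tool-augmented accuracy, difference in points)."""
    bare = accuracy(r.correct_bare for r in results)
    tools = accuracy(r.correct_tools for r in results)
    return bare, tools, tools - bare


# Made-up sample rows, only to exercise the functions.
sample = [
    Result("q1", True, True),
    Result("q2", False, True),
    Result("q3", True, True),
    Result("q4", False, False),
]
bare, tools, delta = tool_uplift(sample)
print(f"bare={bare:.1f} tools={tools:.1f} uplift={delta:.1f}")

# Gap between the two scores reported on this page.
reported_uplift = round(93.2 - 86.1, 1)
print(f"reported uplift: {reported_uplift} points")
```

Grading both modes on the same question set keeps the difference attributable to tool access rather than to question selection.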

Models tested: 3
Top score: 86.1 (Claude Mythos Preview)
Median: 86.1 (min 86.1)
Top-5 spread: σ 0.0 (Settled)
CharXiv Reasoning · Top 3 (scale 0 to 100)
#1 Claude Mythos Preview: 86.1
#2 Claude Opus 4.7 (verified): 82.1
#3 Claude Opus 4.6 (verified): 69.1
Source: benchgecko.ai/benchmark/charxiv-reasoning
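The summary figures on this page include a median and a spread (σ). As a rough sketch of how such statistics could be derived from the three leaderboard scores, the snippet below uses Python's standard `statistics` module and assumes σ means the population standard deviation. The page does not state which definition or which subset of results it uses, so its own summary figures need not match this calculation.

```python
import statistics

# Leaderboard scores as listed on this page.
scores = {
    "Claude Mythos Preview": 86.1,
    "Claude Opus 4.7": 82.1,
    "Claude Opus 4.6": 69.1,
}

values = list(scores.values())

top_model = max(scores, key=scores.get)   # model with the highest score
median = statistics.median(values)        # middle value of the sorted scores
lowest = min(values)
spread = statistics.pstdev(values)        # assumption: population std. deviation

print(f"top: {top_model} ({scores[top_model]})")
print(f"median: {median}, min: {lowest}, spread: {spread:.2f}")
```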
Details
Category: Multimodal
Creator: Princeton NLP
Max score: 100
Modality: Multimodal
Scoring: Accuracy on chart reasoning questions. Each question has a verified correct answer. Evaluated in two modes: bare (no tools) and tool-augmented (code execution + image processing).
Models: 3
Updated: 2026-04-07
Tests: Chart reading, Scientific reasoning, Visual data interpretation, Multi-panel figures
Does not test: Code, Long context, Speed, Tool use, Text-only reasoning
Gecko's Take

CharXiv Reasoning is where multimodal meets scientific depth. If your use case involves reading research papers with figures, this is the benchmark to watch.

Same category · related evaluations