Which model leads on CharXiv Reasoning?

Claude Mythos Preview from Anthropic leads CharXiv Reasoning with a score of 86.1. The median score across the 3 tested models is 82.1.

Is CharXiv Reasoning saturated?

No. The top score is 86.1 out of 100 (86%), so there is still meaningful room for improvement on CharXiv Reasoning.

What makes CharXiv Reasoning distinctive?

CharXiv Reasoning is a multimodal benchmark with limited overlap with the rest of the catalog; it measures capabilities that are not well covered by the other benchmarks we track.

How often is CharXiv Reasoning data refreshed?

BenchGecko pulls updates daily. New model scores on CharXiv Reasoning appear as soon as they are published by Epoch AI or the model provider.

Benchmark · Multimodal · Settled

CharXiv Reasoning

Name: CharXiv Reasoning Benchmark
Creator: BenchGecko
License: https://creativecommons.org/licenses/by/4.0/

CharXiv Reasoning tests a model's ability to understand and reason about charts, plots, and figures extracted from real arXiv scientific papers. The model must answer questions that require reading axes, comparing data series, identifying trends, and drawing conclusions from visual scientific data. The benchmark measures the intersection of visual understanding and scientific reasoning, a critical capability for research assistants and document analysis. Scores differ substantially with and without tool use, which makes it a key differentiator for multimodal and agentic AI systems.

Updated 2026-04-07

Tool use adds about 7 points: Claude Mythos Preview scores 93.2% with tools vs 86.1% without, a gain of 7.1 points.

Scoring: Accuracy on chart reasoning questions. Each question has a verified correct answer. Evaluated in two modes: bare (no tools) and tool-augmented (code execution + image processing).
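The scoring rule above can be illustrated with a minimal sketch. The grading method is not described on this page, so the normalized exact-match comparison below is an assumption, and the questions and predictions are hypothetical toy data, not benchmark items.

```python
def accuracy(predictions, answers):
    """Return the percentage of predictions that match the verified answers.

    Assumption: answers are compared by case-insensitive exact match after
    trimming whitespace. The benchmark's real grader may differ.
    """
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    correct = sum(
        p.strip().lower() == a.strip().lower()
        for p, a in zip(predictions, answers)
    )
    return 100.0 * correct / len(answers)


# Hypothetical toy data: four chart questions with verified answers.
answers = ["12.5", "increasing", "panel b", "3"]
bare_predictions = ["12.5", "decreasing", "panel b", "4"]   # no tools
tool_predictions = ["12.5", "increasing", "Panel B", "4"]   # with tools

bare_score = accuracy(bare_predictions, answers)
tool_score = accuracy(tool_predictions, answers)
tool_gain = tool_score - bare_score

# The same subtraction applied to the scores reported on this page.
reported_gain = round(93.2 - 86.1, 1)
```

Each mode is scored independently on the same questions, so the tool gain is simply the difference between the two accuracies.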

Models tested

Top score

86.1

Claude Mythos Preview

Median

82.1

min 69.1

Top-5 spread

σ 0.0

Settled

Full rankings

3 models tested · sorted by score · includes 3 verified scores

#	Model	Score	Price	Source
1	Claude Mythos Preview · Anthropic	86.1	—	Anthropic Mythos System Card (no-tools), Apr 2026
2	Claude Opus 4.7 · Anthropic (verified)	82.1	—	Anthropic Opus 4.7 Announcement (no-tools), Apr 2026
3	Claude Opus 4.6 · Anthropic (verified)	69.1	—	Anthropic Opus 4.6 System Card (no-tools), Feb 2026
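Summary statistics can be recomputed directly from the rankings table. The sketch below uses only the three scores listed above. How BenchGecko computes its displayed spread is not stated, so the population standard deviation used here is an assumption.

```python
from statistics import median, pstdev

# Scores copied from the rankings table (no-tools mode).
scores = {
    "Claude Mythos Preview": 86.1,
    "Claude Opus 4.7": 82.1,
    "Claude Opus 4.6": 69.1,
}

ranked = sorted(scores.values(), reverse=True)

top_score = ranked[0]
median_score = median(ranked)
min_score = min(ranked)

# Headroom to the maximum score of 100.
headroom = round(100 - top_score, 1)

# Spread across the top five entries (only three exist here).
# Assumption: population standard deviation; the site may use another formula.
top5_spread = round(pstdev(ranked[:5]), 2)

print(f"top={top_score} median={median_score} min={min_score}")
print(f"headroom={headroom} top5_spread={top5_spread}")
```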

Details

Category: Multimodal
Creator: Princeton NLP
Max score: 100
Modality: Multimodal
Scoring: Accuracy on chart reasoning questions. Each question has a verified correct answer. Evaluated in two modes: bare (no tools) and tool-augmented (code execution + image processing).
Models: 3
Updated: 2026-04-07

Tests

Chart reading · Scientific reasoning · Visual data interpretation · Multi-panel figures

Does not test

Code · Long context · Speed · Tool use · Text-only reasoning

Links

Anthropic Mythos System Card, Apr 2026
Anthropic Opus 4.6 System Card, Feb 2026

Gecko's Take

“CharXiv Reasoning is where multimodal meets scientific depth. If your use case involves reading research papers with figures, this is the benchmark to watch.”

Related benchmarks

VPCT (22 models) · VideoMME (8 models) · ScienceQA (5 models) · CharXiv Reasoning (with tools) (1 model)