Which model leads on CharXiv Reasoning (with tools)?

Claude Mythos Preview from Anthropic leads CharXiv Reasoning (with tools) with a score of 93.2. The median score across 1 tested models is 93.2.

Is CharXiv Reasoning (with tools) saturated?

No · the top score is 93.2 out of 100 (93%). There is still meaningful room for improvement on CharXiv Reasoning (with tools).

What makes CharXiv Reasoning (with tools) distinctive?

CharXiv Reasoning (with tools) is a multimodal benchmark with limited overlap to the rest of the catalog · it measures capabilities that are not well-covered by other benchmarks we track.

How often is CharXiv Reasoning (with tools) data refreshed?

BenchGecko pulls updates daily. New model scores on CharXiv Reasoning (with tools) appear as soon as they are published by Epoch AI or the model provider.

Benchmark · MultimodalSettled

CharXiv Reasoning (with tools)

Name: CharXiv Reasoning (with tools) Benchmark
Creator: BenchGecko
License: https://creativecommons.org/licenses/by/4.0/

CharXiv Reasoning (with tools) · the tool-augmented variant of CharXiv Reasoning. Models can use code execution, image processing, and other tools to analyze charts from arXiv papers. Claude Mythos Preview scores 93.2% with tools vs 86.1% without · demonstrating how tool use dramatically improves visual scientific reasoning. The gap between tool-augmented and bare performance is a key signal for agent capability.

Updated 2026-04-07

Tool use adds 7+ percentage points consistently across models. Mythos leads at 93.2%, up from 86.1% bare.

Scoring: Same accuracy metric as CharXiv Reasoning. Tool use is optional per question · the model decides when to invoke code execution vs answering directly.

Models tested

Top score

93.2

Claude Mythos Preview

Median

93.2

min 93.2

Top-5 spread

σ 0.0

Settled

Full rankings

3 models tested · sorted by score · includes 3 verified scores

#	Model	Score	Price	Source
1	Claude Mythos Preview· Anthropic	93.2	—	Anthropic Mythos System Card (with-tools), Apr 2026
2	Claude Opus 4.7· Anthropicverified	91.0	—	Anthropic Opus 4.7 Announcement (with-tools), Apr 2026
3	Claude Opus 4.6· Anthropicverified	84.7	—	Anthropic Opus 4.6 System Card (with-tools), Feb 2026

Details

Category: Multimodal
Max score: 100
Scoring: Same accuracy metric as CharXiv Reasoning. Tool use is optional per question · the model decides when to invoke code execution vs answering directly.
Models: 3
Updated: 2026-04-07

Links

Anthropic Mythos System Card, Apr 2026 Anthropic Opus 4.7 Announcement, Apr 2026

Gecko's Take

“The with-tools variant is the real benchmark. Any production deployment of a research assistant will have tool access. Bare scores are academic; tool-augmented scores predict production quality.”

Related benchmarks

VPCT22 models VideoMME8 models ScienceQA5 models CharXiv Reasoning1 models