Benchmark · MultimodalSettled

CharXiv Reasoning (with tools)

CharXiv Reasoning (with tools) · the tool-augmented variant of CharXiv Reasoning. Models can use code execution, image processing, and other tools to analyze charts from arXiv papers. Claude Mythos Preview scores 93.2% with tools vs 86.1% without · demonstrating how tool use dramatically improves visual scientific reasoning. The gap between tool-augmented and bare performance is a key signal for agent capability.

Updated 2026-04-07

Tool use adds 7+ percentage points consistently across models. Mythos leads at 93.2%, up from 86.1% bare.

Scoring: Same accuracy metric as CharXiv Reasoning. Tool use is optional per question · the model decides when to invoke code execution vs answering directly.

Models tested
3
Top score
93.2
Claude Mythos Preview
Median
93.2
min 93.2
Top-5 spread
σ 0.0
Settled
CHARXIV REASONING (WITH TOOLS) \u00B7 TOP 30255075100#1Claude Mythos Preview93.2#2Claude Opus 4.7VERIFIED91.0#3Claude Opus 4.6VERIFIED84.7benchgecko.ai/benchmark/charxiv-reasoning-tools
Details
Category
Multimodal
Max score
100
Scoring
Same accuracy metric as CharXiv Reasoning. Tool use is optional per question · the model decides when to invoke code execution vs answering directly.
Models
3
Updated
2026-04-07
Gecko's Take

The with-tools variant is the real benchmark. Any production deployment of a research assistant will have tool access. Bare scores are academic; tool-augmented scores predict production quality.

Same category · related evaluations