o1 vs Claude 3.7 Sonnet
Side by side · benchmarks, pricing, and signals you can act on.
Winner summary
o1 wins 6 of the 12 shared benchmarks outright (5 losses, 1 tie) and holds the category leads in reasoning and math.
Category leads
- Coding: Claude 3.7 Sonnet
- Reasoning: o1
- Knowledge: Claude 3.7 Sonnet
- Math: o1
Hype vs Reality
Attention vs performance:
- o1: #59 by performance · no signal
- Claude 3.7 Sonnet: #103 by performance · no signal
Best value
Claude 3.7 Sonnet · 3.5x better value than o1
- o1: 1.5 pts/$ · $37.50/M tokens
- Claude 3.7 Sonnet: 5.3 pts/$ · $9.00/M tokens
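For readers who want the arithmetic behind these figures, here is a minimal sketch of how a points-per-dollar value score can be computed. The blended $/M figures match the simple mean of each model's input and output prices, but the exact aggregate behind "pts" is not stated on the page, so the `aggregate_score` used below is a placeholder assumption rather than the site's actual formula.

```python
# Rough sketch of how a points-per-dollar value figure can be derived.
# Assumptions (not stated by the source): the $/M figure is the simple mean of
# input and output prices per 1M tokens, and "pts" is some aggregate benchmark
# score; the site may weight benchmarks or token prices differently.

def blended_price(input_per_m: float, output_per_m: float) -> float:
    """Simple mean of input and output price per 1M tokens."""
    return (input_per_m + output_per_m) / 2

def points_per_dollar(aggregate_score: float, blended: float) -> float:
    """Benchmark points bought per dollar of blended token price."""
    return aggregate_score / blended

# The blended prices reproduce the $/M figures shown above:
print(blended_price(15.00, 60.00))  # 37.5 -> o1's $37.50/M
print(blended_price(3.00, 15.00))   # 9.0  -> Claude 3.7 Sonnet's $9.00/M

# With a hypothetical aggregate score of 56, o1 lands near the 1.5 pts/$ shown:
print(round(points_per_dollar(56.0, 37.5), 1))  # 1.5
```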
Vendor risk
Who is behind the model:
- OpenAI (o1): $840.0B · Tier 1
- Anthropic (Claude 3.7 Sonnet): $380.0B · Tier 1
Head to head
12 benchmarks · 2 models
Aider polyglot · Claude 3.7 Sonnet leads by +3.2
Aider Polyglot · measures how well AI models can edit code across multiple programming languages using the Aider coding assistant framework.
o1: 61.7 · Claude 3.7 Sonnet: 64.9

ARC-AGI · o1 leads by +2.1
ARC-AGI · the original Abstraction and Reasoning Corpus, testing whether AI can solve novel visual pattern recognition tasks without memorization.
o1: 30.7 · Claude 3.7 Sonnet: 28.6

CadEval · o1 leads by +2.0
CadEval · evaluates the ability to generate and reason about Computer-Aided Design code, testing spatial reasoning and engineering knowledge.
o1: 56.0 · Claude 3.7 Sonnet: 54.0

Fiction.LiveBench · tied
Fiction.LiveBench · a continuously updated benchmark using recently published fiction to test reading comprehension and reasoning, preventing data contamination.
o1: 83.3 · Claude 3.7 Sonnet: 83.3

FrontierMath-2025-02-28-Private · o1 leads by +5.2
FrontierMath (Feb 2025) · original research-level math problems created by mathematicians, testing capabilities at the boundary of current AI mathematical reasoning.
o1: 9.3 · Claude 3.7 Sonnet: 4.1

GeoBench · o1 leads by +12.0
GeoBench · tests geographic knowledge and spatial reasoning across countries, landmarks, coordinates, and geopolitical understanding.
o1: 80.0 · Claude 3.7 Sonnet: 68.0

GPQA diamond · Claude 3.7 Sonnet leads by +4.0
Graduate-Level Google-Proof QA (Diamond set) · expert-crafted questions in physics, biology, and chemistry that are difficult even for domain PhDs.
o1: 69.0 · Claude 3.7 Sonnet: 73.0

Lech Mazur Writing · Claude 3.7 Sonnet leads by +10.9
Lech Mazur Writing · evaluates creative writing ability, assessing prose quality, narrative coherence, and stylistic sophistication.
o1: 70.2 · Claude 3.7 Sonnet: 81.1

MATH level 5 · o1 leads by +3.5
MATH Level 5 · the hardest tier of the MATH benchmark, featuring competition-level problems from AMC, AIME, and Olympiad-style mathematics.
o1: 94.7 · Claude 3.7 Sonnet: 91.2

OTIS Mock AIME 2024-2025 · o1 leads by +15.6
OTIS Mock AIME 2024-2025 · simulated American Invitational Mathematics Examination problems testing advanced problem-solving skills.
o1: 73.3 · Claude 3.7 Sonnet: 57.7

SimpleBench · Claude 3.7 Sonnet leads by +7.6
SimpleBench · tests fundamental reasoning capabilities with straightforward problems designed to expose gaps in basic logical and spatial thinking.
o1: 28.1 · Claude 3.7 Sonnet: 35.7

VPCT · Claude 3.7 Sonnet leads by +3.0
VPCT (Visual Pattern Completion Test) · tests visual reasoning and pattern recognition by having models complete visual sequences and transformations.
o1: 5.5 · Claude 3.7 Sonnet: 8.5
Full benchmark table
| Benchmark | o1 | Claude 3.7 Sonnet |
|---|---|---|
| Aider polyglot | 61.7 | 64.9 |
| ARC-AGI | 30.7 | 28.6 |
| CadEval | 56.0 | 54.0 |
| Fiction.LiveBench | 83.3 | 83.3 |
| FrontierMath-2025-02-28-Private | 9.3 | 4.1 |
| GeoBench | 80.0 | 68.0 |
| GPQA diamond | 69.0 | 73.0 |
| Lech Mazur Writing | 70.2 | 81.1 |
| MATH level 5 | 94.7 | 91.2 |
| OTIS Mock AIME 2024-2025 | 73.3 | 57.7 |
| SimpleBench | 28.1 | 35.7 |
| VPCT | 5.5 | 8.5 |
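As a quick cross-check against the winner summary, the sketch below tallies head-to-head wins directly from the table; it assumes a tie (Fiction.LiveBench) is credited to neither model.

```python
# Tally head-to-head wins from the benchmark table above.
# Assumption: a tie counts for neither model.

scores = {  # benchmark: (o1, Claude 3.7 Sonnet)
    "Aider polyglot": (61.7, 64.9),
    "ARC-AGI": (30.7, 28.6),
    "CadEval": (56.0, 54.0),
    "Fiction.LiveBench": (83.3, 83.3),
    "FrontierMath-2025-02-28-Private": (9.3, 4.1),
    "GeoBench": (80.0, 68.0),
    "GPQA diamond": (69.0, 73.0),
    "Lech Mazur Writing": (70.2, 81.1),
    "MATH level 5": (94.7, 91.2),
    "OTIS Mock AIME 2024-2025": (73.3, 57.7),
    "SimpleBench": (28.1, 35.7),
    "VPCT": (5.5, 8.5),
}

o1_wins = sum(a > b for a, b in scores.values())
claude_wins = sum(b > a for a, b in scores.values())
ties = sum(a == b for a, b in scores.values())
print(o1_wins, claude_wins, ties)  # 6 5 1
```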
Pricing · per 1M tokens · projected $/mo at 10M tokens
| Model | Input | Output | Context | Projected $/mo |
|---|---|---|---|---|
| o1 | $15.00 | $60.00 | 200K tokens (~100 books) | $262.50 |
| Claude 3.7 Sonnet | $3.00 | $15.00 | 200K tokens (~100 books) | $60.00 |
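The projected monthly figures are consistent with a 10M-token month split 75% input / 25% output; that split is an inference rather than something the table states. A minimal sketch under that assumption:

```python
# Minimal sketch of the monthly-cost projection above.
# Assumption (inferred, not stated by the source): 10M tokens/month split
# 75% input / 25% output, which reproduces both projected figures exactly.

def projected_monthly_cost(input_per_m: float, output_per_m: float,
                           tokens_m: float = 10.0, input_share: float = 0.75) -> float:
    """Monthly cost in dollars for tokens_m million tokens at the given split."""
    input_tokens_m = tokens_m * input_share
    output_tokens_m = tokens_m * (1 - input_share)
    return input_tokens_m * input_per_m + output_tokens_m * output_per_m

print(projected_monthly_cost(15.00, 60.00))  # 262.5 -> o1
print(projected_monthly_cost(3.00, 15.00))   # 60.0  -> Claude 3.7 Sonnet
```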