Claude 3.5 Sonnet vs gpt-oss-120b

Side by side · benchmarks, pricing, and signals you can act on.

Winner summary

gpt-oss-120b wins 7 of 13 shared benchmarks and leads in knowledge, math, and reasoning.

Category leads

coding · Claude 3.5 Sonnet
arena · Claude 3.5 Sonnet
knowledge · gpt-oss-120b
language · Claude 3.5 Sonnet
math · gpt-oss-120b
reasoning · gpt-oss-120b
safety · Claude 3.5 Sonnet

Hype vs Reality

Attention vs performance

Claude 3.5 Sonnet · #129 by perf · no signal · QUIET
gpt-oss-120b · #108 by perf · no signal · QUIET

Best value

gpt-oss-120b

Claude 3.5 Sonnet · no price listed
gpt-oss-120b · 428.3 pts/$ · $0.11/M

Vendor risk

Who is behind the model

Anthropic · $380.0B · Tier 1 · Medium risk
OpenAI · $840.0B · Tier 1 · Medium risk

Head to head

13 benchmarks · 2 models

Aider polyglot

Claude 3.5 Sonnet leads by +9.8

Aider Polyglot · measures how well AI models can edit code across multiple programming languages using the Aider coding assistant framework.

Claude 3.5 Sonnet

51.6

gpt-oss-120b

41.8

Chatbot Arena Elo · Overall

Claude 3.5 Sonnet leads by +17.5

Claude 3.5 Sonnet

1371.4

gpt-oss-120b

1353.8

GPQA diamond

gpt-oss-120b leads by +29.0

Graduate-Level Google-Proof QA (Diamond set) · expert-crafted questions in physics, biology, and chemistry that are difficult even for domain PhDs.

Claude 3.5 Sonnet

38.7

gpt-oss-120b

67.7

HELM · GPQA

gpt-oss-120b leads by +11.9

Claude 3.5 Sonnet

56.5

gpt-oss-120b

68.4

HELM · IFEval

Claude 3.5 Sonnet leads by +2.0

Claude 3.5 Sonnet

85.6

gpt-oss-120b

83.6

HELM · MMLU-Pro

gpt-oss-120b leads by +1.8

Claude 3.5 Sonnet

77.7

gpt-oss-120b

79.5

HELM · Omni-MATH

gpt-oss-120b leads by +41.2

Claude 3.5 Sonnet

27.6

gpt-oss-120b

68.8

HELM · WildBench

gpt-oss-120b leads by +5.3

Claude 3.5 Sonnet

79.2

gpt-oss-120b

84.5

Lech Mazur Writing

Claude 3.5 Sonnet leads by +3.0

Lech Mazur Writing · evaluates creative writing ability, assessing prose quality, narrative coherence, and stylistic sophistication.

Claude 3.5 Sonnet

80.3

gpt-oss-120b

77.3

OTIS Mock AIME 2024-2025

gpt-oss-120b leads by +82.4

OTIS Mock AIME 2024-2025 · simulated American Invitational Mathematics Examination problems testing advanced problem-solving skills.

Claude 3.5 Sonnet

6.4

gpt-oss-120b

88.9

Fortress

Claude 3.5 Sonnet leads by +4.7

Claude 3.5 Sonnet

13.0

gpt-oss-120b

8.2

SimpleBench

Claude 3.5 Sonnet leads by +6.5

SimpleBench · tests fundamental reasoning capabilities with straightforward problems designed to expose gaps in basic logical and spatial thinking.

Claude 3.5 Sonnet

13.0

gpt-oss-120b

6.5

WeirdML

gpt-oss-120b leads by +17.2

WeirdML · tests models on unusual and adversarial machine learning tasks that require creative problem-solving beyond standard patterns.

Claude 3.5 Sonnet

31.0

gpt-oss-120b

48.2

Full benchmark table

Benchmark	Claude 3.5 Sonnet	gpt-oss-120b
Aider Polyglot	51.6	41.8
Chatbot Arena Elo · Overall	1371.4	1353.8
GPQA Diamond	38.7	67.7
HELM · GPQA	56.5	68.4
HELM · IFEval	85.6	83.6
HELM · MMLU-Pro	77.7	79.5
HELM · Omni-MATH	27.6	68.8
HELM · WildBench	79.2	84.5
Lech Mazur Writing	80.3	77.3
OTIS Mock AIME 2024-2025	6.4	88.9
Fortress	13.0	8.2
SimpleBench	13.0	6.5
WeirdML	31.0	48.2
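
The win count and per-benchmark leads can be recomputed from the table above. The following is a minimal Python sketch, assuming the higher score wins on every benchmark (which matches how the page assigns leads) and using the rounded scores as displayed. The names `SCORES`, `MODELS`, and `tally` are illustrative and not part of the site. Because the inputs are already rounded, a recomputed lead can differ by 0.1 from the one shown (for example 17.6 against the displayed +17.5 for Chatbot Arena Elo).

```python
# Scores copied from the benchmark table: (Claude 3.5 Sonnet, gpt-oss-120b).
MODELS = ("Claude 3.5 Sonnet", "gpt-oss-120b")

SCORES = {
    "Aider Polyglot": (51.6, 41.8),
    "Chatbot Arena Elo · Overall": (1371.4, 1353.8),
    "GPQA Diamond": (38.7, 67.7),
    "HELM · GPQA": (56.5, 68.4),
    "HELM · IFEval": (85.6, 83.6),
    "HELM · MMLU-Pro": (77.7, 79.5),
    "HELM · Omni-MATH": (27.6, 68.8),
    "HELM · WildBench": (79.2, 84.5),
    "Lech Mazur Writing": (80.3, 77.3),
    "OTIS Mock AIME 2024-2025": (6.4, 88.9),
    "Fortress": (13.0, 8.2),
    "SimpleBench": (13.0, 6.5),
    "WeirdML": (31.0, 48.2),
}


def tally(scores):
    """Return (wins per model, {benchmark: (leader, lead in points)}).

    Assumption: the higher score leads; the data contains no ties.
    """
    wins = {model: 0 for model in MODELS}
    margins = {}
    for benchmark, (claude, gpt_oss) in scores.items():
        leader = MODELS[0] if claude > gpt_oss else MODELS[1]
        wins[leader] += 1
        margins[benchmark] = (leader, round(abs(claude - gpt_oss), 1))
    return wins, margins


wins, margins = tally(SCORES)
for model, count in wins.items():
    print(f"{model}: {count} of {len(SCORES)}")
```

With the scores as listed, the tally gives gpt-oss-120b 7 wins and Claude 3.5 Sonnet 6, consistent with the winner summary.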

Pricing · per 1M tokens · projected $/mo at 10M tokens

Model	Input	Output	Context	Projected $/mo
Claude 3.5 Sonnet	—	—	—	—
gpt-oss-120b	$0.04	$0.18	131K tokens (~66 books)	$0.74
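
The $0.11/M figure shown for gpt-oss-120b is consistent with a simple average of the listed input and output prices, although the page does not say how it is derived. The sketch below, in Python, shows that average and a generic monthly cost calculation. The function names and the 5M/5M token split are hypothetical. The page does not state the input/output mix behind its $0.74 projection, so this sketch does not attempt to reproduce that number.

```python
# Listed gpt-oss-120b prices in USD per 1M tokens.
INPUT_PER_M = 0.04
OUTPUT_PER_M = 0.18


def blended_price(input_per_m, output_per_m):
    """Simple average of input and output price (assumed blending rule)."""
    return (input_per_m + output_per_m) / 2


def monthly_cost(input_tokens_m, output_tokens_m,
                 input_per_m=INPUT_PER_M, output_per_m=OUTPUT_PER_M):
    """Cost in USD for a given number of millions of input and output tokens."""
    return input_tokens_m * input_per_m + output_tokens_m * output_per_m


print(f"blended: ${blended_price(INPUT_PER_M, OUTPUT_PER_M):.2f}/M")
# Hypothetical workload: 10M tokens split evenly between input and output.
print(f"5M in + 5M out: ${monthly_cost(5, 5):.2f}")
```

A workload weighted toward input tokens costs less than the even split, since input is priced at less than a quarter of output.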