
Claude 3.5 Sonnet vs gpt-oss-120b

Side by side · benchmarks, pricing, and signals you can act on.

Winner summary

gpt-oss-120b wins 7 of 13 shared benchmarks. Leads in knowledge · math · reasoning.

Category leads
coding · Claude 3.5 Sonnet
arena · Claude 3.5 Sonnet
knowledge · gpt-oss-120b
language · Claude 3.5 Sonnet
math · gpt-oss-120b
reasoning · gpt-oss-120b
safety · Claude 3.5 Sonnet
Hype vs Reality
Claude 3.5 Sonnet
#127 by perf·no signal
QUIET
gpt-oss-120b
#106 by perf·no signal
QUIET
Best value
Claude 3.5 Sonnet
no price
gpt-oss-120b
409.6 pts/$
$0.11/M
Vendor risk
Anthropic
$380.0B·Tier 1
Medium risk
OpenAI
$840.0B·Tier 1
Medium risk
Head to head
Claude 3.5 Sonnet · gpt-oss-120b
Aider polyglot
Claude 3.5 Sonnet leads by +9.8
Aider Polyglot · measures how well AI models can edit code across multiple programming languages using the Aider coding assistant framework.
Claude 3.5 Sonnet
51.6
gpt-oss-120b
41.8
Chatbot Arena Elo · Overall
Claude 3.5 Sonnet leads by +17.5
Claude 3.5 Sonnet
1371.4
gpt-oss-120b
1353.8
GPQA diamond
gpt-oss-120b leads by +29.0
Graduate-Level Google-Proof QA (Diamond set) · expert-crafted questions in physics, biology, and chemistry that are difficult even for domain PhDs.
Claude 3.5 Sonnet
38.7
gpt-oss-120b
67.7
HELM · GPQA
gpt-oss-120b leads by +11.9
Claude 3.5 Sonnet
56.5
gpt-oss-120b
68.4
HELM · IFEval
Claude 3.5 Sonnet leads by +2.0
Claude 3.5 Sonnet
85.6
gpt-oss-120b
83.6
HELM · MMLU-Pro
gpt-oss-120b leads by +1.8
Claude 3.5 Sonnet
77.7
gpt-oss-120b
79.5
HELM · Omni-MATH
gpt-oss-120b leads by +41.2
Claude 3.5 Sonnet
27.6
gpt-oss-120b
68.8
HELM · WildBench
gpt-oss-120b leads by +5.3
Claude 3.5 Sonnet
79.2
gpt-oss-120b
84.5
Lech Mazur Writing
Claude 3.5 Sonnet leads by +3.0
Lech Mazur Writing · evaluates creative writing ability, assessing prose quality, narrative coherence, and stylistic sophistication.
Claude 3.5 Sonnet
80.3
gpt-oss-120b
77.3
OTIS Mock AIME 2024-2025
gpt-oss-120b leads by +82.4
OTIS Mock AIME 2024–2025 · simulated American Invitational Mathematics Examination problems testing advanced problem-solving skills.
Claude 3.5 Sonnet
6.4
gpt-oss-120b
88.9
Fortress
Claude 3.5 Sonnet leads by +4.7
Claude 3.5 Sonnet
13.0
gpt-oss-120b
8.2
SimpleBench
Claude 3.5 Sonnet leads by +6.5
SimpleBench · tests fundamental reasoning capabilities with straightforward problems designed to expose gaps in basic logical and spatial thinking.
Claude 3.5 Sonnet
13.0
gpt-oss-120b
6.5
WeirdML
gpt-oss-120b leads by +17.2
WeirdML · tests models on unusual and adversarial machine learning tasks that require creative problem-solving beyond standard patterns.
Claude 3.5 Sonnet
31.0
gpt-oss-120b
48.2
Full benchmark table
Benchmark · Claude 3.5 Sonnet · gpt-oss-120b
Aider polyglot · 51.6 · 41.8
Chatbot Arena Elo · Overall · 1371.4 · 1353.8
GPQA diamond · 38.7 · 67.7
HELM · GPQA · 56.5 · 68.4
HELM · IFEval · 85.6 · 83.6
HELM · MMLU-Pro · 77.7 · 79.5
HELM · Omni-MATH · 27.6 · 68.8
HELM · WildBench · 79.2 · 84.5
Lech Mazur Writing · 80.3 · 77.3
OTIS Mock AIME 2024-2025 · 6.4 · 88.9
Fortress · 13.0 · 8.2
SimpleBench · 13.0 · 6.5
WeirdML · 31.0 · 48.2
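The "7 of 13" tally in the winner summary can be checked against the scores in the table. The sketch below is only an illustration: it assumes a higher score is better on every benchmark (which matches how the page assigns each lead), and the names `SCORES` and `tally_wins` are hypothetical, not part of the site.

```python
# Scores copied from the benchmark table: (Claude 3.5 Sonnet, gpt-oss-120b).
SCORES = {
    "Aider polyglot": (51.6, 41.8),
    "Chatbot Arena Elo · Overall": (1371.4, 1353.8),
    "GPQA diamond": (38.7, 67.7),
    "HELM · GPQA": (56.5, 68.4),
    "HELM · IFEval": (85.6, 83.6),
    "HELM · MMLU-Pro": (77.7, 79.5),
    "HELM · Omni-MATH": (27.6, 68.8),
    "HELM · WildBench": (79.2, 84.5),
    "Lech Mazur Writing": (80.3, 77.3),
    "OTIS Mock AIME 2024-2025": (6.4, 88.9),
    "Fortress": (13.0, 8.2),
    "SimpleBench": (13.0, 6.5),
    "WeirdML": (31.0, 48.2),
}


def tally_wins(scores):
    """Count benchmark wins per model, assuming higher is better everywhere."""
    wins = {"Claude 3.5 Sonnet": 0, "gpt-oss-120b": 0, "tie": 0}
    for claude, oss in scores.values():
        if claude > oss:
            wins["Claude 3.5 Sonnet"] += 1
        elif oss > claude:
            wins["gpt-oss-120b"] += 1
        else:
            wins["tie"] += 1
    return wins


result = tally_wins(SCORES)
print(result)
```

Under that assumption the count comes out to 7 wins for gpt-oss-120b and 6 for Claude 3.5 Sonnet, in line with the summary.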
Pricing · per 1M tokens · projected $/mo at 10M tokens
Model · Input · Output · Context · Projected $/mo
Claude 3.5 Sonnet · no price
gpt-oss-120b · $0.04 · $0.19 · 131K tokens (~66 books) · $0.77
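The page does not say how it splits the 10M monthly tokens between input and output, so the projected figure cannot be reproduced with certainty. As a rough sketch, the function below blends the two per-1M prices over a chosen volume. The function name and the `input_share` parameter are hypothetical, and the 75/25 default is an assumption rather than the site's documented formula.

```python
def projected_monthly_cost(input_price_per_m, output_price_per_m,
                           total_tokens_m=10.0, input_share=0.75):
    """Blend per-1M-token prices over a monthly token volume.

    input_share is a hypothetical fraction of tokens billed as input;
    the page does not state the split it uses.
    """
    if not 0.0 <= input_share <= 1.0:
        raise ValueError("input_share must be between 0 and 1")
    input_tokens_m = total_tokens_m * input_share
    output_tokens_m = total_tokens_m - input_tokens_m
    return (input_tokens_m * input_price_per_m
            + output_tokens_m * output_price_per_m)


# gpt-oss-120b prices from the pricing table: $0.04 input, $0.19 output per 1M.
cost_75_25 = projected_monthly_cost(0.04, 0.19)
cost_50_50 = projected_monthly_cost(0.04, 0.19, input_share=0.5)
print(f"75/25 split: ${cost_75_25:.3f}")
print(f"50/50 split: ${cost_50_50:.3f}")
```

With these prices an assumed 75/25 split gives roughly $0.775 per month, close to the listed $0.77, while an even split gives $1.15. The projection is sensitive to the input/output mix, so it is worth substituting the ratio from your own workload.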