o3 Mini vs gpt-oss-120b
Side by side · benchmarks, pricing, and signals you can act on.
Winner summary
gpt-oss-120b wins 5 of 9 shared benchmarks, leading in arena, knowledge, and math.
Category leads
coding · o3 Mini
arena · gpt-oss-120b
knowledge · gpt-oss-120b
math · gpt-oss-120b
reasoning · o3 Mini
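The 5-of-9 tally can be recomputed directly from the per-benchmark scores listed further down this page. A minimal sketch (scores copied from the head-to-head section below):

```python
# Recomputing the winner summary from this page's shared-benchmark scores.
scores = {  # benchmark: (o3 Mini, gpt-oss-120b)
    "Aider Polyglot": (60.4, 41.8),
    "Chatbot Arena Elo": (1347.5, 1353.8),
    "Chess Puzzles": (17.0, 20.0),
    "Fiction.LiveBench": (50.0, 44.4),
    "GPQA Diamond": (69.4, 67.7),
    "Lech Mazur Writing": (61.7, 77.3),
    "OTIS Mock AIME": (76.9, 88.9),
    "SimpleBench": (7.4, 6.5),
    "WeirdML": (43.7, 48.2),
}

# Count benchmarks where gpt-oss-120b's score beats o3 Mini's.
wins = sum(1 for a, b in scores.values() if b > a)
print(f"gpt-oss-120b wins {wins}/{len(scores)}")  # -> gpt-oss-120b wins 5/9
```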
Hype vs Reality
Attention vs performance
o3 Mini
#149 by perf · no signal
gpt-oss-120b
#106 by perf · no signal
Best value
gpt-oss-120b
29.3x better value than o3 Mini
o3 Mini
14.0 pts/$
$2.75/M
gpt-oss-120b
409.6 pts/$
$0.11/M
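The value figures above appear to follow from a simple blended price. A sketch under the assumption that the blended rate is the 1:1 average of input and output prices (the site's exact formula is not documented):

```python
# Assumed derivation of the "Best value" figures on this page;
# the 1:1 input/output average is an inference, not stated by the page.

def blended_price(input_per_m: float, output_per_m: float) -> float:
    """1:1 average of input and output price per 1M tokens (assumption)."""
    return (input_per_m + output_per_m) / 2

# o3 Mini: $1.10 in / $4.40 out -> $2.75/M, matching the displayed price.
print(round(blended_price(1.10, 4.40), 2))  # -> 2.75

# Value ratio straight from the displayed pts/$ figures:
print(f"{409.6 / 14.0:.1f}x")  # -> 29.3x
```

The same average applied to gpt-oss-120b's $0.04/$0.19 rates gives ≈$0.11/M, consistent with the page.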
Vendor risk
Who is behind the model
o3 Mini · OpenAI · $840.0B · Tier 1
gpt-oss-120b · OpenAI · $840.0B · Tier 1
Head to head
9 benchmarks · 2 models
o3 Mini · gpt-oss-120b
Aider polyglot
o3 Mini leads by +18.6
Aider Polyglot · measures how well AI models can edit code across multiple programming languages using the Aider coding assistant framework.
o3 Mini
60.4
gpt-oss-120b
41.8
Chatbot Arena Elo · Overall
gpt-oss-120b leads by +6.3
Chatbot Arena Elo · crowdsourced Elo ratings from blind, head-to-head human preference votes between model responses.
o3 Mini
1347.5
gpt-oss-120b
1353.8
Chess Puzzles
gpt-oss-120b leads by +3.0
Chess Puzzles · tests strategic and tactical reasoning by having models solve chess puzzle positions, evaluating lookahead and pattern recognition abilities.
o3 Mini
17.0
gpt-oss-120b
20.0
Fiction.LiveBench
o3 Mini leads by +5.6
Fiction.LiveBench · a continuously updated benchmark using recently published fiction to test reading comprehension and reasoning, preventing data contamination.
o3 Mini
50.0
gpt-oss-120b
44.4
GPQA diamond
o3 Mini leads by +1.7
Graduate-Level Google-Proof QA (Diamond set) · expert-crafted questions in physics, biology, and chemistry that are difficult even for domain PhDs.
o3 Mini
69.4
gpt-oss-120b
67.7
Lech Mazur Writing
gpt-oss-120b leads by +15.6
Lech Mazur Writing · evaluates creative writing ability, assessing prose quality, narrative coherence, and stylistic sophistication.
o3 Mini
61.7
gpt-oss-120b
77.3
OTIS Mock AIME 2024–2025
gpt-oss-120b leads by +12.0
OTIS Mock AIME 2024–2025 · simulated American Invitational Mathematics Examination problems testing advanced problem-solving skills.
o3 Mini
76.9
gpt-oss-120b
88.9
SimpleBench
o3 Mini leads by +0.9
SimpleBench · tests fundamental reasoning capabilities with straightforward problems designed to expose gaps in basic logical and spatial thinking.
o3 Mini
7.4
gpt-oss-120b
6.5
WeirdML
gpt-oss-120b leads by +4.5
WeirdML · tests models on unusual and adversarial machine learning tasks that require creative problem-solving beyond standard patterns.
o3 Mini
43.7
gpt-oss-120b
48.2
Full benchmark table
| Benchmark | o3 Mini | gpt-oss-120b |
|---|---|---|
| Aider Polyglot | 60.4 | 41.8 |
| Chatbot Arena Elo · Overall | 1347.5 | 1353.8 |
| Chess Puzzles | 17.0 | 20.0 |
| Fiction.LiveBench | 50.0 | 44.4 |
| GPQA Diamond | 69.4 | 67.7 |
| Lech Mazur Writing | 61.7 | 77.3 |
| OTIS Mock AIME 2024–2025 | 76.9 | 88.9 |
| SimpleBench | 7.4 | 6.5 |
| WeirdML | 43.7 | 48.2 |
Pricing · per 1M tokens · projected $/mo at 10M tokens
| Model | Input | Output | Context | Projected $/mo |
|---|---|---|---|---|
| o3 Mini | $1.10 | $4.40 | 200K tokens (~100 books) | $19.25 |
| gpt-oss-120b | $0.04 | $0.19 | 131K tokens (~66 books) | $0.77 |
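The projected monthly figures match a 10M-token month split 3:1 between input and output tokens. A sketch of that calculation (the 3:1 split is an assumption inferred from the numbers, not something the page states):

```python
# "Projected $/mo" at 10M tokens, assuming a 3:1 input:output token split.

def monthly_cost(input_per_m: float, output_per_m: float,
                 total_m: float = 10, input_share: float = 0.75) -> float:
    in_m = total_m * input_share          # millions of input tokens
    out_m = total_m * (1 - input_share)   # millions of output tokens
    return in_m * input_per_m + out_m * output_per_m

print(round(monthly_cost(1.10, 4.40), 2))  # o3 Mini      -> 19.25
print(round(monthly_cost(0.04, 0.19), 3))  # gpt-oss-120b -> 0.775 (shown as $0.77)
```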