gpt-oss-120b is an open-weight, 117B-parameter Mixture-of-Experts (MoE) language model from OpenAI, designed for complex reasoning, agentic workflows, and general-purpose production use cases. It activates 5.1B parameters per forward pass and is optimized...
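The relationship between total and active parameters can be made concrete with a small sketch. This is an illustrative calculation only, using the figures stated above (117B total, 5.1B active); the variable names are hypothetical, not part of any API:

```python
# Illustrative: why MoE inference cost tracks active parameters,
# not total parameter count. Figures are from the model card above.
TOTAL_PARAMS_B = 117.0   # total parameters, in billions
ACTIVE_PARAMS_B = 5.1    # parameters activated per forward pass, in billions

active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B
print(f"Active per forward pass: {active_fraction:.1%} of total weights")
```

So only about 4.4% of the weights participate in any single forward pass, which is why per-token compute is closer to that of a ~5B dense model than a 117B one.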
Tested on 27 benchmarks with a 46.9% average score. Top scores: Chatbot Arena Elo — Overall (1353.8), OTIS Mock AIME 2024-2025 (88.9%), HELM — WildBench (84.5%).
Regularly refreshed coding problems designed to avoid data contamination; new problems are added monthly to prevent memorization.
Unusual and adversarial machine learning challenges. Tests robustness of reasoning about edge cases in ML systems.
Aider's multi-language code-editing benchmark. Tests editing ability across Python, JavaScript, TypeScript, Java, C++, Go, Rust, and more.
Stanford HELM WildBench evaluation. Tests reasoning on challenging real-world tasks.
Regularly refreshed reasoning problems testing logical deduction, spatial reasoning, and analytical thinking.
Fresh data analysis tasks testing ability to interpret tables, charts, and statistical data.
Mock AIME (American Invitational Mathematics Exam) problems from OTIS. Tests mathematical competition performance.
Regularly updated math problems that test numerical reasoning, algebra, calculus, and combinatorics.
Stanford HELM evaluation of mathematical reasoning across diverse problem types.
- Type: text
- Context: 131K tokens (~66 books)
- Released: Aug 2025
- License: Open Source
- Status: Active
- Cost / Message: ~$0.000