
gpt-oss-120b

by OpenAI · Released Aug 2025

Open Source
43.7 avg score · Rank #130 · Better than 44% of all models
  • Context: 131K tokens (~66 books)
  • Input $/1M: $0.04
  • Output $/1M: $0.18
  • Type: text
  • License: Open Source
  • Benchmarks: 27 tested
About

gpt-oss-120b is an open-weight, 117B-parameter Mixture-of-Experts (MoE) language model from OpenAI designed for high-reasoning, agentic, and general-purpose production use cases. It activates 5.1B parameters per forward pass and is optimized...
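
The sparse Mixture-of-Experts design is what lets a 117B-parameter model run only about 5.1B parameters per forward pass: a router scores the experts for each token and only the top-k experts actually compute. The sketch below is a toy illustration of that routing pattern, not OpenAI's implementation; the layer sizes, expert count, and top-k value are placeholder assumptions.

```python
# Toy mixture-of-experts layer: a router picks top-k experts per token,
# so only a small fraction of total parameters is active on each forward pass.
# Illustrative only -- sizes, top-k, and routing details are NOT gpt-oss-120b's.
import torch
import torch.nn as nn

class ToyMoELayer(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)           # scores each expert per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):                                      # x: (n_tokens, d_model)
        scores = self.router(x)                                # (n_tokens, n_experts)
        weights, idx = torch.topk(scores, self.top_k, dim=-1)  # keep only the top-k experts
        weights = torch.softmax(weights, dim=-1)               # mixing weights over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                       # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot:slot+1] * expert(x[mask])
        return out

x = torch.randn(4, 512)
print(ToyMoELayer()(x).shape)  # torch.Size([4, 512]); each token touched only 2 of the 8 experts
```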

Tested on 27 benchmarks with a 46.9% average. Top scores: Chatbot Arena Elo — Overall (1353.8), OTIS Mock AIME 2024-2025 (88.9%), HELM — WildBench (84.5%).

Capabilities
  • coding: 35.3 (#108 globally)
  • reasoning: 42.3 (#62 globally)
  • math: 75.5 (#25 globally)
  • knowledge: 52.2 (#88 globally)
  • agentic: 4.7 (#33 globally)
  • language: 60.8 (#85 globally)
  • safety: 8.2 (#6 globally)
Benchmark Scores
Tested on 27 benchmarks · Ranked across 8 categories
[Score distribution chart: this model's average score marked against all 233 models.]
LiveBench — Coding: 60.2
Regularly refreshed coding problems that avoid data contamination. New problems added monthly to prevent memorization.

WeirdML: 48.2
Unusual and adversarial machine learning challenges. Tests robustness of reasoning about edge cases in ML systems.

Aider polyglot: 41.8
Multi-language code editing from Aider. Tests editing ability across Python, JavaScript, TypeScript, Java, C++, Go, Rust, and more.

HELM — WildBench: 84.5
Stanford HELM WildBench evaluation. Tests reasoning on challenging real-world tasks.

LiveBench — Reasoning: 39.2
Regularly refreshed reasoning problems testing logical deduction, spatial reasoning, and analytical thinking.

LiveBench — Data Analysis: 38.8
Fresh data analysis tasks testing ability to interpret tables, charts, and statistical data.

OTIS Mock AIME 2024-2025: 88.9
Mock AIME (American Invitational Mathematics Exam) problems from OTIS. Tests mathematical competition performance.

LiveBench — Mathematics: 68.9
Regularly updated math problems that test numerical reasoning, algebra, calculus, and combinatorics.

HELM — Omni-MATH: 68.8
Stanford HELM evaluation of mathematical reasoning across diverse problem types.
Score legend: Excellent (85+) · Good (70-85) · Average (50-70) · Below (<50)
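
The legend above defines four score tiers; a small helper like the one below can bucket any benchmark score into them. How the boundaries are handled exactly (e.g. whether 85.0 counts as Excellent or Good) is an assumption, not documented on the page.

```python
# Bucket a benchmark score into the page's legend tiers.
# Boundary handling at 50/70/85 is an assumption.
def score_tier(score: float) -> str:
    if score >= 85:
        return "Excellent"
    if score >= 70:
        return "Good"
    if score >= 50:
        return "Average"
    return "Below"

for name, score in [
    ("OTIS Mock AIME 2024-2025", 88.9),
    ("HELM — WildBench", 84.5),
    ("LiveBench — Reasoning", 39.2),
]:
    print(f"{name}: {score} -> {score_tier(score)}")
```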
Links
  • Documentation
  • Community
  • BenchGecko API
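
The listing links to a BenchGecko API but documents no endpoints here. The snippet below is a hypothetical sketch of pulling this model's scores programmatically; the base URL, path, and response fields are all assumed for illustration and should be checked against the real API documentation.

```python
# Hypothetical example of querying a benchmark API for this model's scores.
# The host, path, and response schema are ASSUMED, not taken from BenchGecko docs.
import json
import urllib.request

BASE_URL = "https://example.benchgecko.invalid/api/v1"   # placeholder host
MODEL_SLUG = "gpt-oss-120b"

def fetch_model(slug: str) -> dict:
    """Fetch a model record and parse the JSON body."""
    with urllib.request.urlopen(f"{BASE_URL}/models/{slug}") as resp:
        return json.load(resp)

if __name__ == "__main__":
    data = fetch_model(MODEL_SLUG)
    # Assumed fields: "avg_score" plus a "benchmarks" list of {"name", "score"} objects.
    print(data.get("avg_score"))
    for bench in data.get("benchmarks", []):
        print(bench["name"], bench["score"])
```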
Specifications
  • Type: text
  • Context: 131K tokens (~66 books)
  • Released: Aug 2025
  • License: Open Source
  • Status: Active
  • Cost / Message: ~$0.000
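
The "Cost / Message: ~$0.000" figure follows directly from the listed prices of $0.04 per 1M input tokens and $0.18 per 1M output tokens. The sketch below reproduces that arithmetic; the per-message token counts are illustrative assumptions, not values from the listing.

```python
# Estimate per-request cost from the listed prices:
# $0.04 per 1M input tokens, $0.18 per 1M output tokens.
INPUT_PRICE_PER_M = 0.04
OUTPUT_PRICE_PER_M = 0.18

def request_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens / 1_000_000) * INPUT_PRICE_PER_M + \
           (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_M

# A short chat turn (token counts are illustrative assumptions):
print(f"${request_cost(500, 300):.6f}")    # ~$0.000074, i.e. rounds to ~$0.000
# Filling the full 131K-token context once:
print(f"${request_cost(131_072, 0):.6f}")  # ~$0.005243
```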
Available On
  • OpenAI: $0.04 per 1M input tokens
gpt-oss-120b is an open-source text AI model by OpenAI, released in August 2025. It has an average benchmark score of 43.7. Context window: 131K tokens.