gpt-oss-120b is an open-weight, 117B-parameter Mixture-of-Experts (MoE) language model from OpenAI, designed for complex reasoning, agentic workflows, and general-purpose production use cases. It activates 5.1B parameters per forward pass and is optimized...
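The relationship between total and active parameters can be made concrete with a small sketch. This is an illustrative calculation only, using the figures stated above (117B total, 5.1B active); the variable names are hypothetical, not part of any API:

```python
# Illustrative: why MoE inference cost tracks active parameters,
# not total parameter count. Figures are from the model card above.
TOTAL_PARAMS_B = 117.0   # total parameters, in billions
ACTIVE_PARAMS_B = 5.1    # parameters activated per forward pass, in billions

active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B
print(f"Active per forward pass: {active_fraction:.1%} of total weights")
```

So only about 4.4% of the weights participate in any single forward pass, which is why per-token compute is closer to that of a ~5B dense model than a 117B one.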
Tested on 27 benchmarks with a 46.9% average score. Top scores: Chatbot Arena Elo — Overall (1353.8), OTIS Mock AIME 2024-2025 (88.9%), HELM — WildBench (84.5%).
Regularly refreshed coding problems designed to avoid data contamination; new problems are added monthly to prevent memorization.
Unusual and adversarial machine learning challenges. Tests robustness of reasoning about edge cases in ML systems.
Aider's multi-language code-editing benchmark. Tests editing ability across Python, JavaScript, TypeScript, Java, C++, Go, Rust, and more.
Stanford HELM WildBench evaluation. Tests reasoning on challenging real-world tasks.
Regularly refreshed reasoning problems testing logical deduction, spatial reasoning, and analytical thinking.
Fresh data analysis tasks testing ability to interpret tables, charts, and statistical data.
Mock AIME (American Invitational Mathematics Exam) problems from OTIS. Tests mathematical competition performance.
Regularly updated math problems that test numerical reasoning, algebra, calculus, and combinatorics.
Stanford HELM evaluation of mathematical reasoning across diverse problem types.
- Type: text
- Context: 131K tokens (~66 books)
- Released: Aug 2025
- License: Open Source
- Status: Active
- Cost / Message: ~$0.000