gpt-oss-120b
Open source, from OpenAI · Released 2025-08-05
Average score: 46.9
Input price: $0.04/1M tokens
Output price: $0.19/1M tokens
Context window: 131K tokens (~66 books)
Type: text
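As a rough illustration of what the listed prices mean per request, the sketch below estimates the cost of a single call and checks it against the context window. The function names are hypothetical, "131K" is taken as 131,000 tokens (the exact limit is not stated here), and counting output tokens against the window is an assumption.

```python
# Prices as listed on this page (USD per 1M tokens).
INPUT_PRICE_PER_M = 0.04
OUTPUT_PRICE_PER_M = 0.19

# Assumption: "131K" is read as 131,000 tokens; the exact limit is not given here.
CONTEXT_WINDOW_TOKENS = 131_000


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request in USD from per-million-token prices."""
    total = input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    return total / 1_000_000


def fits_context(input_tokens: int, output_tokens: int) -> bool:
    """Assumption: input and output tokens share the same context window."""
    return input_tokens + output_tokens <= CONTEXT_WINDOW_TOKENS


if __name__ == "__main__":
    cost = estimate_cost_usd(input_tokens=100_000, output_tokens=2_000)
    print(f"Estimated cost: ${cost:.5f}")  # Estimated cost: $0.00438
    print(fits_context(100_000, 2_000))    # True
```

At these rates, even a request that nearly fills the context window costs well under one cent.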
Tested on 27 benchmarks, with an average score of 46.9. Top scores: Chatbot Arena Elo — Overall (1353.8, an Elo rating rather than a percentage), OTIS Mock AIME 2024-2025 (88.9%), HELM — WildBench (84.5%).
Benchmark scores
| Benchmark | Category | Score |
|---|---|---|
| Chatbot Arena Elo — Overall | arena | 1353.8 |
| OTIS Mock AIME 2024-2025 | math | 88.9 |
| HELM — WildBench | reasoning | 84.5 |
| HELM — IFEval | language | 83.6 |
| HELM — MMLU-Pro | knowledge | 79.5 |
| Lech Mazur Writing | knowledge | 77.3 |
| LiveBench — Mathematics | math | 68.9 |
| HELM — Omni-MATH | math | 68.8 |
| HELM — GPQA | knowledge | 68.4 |
| GPQA diamond | knowledge | 67.7 |
| LiveBench — Coding | coding | 60.2 |
| LiveBench — If | language | 50.3 |
| LiveBench — Language | language | 48.6 |
| WeirdML | coding | 48.2 |
| LiveBench — Overall | knowledge | 46.1 |
| Fiction.LiveBench | knowledge | 44.4 |
| Aider polyglot | coding | 41.8 |
| LiveBench — Reasoning | reasoning | 39.2 |
| LiveBench — Data Analysis | reasoning | 38.8 |
| SWE-Bench Verified (Bash Only) | coding | 26.0 |
| Chess Puzzles | knowledge | 20.0 |
| Terminal Bench | coding | 18.7 |
| LiveBench — Agentic Coding | coding | 16.7 |
| SimpleQA Verified | knowledge | 13.9 |
| Fortress | safety | 8.2 |
| SimpleBench | reasoning | 6.5 |
| APEX-Agents | agentic | 4.7 |
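The page does not say how the 46.9 average is computed. A quick check with the table values suggests it is the plain mean of the 26 percentage-style scores, with the Chatbot Arena Elo rating left out. The sketch below is only that check; the variable and function names are hypothetical.

```python
# Scores copied from the benchmark table above (benchmark name -> score).
scores = {
    "Chatbot Arena Elo — Overall": 1353.8,
    "OTIS Mock AIME 2024-2025": 88.9,
    "HELM — WildBench": 84.5,
    "HELM — IFEval": 83.6,
    "HELM — MMLU-Pro": 79.5,
    "Lech Mazur Writing": 77.3,
    "LiveBench — Mathematics": 68.9,
    "HELM — Omni-MATH": 68.8,
    "HELM — GPQA": 68.4,
    "GPQA diamond": 67.7,
    "LiveBench — Coding": 60.2,
    "LiveBench — If": 50.3,
    "LiveBench — Language": 48.6,
    "WeirdML": 48.2,
    "LiveBench — Overall": 46.1,
    "Fiction.LiveBench": 44.4,
    "Aider polyglot": 41.8,
    "LiveBench — Reasoning": 39.2,
    "LiveBench — Data Analysis": 38.8,
    "SWE-Bench Verified (Bash Only)": 26.0,
    "Chess Puzzles": 20.0,
    "Terminal Bench": 18.7,
    "LiveBench — Agentic Coding": 16.7,
    "SimpleQA Verified": 13.9,
    "Fortress": 8.2,
    "SimpleBench": 6.5,
    "APEX-Agents": 4.7,
}

# The Elo rating is on a different scale from the other scores.
ELO_ROW = "Chatbot Arena Elo — Overall"


def mean_score(table: dict[str, float], exclude: frozenset[str] = frozenset()) -> float:
    """Arithmetic mean of the scores, skipping any benchmark named in `exclude`."""
    values = [score for name, score in table.items() if name not in exclude]
    return sum(values) / len(values)


if __name__ == "__main__":
    print(len(scores))                                           # 27
    print(round(mean_score(scores, frozenset({ELO_ROW})), 1))    # 46.9
    print(round(mean_score(scores), 1))                          # 95.3
```

Including the Elo row would push the mean to about 95.3, so the reported 46.9 is consistent only with the Elo rating being excluded.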