How much does o3 cost?

o3 costs $2.00 per million input tokens and $8.00 per million output tokens. For a typical conversation (~2,000 tokens), that's approximately $0.012 per message.
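The per-message figure depends on how the ~2,000 tokens divide between input and output, which the page does not state. The sketch below shows the arithmetic at the listed prices; the function name and both token splits are hypothetical, chosen only for illustration.

```python
# Prices quoted on this page, in USD per one million tokens.
INPUT_PRICE_PER_M = 2.00
OUTPUT_PRICE_PER_M = 8.00


def message_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request at the listed o3 prices."""
    total = input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    return total / 1_000_000


# Assumed split: about one third input, two thirds output (not stated on the page).
print(round(message_cost(667, 1333), 3))

# Assumed even split of the same ~2,000 tokens, for comparison.
print(round(message_cost(1000, 1000), 3))
```

Because output tokens cost four times as much as input tokens, the same total token count can give noticeably different costs depending on the split.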

What benchmarks has o3 been tested on?

o3 has been evaluated on 33 benchmarks. Top scores: MATH level 5: 97.8, Fiction.LiveBench: 88.9, HELM — IFEval: 86.9.

Is o3 open source?

No, o3 is a proprietary model by OpenAI.

How does o3 compare to DeepSeek V3?

o3 and DeepSeek V3 both show an average score of 58.3 at the precision listed. DeepSeek V3 is rated slightly ahead overall, by a margin smaller than the rounding shown. o3 costs $2.00/1M input vs DeepSeek V3 at $0.32/1M input.

o3

by OpenAI · Released Apr 2025

Multimodal

Average score: 58.3

Rank #77

Better than 67% of all models
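The rank and the "better than" percentage appear to be consistent with each other if the 233 models named in the Score Distribution heading are taken as the total. A minimal check, assuming the percentage is simply the share of models ranked below o3 (the page does not document its formula, and the function name is hypothetical):

```python
def share_outranked(rank: int, total_models: int) -> float:
    """Fraction of models placed below the given rank (rank 1 is best)."""
    return (total_models - rank) / total_models


# Rank #77 out of 233 models, both figures taken from this page.
print(f"{share_outranked(77, 233):.0%}")
```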

Context: 200K tokens (roughly 150,000 words of English text)

Input $/1M: $2.00

Output $/1M: $8.00

Type: multimodal

License: Proprietary

Benchmarks: 33 tested

About

o3 is a well-rounded and powerful model across domains. It sets a new standard for math, science, coding, and visual reasoning tasks. It also excels at technical writing and instruction-following....

Tested on 33 benchmarks with 55.2% average. Top scores: MATH level 5 (97.8%), Fiction.LiveBench (88.9%), HELM — IFEval (86.9%).

Looking for similar performance at lower cost?
Qwen3 Next 80B A3B Thinking scores 57.5 (99% as good) at $0.10/1M input · 95% cheaper
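The "99% as good" and "95% cheaper" figures can be checked from the numbers on this page. The sketch below assumes they are a plain ratio of average scores and a plain relative difference in input price; the helper names are hypothetical.

```python
def relative_score(alternative: float, reference: float) -> float:
    """Alternative model's average score as a fraction of the reference score."""
    return alternative / reference


def input_price_saving(alternative_price: float, reference_price: float) -> float:
    """Fractional saving on input price when switching to the alternative."""
    return 1 - alternative_price / reference_price


# Qwen3 Next 80B A3B Thinking vs o3, using the scores and prices listed here.
print(f"{relative_score(57.5, 58.3):.0%} of o3's average score")
print(f"{input_price_saving(0.10, 2.00):.0%} cheaper per 1M input tokens")
```

The comparison covers input price only; the page gives no output price for the alternative model, so total cost per message may differ.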

Capabilities

coding: 56.2 (#49 globally)

reasoning: 49.3 (#48 globally)

math: 54.8 (#64 globally)

knowledge: 56.7 (#70 globally)

agentic: 23.0 (#21 globally)

language: 86.9 (#25 globally)

speed: 62.7 (#28 globally)

Benchmark Scores

Tested on 33 benchmarks · Ranked across 7 categories

[Figure: Score Distribution (all 233 models), on a 0 to 100 score axis, with o3's position marked]

coding

Aider polyglot

Multi-language code editing from Aider. Tests editing ability across Python, JavaScript, TypeScript, Java, C++, Go, Rust, and more.

Score: 81.3

CadEval

Computer-aided design evaluation. Tests understanding of CAD concepts, 3D modeling, and engineering design principles.

Score: 74.0

SWE-Bench verified

Real-world software engineering tasks from GitHub issues. Models must diagnose bugs and write patches that pass test suites. Human-verified subset of SWE-bench.

Score: 62.3

reasoning

HELM — WildBench

Stanford HELM WildBench evaluation. Tests reasoning on challenging real-world tasks.

Score: 86.1

ARC-AGI

Abstraction and Reasoning Corpus. Tests fluid intelligence through novel visual pattern recognition puzzles. Designed as a measure of progress toward general intelligence.

Score: 60.8

SimpleBench

Deceptively simple questions that humans find easy but AI models often get wrong. Tests common sense and reasoning gaps.

Score: 43.7

math