How much does GPT-4.1 cost?

GPT-4.1 costs $2.00 per million input tokens and $8.00 per million output tokens. For a typical conversation (~2,000 tokens), that's approximately $0.012 per message.

What benchmarks has GPT-4.1 been tested on?

GPT-4.1 has been evaluated on 22 benchmarks. Top scores: HELM — WildBench: 85.4, HELM — IFEval: 83.8, MATH level 5: 83.0.

Is GPT-4.1 open source?

No, GPT-4.1 is a proprietary model by OpenAI.

How does GPT-4.1 compare to Mistral Medium 3?

GPT-4.1 has an average score of 41.2 while Mistral Medium 3 scores 41.1. GPT-4.1 outperforms Mistral Medium 3 overall. GPT-4.1 costs $2.00/1M input vs Mistral Medium 3 at $0.40/1M input. See full comparison →

Home/Models/GPT-4.1

GPT-4.1

Name: GPT-4.1
Price: 2 USD
Author: OpenAI

by OpenAI · Released Apr 2025

Multimodal1M Context

41.2

avg score

Rank #140

Compare

Better than 40% of all models

Context

1.0M tokens (~524 books)

Input $/1M

$2.00

Output $/1M

$8.00

Type

multimodal

License

Proprietary

Benchmarks

22 tested

Data updated today

About

GPT-4.1 is a flagship large language model optimized for advanced instruction following, real-world software engineering, and long-context reasoning. It supports a 1 million token context window and outperforms GPT-4o and...

Tested on 22 benchmarks with 43.3% average. Top scores: HELM — WildBench (85.4%), HELM — IFEval (83.8%), MATH level 5 (83.0%).

Looking for similar performance at lower cost?
Llama 3 8B Instruct scores 41.7 (101% as good) at $0.03/1M input · 99% cheaper

Capabilities

coding

44.3

#88 globally

reasoning

25.9

#95 globally

math

34.8

#120 globally

knowledge

52.7

#86 globally

language

83.8

#37 globally

Benchmark Scores

Compare All

Tested on 22 benchmarks · Ranked across 5 categories

Score Distribution (all 233 models)

0255075100

▲ You are here

codingCompare coding →

Aider polyglot

Multi-language code editing from Aider. Tests editing ability across Python, JavaScript, TypeScript, Java, C++, Go, Rust, and more.

52.4—

SWE-Bench verified

Real-world software engineering tasks from GitHub issues. Models must diagnose bugs and write patches that pass test suites. Human-verified subset of SWE-bench.

48.5—

CadEval

Computer-aided design evaluation. Tests understanding of CAD concepts, 3D modeling, and engineering design principles.

42.0—

reasoningCompare reasoning →

HELM — WildBench

Stanford HELM WildBench evaluation. Tests reasoning on challenging real-world tasks.

85.4—

SimpleBench

Deceptively simple questions that humans find easy but AI models often get wrong. Tests common sense and reasoning gaps.

12.4—

ARC-AGI