
GPT-4.1

by OpenAI · Released Apr 2025

Multimodal · 1M Context
Avg score: 41.2 · Rank #140
Better than 40% of all models
Context: 1.0M tokens (~524 books)
Input $/1M: $2.00
Output $/1M: $8.00
Type: multimodal
License: Proprietary
Benchmarks: 22 tested
About

GPT-4.1 is a flagship large language model optimized for advanced instruction following, real-world software engineering, and long-context reasoning. It supports a 1 million token context window and outperforms GPT-4o and...

Tested on 22 benchmarks with a 43.3% average. Top scores: HELM — WildBench (85.4%), HELM — IFEval (83.8%), MATH level 5 (83.0%).
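For orientation, here is a minimal sketch of querying the model through the OpenAI Python SDK. The prompt, system message, and printed field are illustrative, not from this page; the SDK expects an OPENAI_API_KEY in the environment.

```python
# Minimal sketch: one chat request to GPT-4.1 via the OpenAI Python SDK.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4.1",  # model ID as listed on this page
    messages=[
        {"role": "system", "content": "You are a careful coding assistant."},
        {"role": "user", "content": "Explain what SWE-bench Verified measures."},
    ],
)
print(response.choices[0].message.content)
```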

Looking for similar performance at lower cost?
Llama 3 8B Instruct scores 41.7 (101% as good) at $0.03/1M input · 99% cheaper
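On the 1M-token context window noted above: a quick way to sanity-check whether a document fits is to count its tokens first. This sketch assumes tiktoken's o200k_base encoding approximates GPT-4.1's tokenizer, which this page does not confirm.

```python
# Sketch: check whether a document fits in a 1M-token context window.
# Assumption: o200k_base approximates GPT-4.1's tokenizer (not stated here).
import tiktoken

CONTEXT_WINDOW = 1_000_000
enc = tiktoken.get_encoding("o200k_base")

def fits_in_context(text: str, reserve_for_output: int = 4_000) -> bool:
    """True if the text leaves room for the reply within the window."""
    return len(enc.encode(text)) <= CONTEXT_WINDOW - reserve_for_output
```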
Capabilities
  • coding: 44.3 (#88 globally)
  • reasoning: 25.9 (#95 globally)
  • math: 34.8 (#120 globally)
  • knowledge: 52.7 (#86 globally)
  • language: 83.8 (#37 globally)
Benchmark Scores
Tested on 22 benchmarks · Ranked across 5 categories
[Score distribution chart across all 233 models, 0-100 scale; GPT-4.1's position marked]
Aider polyglot: 52.4

Multi-language code editing from Aider. Tests editing ability across Python, JavaScript, TypeScript, Java, C++, Go, Rust, and more.

SWE-bench Verified: 48.5

Real-world software engineering tasks from GitHub issues. Models must diagnose bugs and write patches that pass test suites. Human-verified subset of SWE-bench.

CadEval: 42.0

Computer-aided design evaluation. Tests understanding of CAD concepts, 3D modeling, and engineering design principles.

HELM — WildBench: 85.4

Stanford HELM WildBench evaluation. Tests reasoning on challenging real-world tasks.

SimpleBench: 12.4

Deceptively simple questions that humans find easy but AI models often get wrong. Tests common sense and reasoning gaps.

ARC-AGI: 5.5

Abstraction and Reasoning Corpus. Tests fluid intelligence through novel visual pattern recognition puzzles. Core measure of general intelligence.

MATH level 5: 83.0

Competition-level math from AMC, AIME, and olympiad problems. Level 5 is the hardest tier, requiring creative problem-solving.

HELM — Omni-MATH: 47.1

Stanford HELM evaluation of mathematical reasoning across diverse problem types.

OTIS Mock AIME 2024-2025: 38.3

Mock AIME (American Invitational Mathematics Exam) problems from OTIS. Tests mathematical competition performance.
Legend: Excellent (85+) · Good (70-85) · Average (50-70) · Below (<50)
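The legend maps scores to four bands; a tiny Python sketch of that mapping, with band names and cutoffs taken straight from the legend above:

```python
def score_band(score: float) -> str:
    """Map a benchmark score to the page's quality bands (see legend)."""
    if score >= 85:
        return "Excellent"
    if score >= 70:
        return "Good"
    if score >= 50:
        return "Average"
    return "Below"

# e.g. GPT-4.1's HELM — WildBench score:
print(score_band(85.4))  # -> Excellent
```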
Links
  • Documentation
  • Community
  • BenchGecko API (model slug: gpt-4-1)
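The BenchGecko API itself is not documented on this page, so the following is purely a hypothetical sketch: the endpoint URL and response fields are assumptions built around the gpt-4-1 slug listed above.

```python
# Hypothetical sketch only: endpoint shape and field names are assumptions,
# not documented here. Substitute the real BenchGecko API docs.
import json
import urllib.request

url = "https://benchgecko.example/api/models/gpt-4-1"  # assumed endpoint

with urllib.request.urlopen(url) as resp:
    model = json.load(resp)

print(model["name"], model["avg_score"])  # assumed field names
```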
Specifications
  • Type: multimodal
  • Context: 1.0M tokens (~524 books)
  • Released: Apr 2025
  • License: Proprietary
  • Status: Active
  • Cost / Message: ~$0.012
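The ~$0.012 per message figure is consistent with the listed token prices; here is the arithmetic as a short sketch, with illustrative token counts (the page does not state which message sizes it assumes):

```python
# Estimating request cost from GPT-4.1's listed prices:
# $2.00 per 1M input tokens, $8.00 per 1M output tokens.
INPUT_PER_M = 2.00
OUTPUT_PER_M = 8.00

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request at the listed per-million-token rates."""
    return input_tokens / 1e6 * INPUT_PER_M + output_tokens / 1e6 * OUTPUT_PER_M

# Illustrative sizes: a 2,000-token prompt with a 1,000-token reply
# comes to the listed ~$0.012.
print(f"${request_cost(2_000, 1_000):.4f}")  # -> $0.0120
```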
Available On
  • OpenAI: $2.00/1M input tokens
GPT-4.1 is a proprietary multimodal AI model by OpenAI, released in April 2025. It has an average benchmark score of 41.2. Context window: 1M tokens.