Phi-4, from [Microsoft Research](/microsoft), is designed to perform well on complex reasoning tasks and to operate efficiently where memory is limited or quick responses are needed. At 14 billion...
Tested on 16 benchmarks, averaging 43.2%. Top scores: Chatbot Arena Elo, Overall: 1255.4 (an Elo rating, not a percentage); MMLU: 79.7%; IFEval: 68.8%.
HuggingFace MuSR (Multistep Soft Reasoning). Tests multi-hop reasoning that requires chaining multiple facts together.
Competition-level math drawn from AMC, AIME, and olympiad problems. Level 5 is the hardest tier and requires creative problem-solving.
HuggingFace evaluation of MATH Level 5 problems. Competition math requiring advanced reasoning and proof construction.
Mock AIME (American Invitational Mathematics Exam) problems from OTIS. Tests mathematical competition performance.
Massive Multitask Language Understanding. 57 subjects from STEM, humanities, and social sciences. The most widely-cited knowledge benchmark.
Writing quality evaluation by Lech Mazur. Tests prose quality, coherence, and stylistic ability.
HuggingFace MMLU-Pro. Harder version of MMLU with 10 answer choices instead of 4 and more challenging questions.
- Type: text
- Context: 16K tokens
- Released: Jan 2025
- License: Open Source
- Status: Active
- Cost / Message: ~$0.000