Phi-4, from [Microsoft Research](/microsoft), is designed to perform well on complex reasoning tasks and to operate efficiently where memory is limited or quick responses are needed. At 14 billion...
Tested on 16 benchmarks, averaging 43.2%. Top scores: Chatbot Arena Elo, Overall: 1255.4 (an Elo rating, not a percentage); MMLU: 79.7%; IFEval: 68.8%.
HuggingFace MuSR (Multistep Soft Reasoning). Tests multi-hop reasoning that requires chaining multiple facts together.
Competition-level math drawn from AMC, AIME, and olympiad problems. Level 5 is the hardest tier and requires creative problem-solving.
HuggingFace evaluation of MATH Level 5 problems. Competition math requiring advanced reasoning and proof construction.
Mock AIME (American Invitational Mathematics Exam) problems from OTIS. Tests mathematical competition performance.
Massive Multitask Language Understanding. 57 subjects from STEM, humanities, and social sciences. The most widely-cited knowledge benchmark.
Writing quality evaluation by Lech Mazur. Tests prose quality, coherence, and stylistic ability.
HuggingFace MMLU-Pro. Harder version of MMLU with 10 answer choices instead of 4 and more challenging questions.
- Type: text
- Context: 16K tokens
- Released: Jan 2025
- License: Open Source
- Status: Active
- Cost / Message: ~$0.000