Qwen2.5 72B is the 72-billion-parameter model in Qwen2.5, the latest series of Qwen large language models. Qwen2.5 brings the following improvements over Qwen2: significantly more knowledge and greatly improved capabilities in coding and...
Tested on 24 benchmarks with a 53.2% average. Top scores: Chatbot Arena Overall Elo (1302.3), ARC AI2 (92.7%), IFEval (86.4%).
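An Elo rating like the Arena score above is only meaningful in comparison to other models: the gap between two ratings maps to an expected head-to-head win rate. A minimal sketch of that mapping, using the standard Elo formula and a hypothetical 1200-rated opponent for illustration:

```python
def elo_win_prob(rating_a: float, rating_b: float) -> float:
    """Expected probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

# Arena Elo of 1302.3 vs a hypothetical 1200-rated opponent:
# a ~100-point gap corresponds to roughly a 64% expected win rate.
p = elo_win_prob(1302.3, 1200.0)
print(f"{p:.3f}")
```

Equal ratings give exactly 0.5, and each 400-point advantage multiplies the win odds by 10.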
Code editing benchmark from the Aider project. Measures ability to apply targeted code changes while maintaining correctness and style.
BIG-Bench Hard. 23 challenging tasks from BIG-Bench on which prior language models fell short of average human-rater performance.
HuggingFace MuSR (Multistep Soft Reasoning). Tests multi-hop reasoning that requires chaining multiple facts together.
Competition-level math from AMC, AIME, and olympiad problems. Level 5 is the hardest tier, requiring creative problem-solving.
HuggingFace evaluation of MATH Level 5 problems. Competition math requiring advanced reasoning and proof construction.
Mock AIME (American Invitational Mathematics Exam) problems from OTIS. Tests mathematical competition performance.
- Type: text
- Context: 33K tokens (~16 books)
- Released: Sep 2024
- License: Open Source
- Status: Active
- Cost / Message: ~$0.001