Qwen2.5 72B belongs to Qwen2.5, the latest series of Qwen large language models. Qwen2.5 brings the following improvements over Qwen2:
- Significantly more knowledge and greatly improved capabilities in coding and...
Tested on 25 benchmarks with a 51.6% average. Top scores: Chatbot Arena Elo (Overall) at 1302.8 rating points, ARC AI2 at 92.7%, IFEval at 86.4%.
Gemma 4 31B scores 63.9 (100% as good) at $0.12/1M input · 67% cheaper
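The comparison above gives only the alternative's price and the relative saving. As a rough sketch, the baseline input price it implies can be backed out as follows. This assumes "67% cheaper" refers to the input price per 1M tokens relative to Qwen2.5 72B, which the listing does not state explicitly; the function name is hypothetical and the result is an inference, not a published price.

```python
def implied_baseline_price(alt_price: float, cheaper_fraction: float) -> float:
    """Back out a baseline price from an alternative's price and its relative saving.

    If the alternative costs alt_price and is cheaper_fraction cheaper,
    then alt_price = baseline * (1 - cheaper_fraction).
    """
    if not 0.0 <= cheaper_fraction < 1.0:
        raise ValueError("cheaper_fraction must be in [0, 1)")
    return alt_price / (1.0 - cheaper_fraction)


# Figures taken from the comparison line: $0.12 per 1M input tokens, 67% cheaper.
# Assumption: the saving is measured on input price against Qwen2.5 72B.
baseline = implied_baseline_price(0.12, 0.67)
print(f"Implied baseline input price: ~${baseline:.2f} per 1M tokens")
```

Because the 67% figure is itself rounded, the result is only an approximation.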
Code editing benchmark from the Aider project. Measures ability to apply targeted code changes while maintaining correctness and style.
Unusual and adversarial machine learning challenges. Tests robustness of reasoning about edge cases in ML systems.
BIG-Bench Hard. 23 challenging tasks from BIG-Bench where prior language models fell below average human performance.
HuggingFace MuSR (Multi-Step Reasoning). Tests multi-hop reasoning requiring chaining multiple facts together.
Competition-level math from AMC, AIME, and olympiad problems. Level 5 is the hardest tier, requiring creative problem-solving.
HuggingFace evaluation of MATH Level 5 problems. Competition math requiring advanced reasoning and proof construction.
Mock AIME (American Invitational Mathematics Examination) problems from OTIS. Tests mathematical competition performance.
- Type: text
- Context: 131K tokens (~66 books)
- Released: Sep 2024
- License: Open Source
- Status: Active
- Cost / Message: ~$0.001