Meta's latest class of models, Llama 3.1, launched in a variety of sizes and flavors. This 70B instruct-tuned version is optimized for high-quality dialogue use cases. It has demonstrated strong...
Tested on 16 benchmarks with a 37.8% average score. Top scores: Chatbot Arena Elo, Overall (1292.8), IFEval (86.7%), MMLU (73.5%).
Phi 4 scores 54.2 (101% of this model's score) at $0.07/1M input tokens · 84% cheaper
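The "101% as good" and "84% cheaper" figures are simple ratios against this model's score and price. A minimal sketch of that arithmetic, where the baseline score (53.7) and baseline price ($0.45/1M input tokens) are hypothetical values chosen only to reproduce the quoted percentages, not published numbers:

```python
def relative_score(candidate: float, baseline: float) -> float:
    """Candidate's score as a percentage of the baseline's score."""
    return 100 * candidate / baseline

def percent_cheaper(candidate_price: float, baseline_price: float) -> float:
    """How much cheaper the candidate is, as a percentage of the baseline price."""
    return 100 * (1 - candidate_price / baseline_price)

# Assumed (hypothetical) baseline: score 53.7, $0.45 per 1M input tokens.
print(round(relative_score(54.2, 53.7)))    # 101
print(round(percent_cheaper(0.07, 0.45)))   # 84
```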
Code editing benchmark from the Aider project. Measures ability to apply targeted code changes while maintaining correctness and style.
Unusual and adversarial machine learning challenges. Tests robustness of reasoning about edge cases in ML systems.
HuggingFace MuSR (Multi-Step Reasoning). Tests multi-hop reasoning requiring chaining multiple facts together.
HuggingFace evaluation of MATH Level 5 problems. Competition math requiring advanced reasoning and proof construction.
Competition-level math from AMC, AIME, and olympiad problems. Level 5 is the hardest tier, requiring creative problem-solving.
Mock AIME (American Invitational Mathematics Exam) problems from OTIS. Tests mathematical competition performance.
- Type: Text
- Context: 131K tokens (~66 books)
- Released: Jul 2024
- License: Open Source
- Status: Active
- Cost / Message: ~$0.001
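The ~$0.001 cost-per-message figure is a rough estimate: tokens per message multiplied by the per-token price. A minimal sketch, where the message size (750 input + 250 output tokens) and the $1.00/1M-token prices are illustrative assumptions rather than this model's published pricing:

```python
def cost_per_message(input_tokens: int, output_tokens: int,
                     input_price_per_m: float, output_price_per_m: float) -> float:
    """Dollar cost of one message, given per-1M-token prices."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

# Hypothetical short chat turn: 750 input + 250 output tokens at $1.00/1M each.
print(cost_per_message(750, 250, 1.00, 1.00))  # 0.001
```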