Meta Llama text-generation model. 383K downloads on HuggingFace.
Tested on 21 benchmarks with a 38.0% average score. Top scores: ARC AI2 (93.7%), HellaSwag (85.6%), TriviaQA (82.7%).
Unusual and adversarial machine learning challenges. Tests robustness of reasoning about edge cases in ML systems.
Capture-the-flag cybersecurity challenges. Tests vulnerability analysis, reverse engineering, cryptography, and exploitation skills.
BIG-Bench Hard (BBH). 23 challenging tasks from BIG-Bench on which prior language models fell below average human-rater performance.
Deceptively simple questions that humans find easy but AI models often answer incorrectly. Tests common-sense reasoning gaps.
HuggingFace MuSR (Multistep Soft Reasoning). Tests multi-hop reasoning over narrative text, requiring models to chain multiple facts together.
Competition-level math from AMC, AIME, and olympiad problems. Level 5 is the hardest tier, requiring creative problem-solving.
Mock AIME (American Invitational Mathematics Exam) problems from OTIS. Tests mathematical competition performance.
HuggingFace evaluation of MATH Level 5 problems. Competition math requiring advanced reasoning and proof construction.
- Type: text-generation
- Context: N/A
- Released: Jul 2024
- License: Open Source
- Status: Active