Qwen text generation model. 855K downloads on Hugging Face.
Tested on 6 benchmarks with 38.8% average. Top scores: GSM8K (65.8%), HellaSwag (49.1%), MMLU (38.1%).
Aider: Code editing benchmark from the Aider project. Measures the ability to apply targeted code changes while maintaining correctness and style.
GSM8K: Grade school math word problems. 8,500 problems testing multi-step arithmetic reasoning. A foundational math benchmark.
HellaSwag: Sentence completion requiring commonsense reasoning about physical and social situations. Tests real-world understanding.
MMLU: Massive Multitask Language Understanding. 57 subjects from STEM, humanities, and social sciences. One of the most widely cited knowledge benchmarks.
ARC: AI2 Reasoning Challenge. Grade-school science questions requiring multi-step reasoning. Separate Easy and Challenge sets cover different difficulty levels.
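The summary figures allow a quick sanity check on the scores that are not listed. The sketch below assumes the 38.8% figure is an unweighted mean over all six benchmarks and that the three named scores are the three highest; neither assumption is confirmed by the page, and the variable names are illustrative.

```python
# Figures reported in the summary; reading the average as an
# unweighted mean over all six benchmarks is an assumption.
num_benchmarks = 6
reported_average = 38.8
top_scores = {"GSM8K": 65.8, "HellaSwag": 49.1, "MMLU": 38.1}

# Total points implied by the average, minus the three known scores.
total_points = num_benchmarks * reported_average
remaining_total = total_points - sum(top_scores.values())
remaining_count = num_benchmarks - len(top_scores)
remaining_average = remaining_total / remaining_count

print(f"Implied average of the other {remaining_count} benchmarks: "
      f"{remaining_average:.1f}%")
```

Under those assumptions the three unlisted scores average roughly 26.6%, well below the headline results, so the top scores should not be read as typical of the model's performance across the suite.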
- Type: text-generation
- Context: N/A
- Released: Sep 2024
- License: Open Source
- Status: Active