Kimi K2 Instruct is a large-scale Mixture-of-Experts (MoE) language model developed by Moonshot AI, featuring 1 trillion total parameters with 32 billion active per forward pass. It is optimized for...
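To make the sparsity of the MoE design concrete, the parameter counts quoted above can be turned into an active fraction. This is a minimal arithmetic sketch; the constant and function names are illustrative and not part of any Moonshot AI tooling.

```python
# Figures taken from the description above: 1 trillion total, 32 billion active.
TOTAL_PARAMS = 1_000_000_000_000
ACTIVE_PARAMS = 32_000_000_000


def active_fraction(active: int, total: int) -> float:
    """Return the share of parameters used in a single forward pass."""
    return active / total


fraction = active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS)
print(f"Active per forward pass: {fraction:.1%}")  # 3.2%
```

Under these figures, roughly 1 in 31 parameters participates in any given forward pass.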
Tested on 12 benchmarks with 56.2% average. Top scores: Lech Mazur Writing (86.9%), HELM — WildBench (86.2%), HELM — IFEval (85.0%).
DeepSeek V3.2 scores 58.7 (101% as good) at $0.25/1M input · 56% cheaper
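The comparison line can be inverted to estimate the input price it is being compared against. The sketch below assumes that "56% cheaper" refers to input price per 1M tokens relative to Kimi K2 Instruct; that reading is an assumption, and the helper name is hypothetical.

```python
def implied_baseline_price(cheaper_price: float, percent_cheaper: float) -> float:
    """Recover the baseline price from a price quoted as N% cheaper."""
    return cheaper_price / (1 - percent_cheaper / 100)


# $0.25 per 1M input tokens, quoted as 56% cheaper (assumed: versus Kimi K2 Instruct).
baseline = implied_baseline_price(0.25, 56)
print(f"Implied baseline: ${baseline:.2f} per 1M input tokens")  # $0.57
```

Because the quoted percentage is rounded, the result is only an estimate of the listed price.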
Multi-language code editing from Aider. Tests editing ability across Python, JavaScript, TypeScript, Java, C++, Go, Rust, and more.
Unusual and adversarial machine learning challenges. Tests robustness of reasoning about edge cases in ML systems.
Complex terminal-based engineering tasks. Models must use command-line tools, navigate filesystems, and debug systems through shell interaction.
Stanford HELM WildBench evaluation. Tests reasoning on challenging real-world tasks.
Deceptively simple questions that humans find easy but AI models often get wrong. Tests common sense and reasoning gaps.
Stanford HELM evaluation of mathematical reasoning across diverse problem types.
- Type: text
- Context: 131K tokens (~66 books)
- Released: Jul 2025
- License: Open Source
- Status: Active
- Cost / Message: ~$0.003