DeepSeek-V3 is the latest model from the DeepSeek team, building on the instruction-following and coding abilities of previous versions. The model was pre-trained on nearly 15 trillion tokens; the reported evaluations...
Tested on 22 benchmarks with a 59.0% average. Top scores: Chatbot Arena Elo, Overall (1358.2), ARC AI2 (93.7%), HellaSwag (85.2%).
Qwen3 Next 80B A3B Thinking scores 57.5 (about 97% of DeepSeek-V3's 59.0 average) at $0.10/1M input tokens, roughly 70% cheaper; the arithmetic is sketched below.
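A quick back-of-the-envelope check of those two claims, as a minimal Python sketch. The 59.0 and 57.5 averages and the $0.10/1M input price come from the figures above; DeepSeek-V3's own input price is not stated on this page, so the ~$0.33/1M value is an assumption inferred from the "70% cheaper" claim.

```python
# Back-of-the-envelope comparison of benchmark averages and input pricing.
# The averages and Qwen's price are taken from the page above; DeepSeek-V3's
# input price is NOT published here and is inferred from "70% cheaper".

deepseek_avg = 59.0  # DeepSeek-V3 average across 22 benchmarks (%)
qwen_avg = 57.5      # Qwen3 Next 80B A3B Thinking average (%)

relative_quality = qwen_avg / deepseek_avg
print(f"Relative quality: {relative_quality:.1%}")  # -> 97.5%

qwen_price = 0.10                          # $ per 1M input tokens (stated above)
deepseek_price = qwen_price / (1 - 0.70)   # assumption implied by "70% cheaper"
saving = 1 - qwen_price / deepseek_price
print(f"Implied DeepSeek-V3 input price: ${deepseek_price:.2f}/1M tokens")
print(f"Cost saving: {saving:.0%}")        # -> 70% by construction
```

Run as-is, it prints the 97.5% relative quality and the implied ~$0.33/1M DeepSeek-V3 input price used in the sketch after the spec list below.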
The evaluated benchmarks include:

- Multi-language code editing from Aider: tests editing ability across Python, JavaScript, TypeScript, Java, C++, Go, Rust, and more.
- Unusual and adversarial machine learning challenges: tests robustness of reasoning about edge cases in ML systems.
- BIG-Bench Hard: 23 challenging tasks from BIG-Bench on which prior language models fell below average human performance.
- Stanford HELM WildBench evaluation: tests reasoning on challenging real-world tasks.
- Deceptively simple questions that humans find easy but AI models often get wrong: tests common sense and reasoning gaps.
- Competition-level math from AMC, AIME, and olympiad problems; Level 5 is the hardest tier, requiring creative problem-solving.
- Stanford HELM evaluation of mathematical reasoning across diverse problem types.
- Mock AIME (American Invitational Mathematics Exam) problems from OTIS: tests mathematical competition performance.
- Type: text
- Context: 164K tokens (~120K words)
- Released: Dec 2024
- License: Open Source
- Status: Active
- Cost / Message: ~$0.002 (derivation sketched below)
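The ~$0.002 cost-per-message figure can be sanity-checked the same way. Both inputs below are assumptions for illustration: the input price is the ~$0.33/1M value inferred above, and the 6,000-token average message size is a hypothetical chosen to show the arithmetic, not a measured number.

```python
# Estimate cost per message from a per-token price and an average message size.
# Both inputs are assumptions: the price is the ~$0.33/1M figure inferred from
# the "70% cheaper" comparison, and the message size is a hypothetical value
# that roughly reproduces the ~$0.002/message spec above.

price_per_million_tokens = 0.33  # $ per 1M input tokens (assumed)
tokens_per_message = 6_000       # assumed average prompt size (hypothetical)

cost_per_message = price_per_million_tokens * tokens_per_message / 1_000_000
print(f"~${cost_per_message:.4f} per message")  # -> ~$0.0020
```

Under these assumptions the figure checks out; a shorter average prompt or a lower real input price would pull the per-message cost down proportionally.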