The Qwen3.5 native vision-language Flash models are built on a hybrid architecture that integrates a linear attention mechanism with a sparse mixture-of-experts model, achieving higher inference efficiency. Compared to the...
Tested on 8 benchmarks, averaging 35.0%. Top scores: Chatbot Arena Elo, Overall (1397.0); Chatbot Arena Elo, Coding (1237.8); OTIS Mock AIME 2024-2025 (85.5%).
Mistral Nemo scores 39.0 (102% as good) at $0.02/1M input · 69% cheaper
Mock AIME (American Invitational Mathematics Examination) problems from OTIS. Tests mathematical competition performance.
Original research-level math problems created by professional mathematicians. Problems are unpublished and cannot be memorized.
Hardest tier of FrontierMath. Problems at the frontier of human mathematical ability, beyond the reach of most mathematicians.
Graduate-level science questions written by PhD experts. Diamond subset contains the highest-quality questions, which expert validators answered correctly and most skilled non-experts answered incorrectly, testing deep understanding.
Tactical chess puzzles testing pattern recognition and multi-move calculation. Measures strategic reasoning ability.
Simple factual questions with verified correct answers. Tests accuracy of basic knowledge retrieval. Low scores suggest a tendency to hallucinate.
Chatbot Arena overall Elo rating. Crowdsourced human preference ranking from blind head-to-head comparisons across all topics.
Chatbot Arena coding Elo. Human preference ranking specifically for coding tasks and technical questions.
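To make the Elo ratings concrete, here is a minimal sketch of how a rating gap maps to an expected head-to-head win rate. It assumes the standard Elo logistic formula (base 10, scale 400); the arena's actual rating fit may differ in detail. The function name `elo_expected_score` and the 1297.0 opponent rating are hypothetical, chosen only for illustration.

```python
def elo_expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the standard Elo model.

    The result is A's expected win rate, counting a tie as half a win.
    Assumption: base-10 logistic curve with a 400-point scale.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


if __name__ == "__main__":
    # Hypothetical opponent rated 100 points below the 1397.0 overall rating.
    p = elo_expected_score(1397.0, 1297.0)
    print(f"Expected win rate with a 100-point lead: {p:.3f}")  # prints 0.640
```

Under this model only the rating difference matters, so equal ratings give 0.5 and a 100-point lead gives roughly a 64% expected win rate. Ratings from the overall and coding leaderboards are fitted separately and should not be compared with each other this way.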
- Type: multimodal
- Context: 1.0M tokens (~500 books)
- Released: Feb 2026
- License: Open Source
- Status: Active
- Cost / Message: ~$0.000