Gemini 3 Flash Preview is a high-speed, high-value thinking model designed for agentic workflows, multi-turn chat, and coding assistance. It delivers near-Pro-level reasoning and tool...
Tested on 24 benchmarks with a 49.1% average. Top scores: Chatbot Arena Elo — Overall (1473.9), Chatbot Arena Elo — Coding (1436.4), OTIS Mock AIME 2024–2025 (92.8%).
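Arena Elo ratings are relative, not percentages: what matters is the gap between two models, which maps to an expected head-to-head win rate via the standard Elo formula. A minimal sketch, using hypothetical ratings (1470 vs. 1430) rather than any scores quoted above:

```python
def elo_win_probability(rating_a: float, rating_b: float) -> float:
    """Expected rate at which A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

# A 40-point Elo gap corresponds to roughly a 55.7% expected win rate.
p = elo_win_probability(1470.0, 1430.0)
```

Under this model, even a model dozens of Elo points ahead only wins slightly more than half of head-to-head comparisons.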
Qwen3 235B A22B Thinking 2507 scores 59.4 (101% as good) at $0.15/1M input · 70% cheaper
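The "70% cheaper" figure lets you back out the implied reference price: if $0.15/1M input tokens is a 70% discount, the baseline price x satisfies x × (1 − 0.70) = 0.15. A quick sanity check (the baseline price here is inferred, not stated in the source):

```python
# Back out the implied baseline input price from the quoted discount.
cheaper_fraction = 0.70   # "70% cheaper"
qwen_price = 0.15         # $/1M input tokens

# x * (1 - 0.70) = 0.15  ->  x = 0.15 / 0.30
implied_baseline = qwen_price / (1 - cheaper_fraction)  # ~$0.50 per 1M input tokens
```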
Real-world software engineering tasks from GitHub issues. Models must diagnose bugs and write patches that pass test suites. Human-verified subset of SWE-bench.
Complex terminal-based engineering tasks. Models must use command-line tools, navigate filesystems, and debug systems through shell interaction.
Unusual and adversarial machine learning challenges. Tests robustness of reasoning about edge cases in ML systems.
Deceptively simple questions that humans find easy but AI models often get wrong. Tests common sense and reasoning gaps.
ARC-AGI 2, the harder sequel to ARC-AGI. More complex abstract reasoning patterns that test generalization beyond training data.
Abstraction and Reasoning Corpus. Tests fluid intelligence through novel visual pattern recognition puzzles. Core measure of general intelligence.
Mock AIME (American Invitational Mathematics Exam) problems from OTIS. Tests mathematical competition performance.
Original research-level math problems created by professional mathematicians. Problems are unpublished and cannot be memorized.
Hardest tier of FrontierMath. Problems at the frontier of human mathematical ability, many unsolved by most mathematicians.
- Type: multimodal
- Context: 1.0M tokens (~524 books)
- Released: Dec 2025
- License: Proprietary
- Status: preview
- Cost / Message: ~$0.004