Best Multimodal AI Models
A scoped ranking of models with published visual and video benchmark results. Evidence labels flag models whose score rests on a single benchmark, so a high rank is not mistaken for broad visual ability.
Top three:
- Qwen2.5 72B Instruct (Alibaba Qwen)
- GPT-5.2 (OpenAI)
- GPT-4o (2024-05-13) (OpenAI)
Ranked model table
Scores are based on the visible benchmark set and available metadata.
| Rank | Model | Vendor | Score | Evidence | Input price | Context (tokens) |
|---|---|---|---|---|---|---|
| #1 | Qwen2.5 72B Instruct | Alibaba Qwen | 87.5 | 1 benchmark · Limited | $0.36/M | 33K |
| #2 | GPT-5.2 | OpenAI | 87.5 | 1 benchmark · Limited | $1.75/M | 400K |
| #3 | GPT-4o (2024-05-13) | OpenAI | 87.5 | 1 benchmark · Limited | $5.00/M | 128K |
| #4 | GPT-4o (2024-08-06) | OpenAI | 71.4 | 1 benchmark · Limited | $2.50/M | 128K |
| #5 | GPT-4o (2024-11-20) | OpenAI | 59.7 | 3 benchmarks · Medium | $2.50/M | 128K |
| #6 | GPT-5 | OpenAI | 56.0 | 1 benchmark · Limited | $1.25/M | 400K |
| #7 | GPT-5.1 | OpenAI | 43.2 | 1 benchmark · Limited | $1.25/M | 400K |
| #8 | o4 Mini | OpenAI | 41.1 | 1 benchmark · Limited | $1.10/M | 200K |
| #9 | o3 | OpenAI | 31.5 | 1 benchmark · Limited | $2.00/M | 200K |
| #10 | Gemini 2.5 Pro | Google DeepMind | 21.7 | 1 benchmark · Limited | $1.25/M | 1.0M |
| #11 | GPT-5 Mini | OpenAI | 10.9 | 1 benchmark · Limited | $0.25/M | 400K |
| #12 | Claude Opus 4.5 | Anthropic | 10.5 | 1 benchmark · Limited | $5.00/M | 200K |
Multimodal coverage is uneven. Some models have strong video scores but fewer chart, image, or document results.
BenchGecko ranks models from published benchmark scores and model metadata. Scores do not measure every use case, and missing data can affect rankings.
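The ranking approach described above can be sketched in code. This is a hypothetical illustration, not BenchGecko's actual implementation: the `coverage_label` thresholds (1–2 benchmarks = "Limited", 3–5 = "Medium") are assumptions chosen to match the table's labels, and the aggregate is a plain mean of available scores.

```python
from dataclasses import dataclass

@dataclass
class ModelEntry:
    name: str
    scores: list[float]  # published benchmark scores on a 0-100 scale

def coverage_label(n_benchmarks: int) -> str:
    # Hypothetical thresholds; the real labeling rules are not published here.
    if n_benchmarks >= 6:
        return "Broad"
    if n_benchmarks >= 3:
        return "Medium"
    return "Limited"

def rank(models: list[ModelEntry]) -> list[tuple[str, float, str]]:
    """Return (name, mean score, coverage label), best score first.

    Models with no qualifying scores are dropped, mirroring how models
    without a public score are missing from the page.
    """
    rows = [
        (m.name, sum(m.scores) / len(m.scores), coverage_label(len(m.scores)))
        for m in models
        if m.scores
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

Note how a single strong benchmark can outrank a broader but lower average, which is exactly why the Evidence column matters when reading the table.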
Related rankings
Best AI Models for Reasoning
Reasoning models ranked from public benchmark scores across GPQA Diamond, BBH, ARC-AGI, SimpleBench, and related tests.
Best AI Models for Math
Math models ranked from public benchmark scores across GSM8K, MATH-level tests, AIME-style tasks, and FrontierMath where available.
Best Open-weight AI Models
Open-weight AI models ranked from available benchmark data, coverage confidence, pricing metadata, and listed license signals.
Questions
What counts as multimodal evidence?
BenchGecko uses published benchmarks for video understanding, image reasoning, charts, and visual question answering where those scores exist.
Why are some famous models missing?
A model may be missing if BenchGecko does not have a qualifying public score for the benchmarks used on this page.
Is this the same as an image generation ranking?
No. This page focuses on understanding and reasoning with visual inputs, not image generation quality.