AI Capabilities

Which model is best for what? Capability rankings derived from 40 benchmarks across 5 skill areas.

7 benchmarks · 109 models tested

Code generation, debugging, and software engineering tasks, from writing functions to fixing real-world GitHub issues.

1. Gemini 3.1 Pro Preview: 75.3
2. DeepSeek V3.2: 74.2
3. o3 Pro: 71.6
4. Claude Sonnet 4.6: 66.1
5. GPT-5 Pro: 60.4
109 models ranked

9 benchmarks · 165 models tested

Mathematical problem solving, logical reasoning, and multi-step inference, from arithmetic to competition-level mathematics.

1. Qwen2.5 Coder 32B Instruct: 91.1
2. Qwen2.5 Coder 7B Instruct: 86.7
3. Claude Instant: 86.7
4. Qwen3 Max: 85.2
5. Gemini 2.0 Pro: 83.5
165 models ranked

20 benchmarks · 170 models tested

Factual knowledge, question answering, and academic reasoning, tested across science, history, medicine, law, and more.

1. o3 Pro: 91.7
2. Grok 4 Fast: 87.8
3. DeepSeek V3.2 Exp: 83.3
4. DeepSeek-V2 (MoE-236B, May 2024): 77.3
5. Kimi K2 0905: 77.0
170 models ranked

3 benchmarks · 33 models tested

Autonomous task execution, tool use, and multi-step planning: the frontier of AI agents working independently.

1. Claude Sonnet 4.5: 62.9
2. DeepSeek V3.2 Exp: 42.9
3. Claude Opus 4.5: 42.3
4. Kimi K2.5: 38.9
5. Claude Sonnet 4: 38.5
33 models ranked

1 benchmark · 11 models tested

Vision understanding, image analysis, and cross-modal reasoning, processing both text and visual inputs.

1. Gemini 1.5 Pro (Feb 2024): 66.7
2. Qwen2.5 72B Instruct: 64.7
3. GPT-4o (2024-11-20): 62.5
4. GPT-4o (2024-08-06): 62.5
5. GPT-4o (2024-05-13): 62.5
11 models ranked