Beta

Benchmark

128 benchmarks across 11 categories. Select a benchmark to view the full leaderboard.

coding

Aider — Code Editing

coding
1. OpenAI · o1 · 84.2
2. Anthropic · Claude 3.5 Sonnet · 84.2
3. OpenAI · o1-preview · 79.7
4. OpenAI · GPT-4o (2024-05-13) · 72.9
5. OpenAI · GPT-4o (2024-11-20) · 71.4
27 models tested

Aider polyglot

coding

Aider Polyglot · measures how well AI models can edit code across multiple programming languages using the Aider coding assistant framework.

1. OpenAI · GPT-5 Chat · 88.0
2. OpenAI · GPT-5 · 88.0
3. OpenAI · o3 Pro · 84.9
4. Google DeepMind · Gemini 2.5 Pro · 83.1
5. Google DeepMind · Gemini 2.5 Pro Preview 06-05 · 83.1
53 models tested

CadEval

coding

CadEval · evaluates the ability to generate and reason about Computer-Aided Design code, testing spatial reasoning and engineering knowledge.

1. OpenAI · o3 · 74.0
2. Google DeepMind · Gemini 2.5 Pro · 64.0
3. OpenAI · o4 Mini · 62.0
4. OpenAI · o1 · 56.0
5. OpenAI · o3 Mini · 54.0
15 models tested

Cybench

coding

Cybench · evaluates AI on real Capture-The-Flag cybersecurity challenges, testing vulnerability analysis, exploitation, and security reasoning.

1. Anthropic · Claude Opus 4.6 · 93.0
2. Anthropic · Claude Opus 4.5 · 82.0
3. Anthropic · Claude Sonnet 4.5 · 60.0
4. xAI · Grok 4 · 43.0
5. Anthropic · Claude Opus 4.1 · 42.0
20 models tested

GSO-Bench

coding

GSO-Bench · evaluates AI models on real-world open-source software engineering tasks, testing the ability to understand and resolve actual GitHub issues.

1. Anthropic · Claude Opus 4.6 · 33.3
2. OpenAI · GPT-5.2 · 27.4
3. Anthropic · Claude Opus 4.5 · 26.5
4. Google DeepMind · Gemini 3 Pro · 18.6
5. Anthropic · Claude Sonnet 4.5 · 14.7
18 models tested

LiveBench — Agentic Coding

coding
1. OpenAI · GPT-5.1-Codex-Max · 56.7
2. z-ai · GLM 5.1 · 55.0
3. Alibaba Qwen · Qwen3.6 Plus · 55.0
4. z-ai · GLM 5 · 55.0
5. OpenAI · GPT-5.1-Codex · 53.3
29 models tested

LiveBench — Coding

coding
1. OpenAI · GPT-5.2-Codex · 83.6
2. OpenAI · GPT-5.1-Codex-Max · 81.4
3. Alibaba Qwen · Qwen3.6 Plus · 78.2
4. OpenAI · GPT-5 Mini · 76.1
5. DeepSeek · DeepSeek V3.2 · 75.7
29 models tested

OpenCompass — LiveCodeBenchV6

coding
1. z-ai · GLM 5 · 86.2
2. stepfun · Step 3.5 Flash · 83.9
3. z-ai · GLM 4.7 · 83.8
4. Alibaba Qwen · Qwen3.5 397B A17B · 83.0
5. DeepSeek · DeepSeek V3.2 Speciale · 80.9
32 models tested

SWE-bench Multilingual

coding
1. Anthropic · Claude Mythos Preview · 87.3
1 model tested

SWE-bench Multimodal

coding
1. Anthropic · Claude Mythos Preview · 59.0
1 model tested

SWE-bench Pro

coding
1. Anthropic · Claude Mythos Preview · 77.8
1 model tested

SWE-bench Verified

coding

SWE-bench Verified · 500 human-validated tasks from 12 real Python repositories (Django, Flask, scikit-learn, sympy, and others). Each task requires the model to produce a git patch that resolves a real GitHub issue and passes the test suite. The verified subset eliminates ambiguous tasks from the original SWE-bench. Claude Mythos Preview leads at 93.9%, crossing 90% for the first time in 2026. Opus 4.6 scores 78.7%. The benchmark remains the most-cited evaluation for code-generation capability.
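Grading is mechanical: apply the model's git patch to the checked-out repository and re-run the issue's tests. The sketch below shows that loop in Python; it is not the official SWE-bench harness, and the repository path, patch file, and test path in the usage note are hypothetical.

```python
# Minimal sketch of a SWE-bench-style grading step (not the official harness).
# Assumptions: the repo is already checked out at the issue's base commit,
# `patch.diff` is the model-generated git patch, and the issue's tests run via pytest.
import subprocess

def grade_patch(repo_dir: str, patch_file: str, test_path: str) -> bool:
    """Return True if the patch applies cleanly and the target tests pass."""
    apply = subprocess.run(
        ["git", "apply", patch_file],
        cwd=repo_dir, capture_output=True, text=True,
    )
    if apply.returncode != 0:
        return False  # a malformed or non-applying patch counts as a failure
    tests = subprocess.run(
        ["python", "-m", "pytest", test_path, "-q"],
        cwd=repo_dir, capture_output=True, text=True,
    )
    return tests.returncode == 0

# Example with hypothetical paths:
# resolved = grade_patch("django", "patch.diff", "tests/queries/test_qs_combinators.py")
```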

1. Anthropic · Claude Mythos Preview · 93.9
2. Anthropic · Claude Opus 4.6 · 78.7
3. OpenAI · GPT-5.4 · 76.9
4. Anthropic · Claude Opus 4.5 · 76.7
5. Google DeepMind · Gemini 3.1 Pro Preview · 75.6
23 models tested

SWE-Bench Verified (Bash Only)

coding

SWE-Bench Verified (Bash Only) · a curated subset of SWE-bench where models fix real Python repository bugs using only bash commands, no agent frameworks.

1. Anthropic · Claude Opus 4.5 · 74.4
2. OpenAI · GPT-5.2 · 71.8
3. Anthropic · Claude Sonnet 4.5 · 70.6
4. Anthropic · Claude Opus 4 · 67.6
5. OpenAI · GPT-5.1 · 66.0
19 models tested

Terminal Bench

coding

Terminal-Bench 2.0 · evaluates AI agents on real terminal-based coding tasks · writing scripts, debugging, running tests, and managing projects entirely through command-line interaction. Tests both code quality and terminal fluency. Claude Mythos Preview leads at 82.0%, demonstrating significant agentic terminal competence.

1. Anthropic · Claude Mythos Preview · 82.0
2. Google DeepMind · Gemini 3.1 Pro Preview · 78.4
3. OpenAI · GPT-5.3-Codex · 77.3
4. Anthropic · Claude Opus 4.6 · 74.7
5. Google DeepMind · Gemini 3 Pro · 69.4
27 models tested

WeirdML

coding

WeirdML · tests models on unusual and adversarial machine learning tasks that require creative problem-solving beyond standard patterns.

1. OpenAI · GPT-5.3-Codex · 79.3
2. Anthropic · Claude Opus 4.6 · 77.9
3. OpenAI · GPT-5.2 · 72.2
4. Google DeepMind · Gemini 3.1 Pro Preview · 72.1
5. Google DeepMind · Gemini 3 Pro · 69.9
70 models tested

knowledge

ANLI

knowledge

ANLI (Adversarial NLI) · adversarially constructed natural language inference dataset where each round targets weaknesses found in previous model generations.

1. Microsoft · phi-3-small 7.4B · 37.1
2. OpenAI · GPT-3.5 Turbo (older v0613) · 37.1
3. Meta · Llama 3 8B Instruct · 36.0
4. Microsoft · phi-3-medium 14B · 33.7
5. Mistral AI · Mixtral 8x7B Instruct · 32.8
9 models tested

ARC AI2

knowledge

AI2 Reasoning Challenge · tests grade-school level science knowledge with multiple-choice questions requiring reasoning beyond simple retrieval.

1. Meta · Llama 3.1 405B · 93.7
2. DeepSeek · DeepSeek V3 · 93.7
3. Alibaba Qwen · Qwen2.5 72B Instruct · 92.7
4. DeepSeek · DeepSeek-V2 (MoE-236B, May 2024) · 89.6
5. Microsoft · phi-3-medium 14B · 88.8
35 models tested

AudioMultiChallenge

knowledge
1. Google DeepMind · Gemini 2.5 Pro · 46.9
2. Google DeepMind · Gemini 2.5 Flash · 40.0
2 models tested

AudioMultiChallenge — Audio Output

knowledge
0 models tested

AudioMultiChallenge — Text Output

knowledge
1. Google DeepMind · Gemini 2.5 Pro · 46.9
2. Google DeepMind · Gemini 2.5 Flash · 40.0
3. Mistral AI · Voxtral Small 24B 2507 · 26.3
3 models tested

Balrog

knowledge

Balrog · benchmarks AI agents on text-based adventure games, testing language understanding, strategic planning, and long-horizon reasoning.

1. Google DeepMind · Gemini 3 Flash Preview · 48.1
2. xAI · Grok 4 · 43.6
3. Google DeepMind · Gemini 2.5 Pro · 43.3
4. DeepSeek · R1 · 34.9
5. Google DeepMind · Gemini 2.5 Flash · 33.5
22 models tested

C-Eval

knowledge
1. OpenAI · GPT-4 · 68.7
2. Meta · LLaMA-13B · 38.8
2 models tested

Chess Puzzles

knowledge

Chess Puzzles · tests strategic and tactical reasoning by having models solve chess puzzle positions, evaluating lookahead and pattern recognition abilities.
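Grading a puzzle is usually a mechanical check that the model's move matches the unique solution. A minimal sketch with the python-chess library, assuming each puzzle is stored as a FEN position plus the best move in UCI notation; the mate-in-one position shown is illustrative, not an item from the benchmark.

```python
# Sketch of mechanical puzzle grading with python-chess (pip install chess).
# Assumes each puzzle is a FEN plus the unique best move in UCI notation;
# the position below is an illustrative back-rank mate, not a dataset item.
import chess

def grade_puzzle(fen: str, solution_uci: str, model_answer_uci: str) -> bool:
    """True if the model's move is legal in the position and matches the solution."""
    board = chess.Board(fen)
    try:
        move = chess.Move.from_uci(model_answer_uci)
    except ValueError:
        return False                      # unparseable answer
    return move in board.legal_moves and model_answer_uci == solution_uci

# Example: mate in one (Re8#)
print(grade_puzzle("6k1/5ppp/8/8/8/8/5PPP/4R1K1 w - - 0 1", "e1e8", "e1e8"))  # True
```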

1. OpenAI · GPT-5.4 Pro · 58.6
2. Google DeepMind · Gemini 3.1 Pro Preview · 55.0
3. OpenAI · GPT-5.2 · 49.0
4. OpenAI · GPT-5.4 · 44.0
5. Google DeepMind · Gemini 3 Flash Preview · 38.0
24 models tested

CMMLU

knowledge
1. Alibaba Qwen · Qwen2-72B · 89.7
2. Alibaba Qwen · Qwen2.5 72B Instruct · 85.7
3. OpenAI · GPT-4 Turbo · 71.0
4. Meta · Llama 3.1 70B Instruct · 64.4
5. Alibaba Qwen · Qwen-14B · 58.7
8 models tested

CSQA2

knowledge
1. OpenAI · GPT-3.5 Turbo (older v0613) · 14.0
2. Meta · Llama 2-13B · 0.1
2 models tested

DeepResearch Bench

knowledge

DeepResearch Bench · evaluates AI on complex multi-step research tasks requiring information gathering, synthesis, and producing comprehensive analyses.

1. OpenAI · GPT-5 · 55.1
2. Anthropic · Claude Sonnet 4.5 · 52.6
3. Google DeepMind · Gemini 2.5 Pro · 49.7
4. Anthropic · Claude Opus 4.1 · 49.7
5. Anthropic · Claude Opus 4 · 49.0
13 models tested

EnigmaEval

knowledge
1. Google DeepMind · Gemini 3.1 Pro Preview · 19.8
2. OpenAI · o3 · 13.1
2 models tested

Fiction.LiveBench

knowledge

Fiction.LiveBench · a continuously updated benchmark using recently published fiction to test reading comprehension and reasoning, preventing data contamination.

1. OpenAI · GPT-5 · 97.2
2. OpenAI · o3 Pro · 97.2
3. xAI · Grok 4 Fast · 94.4
4. xAI · Grok 4 · 94.4
5. Google DeepMind · Gemini 2.5 Pro · 91.7
41 models tested

GeoBench

knowledge

GeoBench · tests geographic knowledge and spatial reasoning across countries, landmarks, coordinates, and geopolitical understanding.

1. Google DeepMind · Gemini 3 Flash Preview · 88.0
2. Google DeepMind · Gemini 3 Pro · 84.0
3. OpenAI · GPT-5 · 81.0
4. Google DeepMind · Gemini 2.5 Pro · 81.0
5. OpenAI · o1 · 80.0
26 models tested

GPQA

knowledge
1. Meta · Meta Llama 3 8B · 19.7
2. Qwen2.5 72B Instruct Abliterated · 19.4
3. Alibaba Qwen · Qwen2-72B · 19.2
4. DeepSeek · DeepSeek R1 Distill Qwen 14B · 18.3
5. Microsoft · WizardLM-2 8x22B · 17.6
73 models tested

GPQA diamond

knowledge

Graduate-Level Google-Proof QA (Diamond set) · expert-crafted questions in physics, biology, and chemistry that are difficult even for domain PhDs.

1. Anthropic · Claude Mythos Preview · 94.5
2. OpenAI · GPT-5.4 Pro · 92.8
3. Google DeepMind · Gemini 3.1 Pro Preview · 92.1
4. OpenAI · GPT-5.4 · 91.1
5. Google DeepMind · Gemini 3 Pro · 90.2
96 models tested

HellaSwag

knowledge

HellaSwag · tests commonsense reasoning by asking models to predict the most plausible continuation of everyday scenarios.

1. OpenAI · GPT-4 Turbo · 93.7
2. Meta · Llama 3.1 405B · 85.6
3. TII · Falcon-180B · 85.3
4. DeepSeek · DeepSeek V3 · 85.2
5. DeepSeek · DeepSeek-V2 (MoE-236B, May 2024) · 82.8
29 models tested

HELM — GPQA

knowledge
1. Google DeepMind · Gemini 3 Pro · 80.3
2. OpenAI · GPT-5 Chat · 79.1
3. OpenAI · GPT-5 Mini · 75.6
4. OpenAI · o3 · 75.3
5. Google DeepMind · Gemini 2.5 Pro · 74.9
34 models tested

HELM — MMLU-Pro

knowledge
1. Google DeepMind · Gemini 3 Pro · 90.3
2. OpenAI · GPT-5 Chat · 86.3
3. Google DeepMind · Gemini 2.5 Pro · 86.3
4. OpenAI · o3 · 85.9
5. xAI · Grok 4 · 85.1
34 models tested

HLE

knowledge

HLE (Humanity's Last Exam) · a reasoning benchmark designed to be the hardest public evaluation of AI. Questions span mathematics, physics, philosophy, and logic · curated to be at or beyond the frontier of human expert capability. Tested with and without tool augmentation. Claude Mythos Preview scores 56.8% without tools and 64.7% with tools · making it one of the few benchmarks where every other tested model stays below 35%.

1. Anthropic · Claude Mythos Preview · 56.8
2. Google DeepMind · Gemini 3 Pro · 34.4
3. Anthropic · Claude Opus 4.6 · 31.1
4. OpenAI · GPT-5 Pro · 28.2
5. OpenAI · GPT-5.2 · 24.2
23 models tested

Humanity's Last Exam

knowledge
0 models tested

Humanity's Last Exam (Text Only)

knowledge
0 models tested

LAMBADA

knowledge

LAMBADA · measures the ability to predict the final word of a passage, requiring broad contextual understanding across long text spans.
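Scoring is last-word accuracy: the model sees the passage with its final word removed and must produce that word. Below is a minimal sketch with a small Hugging Face causal LM; real LAMBADA scoring also has to handle target words that span multiple tokens, which this single-token check ignores.

```python
# Minimal LAMBADA-style scorer: greedy next-token prediction of the final word.
# Simplification: real targets can span multiple tokens; this checks one token only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def last_word_correct(context: str, target_word: str) -> bool:
    """True if greedy decoding of the next token recovers the target word."""
    ids = tok(context, return_tensors="pt").input_ids
    with torch.no_grad():
        next_logits = model(ids).logits[0, -1]     # distribution over the next token
    predicted = tok.decode(int(next_logits.argmax())).strip()
    return predicted == target_word

# Example (toy passage, not a LAMBADA item):
# last_word_correct("She poured the coffee into her favorite", "mug")
```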

1. TII · Falcon-180B · 79.8
2. Meta · Llama 2-13B · 76.5
3. Meta · LLaMA-13B · 75.2
4. Baichuan 2-7B · 73.3
5. Stable Beluga 2 · 71.3
7 models tested

Lech Mazur Writing

knowledge

Lech Mazur Writing · evaluates creative writing ability, assessing prose quality, narrative coherence, and stylistic sophistication.

1. OpenAI · GPT-5 · 87.2
2. Alibaba Qwen · Qwen3 Max · 87.1
3. moonshotai · Kimi K2 0711 · 86.9
4. OpenAI · o3 Pro · 86.3
5. Google DeepMind · Gemini 2.5 Pro · 86.0
39 models tested

LiveBench — Overall

knowledge
1. OpenAI · GPT-5.2-Codex · 74.3
2. OpenAI · GPT-5.1-Codex-Max · 72.0
3. Alibaba Qwen · Qwen3.6 Plus · 70.8
4. z-ai · GLM 5.1 · 70.2
5. z-ai · GLM 5 · 68.8
29 models tested

MMLU

knowledge

Massive Multitask Language Understanding · 57 subjects spanning STEM, humanities, social sciences, and more. The standard benchmark for broad knowledge.
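MMLU is scored as plain multiple-choice accuracy over four options per question. A sketch of the per-item grading follows; `ask_model` is a stand-in for whatever model call is being evaluated, and the prompt format is only one common convention, not a fixed specification.

```python
# Multiple-choice grading sketch for an MMLU-style item.
# `ask_model` is a placeholder for the model under test; any sample item is invented.
def grade_item(question: str, choices: list[str], gold: str, ask_model) -> bool:
    letters = "ABCD"
    prompt = question + "\n" + "\n".join(
        f"{letter}. {choice}" for letter, choice in zip(letters, choices)
    ) + "\nAnswer with a single letter."
    answer = ask_model(prompt).strip().upper()[:1]   # keep only the leading letter
    return answer == gold

# Overall score = items answered correctly / total items, reported per subject and averaged.
```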

1. DeepSeek · DeepSeek V3 · 82.9
2. Anthropic · Claude 3.5 Sonnet · 82.0
3. OpenAI · GPT-4 (older v0314) · 81.9
4. Meta · Llama 3.3 70B Instruct (free) · 81.7
5. Alibaba Qwen · Qwen2.5 72B Instruct · 80.4
67 models tested

MMLU-PRO

knowledge
1. Alibaba Qwen · Qwen2-72B · 52.6
2. Alibaba · Qwen2.5 32B Instruct · 51.9
3. Alibaba Qwen · Qwen2.5 72B Instruct · 51.4
4. Qwen2.5 72B Instruct Abliterated · 50.4
5. Microsoft · Phi 4 · 48.6
73 models tested

MMMLU

knowledge
1. Anthropic · Claude Mythos Preview · 92.7
1 model tested

MultiChallenge

knowledge
1. Google DeepMind · Gemini 3.1 Pro Preview · 71.4
1 model tested

MultiNRC

knowledge
1. Google DeepMind · Gemini 3.1 Pro Preview · 64.7
1 model tested

OpenBookQA

knowledge

OpenBookQA · science questions that require combining a given core fact with broad common knowledge, mimicking an open-book exam setting.

1. Microsoft · phi-3-mini 3.8B · 84.0
2. Microsoft · phi-3-small 7.4B · 84.0
3. Microsoft · phi-3-medium 14B · 83.2
4. OpenAI · GPT-3.5 Turbo (older v0613) · 81.3
5. Mistral AI · Mixtral 8x7B Instruct · 81.1
19 models tested

OpenCompass — GPQA-Diamond

knowledge
1. Alibaba Qwen · Qwen3.5 397B A17B · 88.4
2. moonshotai · Kimi K2.5 · 88.1
3. z-ai · GLM 4.7 · 86.9
4. DeepSeek · DeepSeek V3.2 Speciale · 86.7
5. z-ai · GLM 5 · 85.3
32 models tested

OpenCompass — HLE

knowledge
1. DeepSeek · DeepSeek V3.2 Speciale · 28.6
2. moonshotai · Kimi K2.5 · 28.6
3. z-ai · GLM 5 · 28.1
4. Alibaba Qwen · Qwen3.5 397B A17B · 27.5
5. z-ai · GLM 4.7 · 25.4
32 models tested

OpenCompass — MMLU-Pro

knowledge
1. Alibaba Qwen · Qwen3.5 397B A17B · 87.6
2. moonshotai · Kimi K2.5 · 86.2
3. DeepSeek · DeepSeek V3.2 · 85.8
4. Google DeepMind · Gemini 2.5 Pro · 85.8
5. DeepSeek · DeepSeek V3.2 Speciale · 85.5
32 models tested

PIQA

knowledge

PIQA (Physical Interaction QA) · tests intuitive physical reasoning by asking models to select the correct approach for everyday physical tasks.

1. OpenAI · GPT-4o-mini (2024-07-18) · 77.4
2. OpenAI · GPT-4o-mini · 77.4
3. Google DeepMind · Gemini 1.5 Flash (May 2024) · 75.0
4. Meta · Llama 3.1 405B · 71.8
5. TII · Falcon-180B · 69.8
25 models tested

PostTrainBench

knowledge
1. Anthropic · Claude Opus 4.6 · 23.2
2. Google DeepMind · Gemini 3.1 Pro Preview · 21.6
3. OpenAI · GPT-5.2 · 21.4
4. OpenAI · GPT-5.4 · 20.2
5. Google DeepMind · Gemini 3 Pro · 18.1
15 models tested

Professional Reasoning — Finance

knowledge
1. Anthropic · Claude Opus 4.6 (Fast) · 53.3
2. OpenAI · GPT-5 · 51.3
3. OpenAI · GPT-5 Pro · 51.1
4. OpenAI · o3 Pro · 49.1
5. OpenAI · o3 · 47.7
5 models tested

Professional Reasoning — Legal

knowledge
1. Anthropic · Claude Opus 4.6 (Fast) · 52.3
2. OpenAI · GPT-5 Pro · 49.9
3. OpenAI · o3 Pro · 49.7
4. OpenAI · GPT-5 · 49.0
5. OpenAI · o3 · 48.6
5 models tested

ScienceQA

knowledge

ScienceQA · multimodal science questions spanning natural science, social science, and language science with diverse question formats and image context.

1. OpenAI · GPT-4o (2024-05-13) · 84.7
2. OpenAI · GPT-4o (2024-11-20) · 84.7
3. Anthropic · Claude 3 Haiku · 62.7
4. Meta · Llama 2-13B · 41.0
5. Meta · LLaMA-13B · 24.4
5 models tested

SciPredict

knowledge
1. Anthropic · Claude Opus 4.5 · 23.1
2. Google DeepMind · Gemini 3 Flash Preview · 22.2
2 models tested

SimpleQA Verified

knowledge

SimpleQA Verified · short factual questions with verified answers, measuring factual accuracy and the tendency to hallucinate or provide incorrect information.

1. Google DeepMind · Gemini 3.1 Pro Preview · 77.3
2. Google DeepMind · Gemini 3 Pro · 72.9
3. Alibaba Qwen · Qwen3 Max · 67.5
4. Google DeepMind · Gemini 3 Flash Preview · 67.4
5. Muse Spark · 66.3
32 models tested

TriviaQA

knowledge

TriviaQA · reading comprehension benchmark with trivia questions, requiring models to find and reason over evidence from provided documents.

1. Anthropic · Claude 2 · 87.5
2. OpenAI · GPT-3.5 Turbo (older v0613) · 85.8
3. OpenAI · GPT-4 Turbo · 84.8
4. DeepSeek · DeepSeek V3 · 82.9
5. Meta · Llama 3.1 405B · 82.7
20 models tested

TutorBench

knowledge
1. Google DeepMind · Gemini 2.5 Pro Preview 06-05 · 55.6
2. moonshotai · Kimi K2.5 · 54.6
2 models tested

VISTA

knowledge
1. Google DeepMind · Gemini 2.5 Pro Preview 06-05 · 54.6
2. OpenAI · o4 Mini · 51.8
2 models tested

VisualToolBench (VTB)

knowledge
1. Google DeepMind · Gemini 3.1 Pro Preview · 29.0
2. Anthropic · Claude Opus 4.6 (Fast) · 27.5
2 models tested

VPCT

knowledge

VPCT (Visual Pattern Completion Test) · tests visual reasoning and pattern recognition by having models complete visual sequences and transformations.

1. Google DeepMind · Gemini 3 Pro · 86.5
2. OpenAI · GPT-5.2 · 76.0
3. Google DeepMind · Gemini 3 Flash Preview · 58.9
4. OpenAI · GPT-5 · 49.0
5. OpenAI · GPT-5.1 · 38.0
22 models tested

Winogrande

knowledge

WinoGrande · large-scale commonsense reasoning benchmark where models must resolve ambiguous pronouns in carefully constructed sentence pairs.

1. Meta · Llama 3.1 405B · 78.4
2. Anthropic · Claude 3 Opus · 77.0
3. OpenAI · GPT-4 (older v0314) · 75.0
4. OpenAI · GPT-4 Turbo · 75.0
5. TII · Falcon-180B · 74.2
38 models tested

agentic

APEX-Agents

agentic

APEX-Agents · evaluates AI agents on complex, multi-step tasks requiring planning, tool use, and autonomous decision-making in realistic environments.

1. OpenAI · GPT-5.4 · 35.9
2. OpenAI · GPT-5.2 · 34.3
3. Google DeepMind · Gemini 3.1 Pro Preview · 33.5
4. OpenAI · GPT-5.3-Codex · 31.7
5. Anthropic · Claude Opus 4.6 · 31.7
17 models tested

MCP Atlas

agentic
1. Anthropic · Claude Opus 4.5 · 62.3
2. Google DeepMind · Gemini 3 Flash Preview · 57.4
2 models tested

OSWorld

agentic

OSWorld · tests AI agents on real-world computer tasks across operating systems, including web browsing, file management, and application use.

1. Anthropic · Claude Mythos Preview · 79.6
2. Anthropic · Claude Opus 4.5 · 66.3
3. moonshotai · Kimi K2.5 · 63.3
4. Anthropic · Claude Sonnet 4.5 · 62.9
5. Anthropic · Claude Sonnet 4 · 43.9
8 models tested

Remote Labor Index (RLI)

agentic
1. Anthropic · Claude Opus 4.6 (Fast) · 4.2
1 model tested

SWE Atlas — Codebase QnA

agentic
1. Anthropic · Claude Opus 4.6 (Fast) · 33.3
2. OpenAI · GPT-5.3-Codex · 32.6
3. Anthropic · Claude Sonnet 4.6 · 31.2
3 models tested

SWE Atlas — Test Writing

agentic
1. Anthropic · Claude Opus 4.6 (Fast) · 36.7
2. Anthropic · Claude Sonnet 4.6 · 31.8
2 models tested

SWE-Bench Pro (Private)

agentic
1. Anthropic · Claude Opus 4.5 · 23.4
2. Google DeepMind · Gemini 2.5 Pro Preview 06-05 · 10.1
2 models tested

SWE-Bench Pro (Public)

agentic
1. Anthropic · Claude Opus 4.5 · 45.9
2. OpenAI · GPT-5.2-Codex · 41.0
2 models tested

The Agent Company

agentic

The Agent Company · tests AI agents on realistic corporate tasks like email management, code review, data analysis, and cross-tool workflows.

1. DeepSeek · DeepSeek V3.2 Exp · 42.9
2. Google DeepMind · Gemini 2.5 Flash · 41.1
3. Anthropic · Claude Sonnet 4 · 33.1
4. Anthropic · Claude 3.7 Sonnet · 30.9
5. Google DeepMind · Gemini 2.5 Pro · 30.3
13 models tested

reasoning

ARC-AGI

reasoning

ARC-AGI · the original Abstraction and Reasoning Corpus, testing whether AI can solve novel visual pattern recognition tasks without memorization.

1. Google DeepMind · Gemini 3.1 Pro Preview · 98.0
2. OpenAI · GPT-5.4 Pro · 94.5
3. Anthropic · Claude Opus 4.6 · 94.0
4. OpenAI · GPT-5.4 · 93.7
5. OpenAI · GPT-5.2 Pro · 90.5
48 models tested

ARC-AGI-2

reasoning

ARC-AGI-2 · the second iteration of the Abstraction and Reasoning Corpus, testing novel pattern recognition and abstract reasoning without prior training data.

1. OpenAI · GPT-5.4 Pro · 83.3
2. Google DeepMind · Gemini 3.1 Pro Preview · 77.1
3. OpenAI · GPT-5.4 · 74.0
4. Anthropic · Claude Opus 4.6 · 69.2
5. Anthropic · Claude Sonnet 4.6 · 60.4
50 models tested

BBH

reasoning

BIG-Bench Hard · a curated subset of 23 challenging tasks from BIG-Bench where language models previously failed to outperform average humans.

1. DeepSeek · DeepSeek V3 · 83.3
2. Google DeepMind · Gemini 1.5 Pro (Feb 2024) · 78.7
3. Meta · Llama 3.1 405B · 77.2
4. Microsoft · phi-3-medium 14B · 75.2
5. Alibaba Qwen · Qwen2.5 72B Instruct · 73.1
24 models tested

CharXiv Reasoning

reasoning

CharXiv Reasoning · tests a model's ability to understand and reason about charts, plots, and figures extracted from real arXiv scientific papers. The model must answer questions that require reading axes, comparing data series, identifying trends, and drawing conclusions from visual scientific data. This benchmark specifically measures the intersection of visual understanding and scientific reasoning · a critical capability for research assistants and document analysis. Performance varies dramatically with and without tool use, making it a key differentiator for multimodal and agentic AI systems.

1. Anthropic · Claude Mythos Preview · 86.1
1 model tested

CharXiv Reasoning (with tools)

reasoning

CharXiv Reasoning (with tools) · the tool-augmented variant of CharXiv Reasoning. Models can use code execution, image processing, and other tools to analyze charts from arXiv papers. Claude Mythos Preview scores 93.2% with tools vs 86.1% without · demonstrating how tool use dramatically improves visual scientific reasoning. The gap between tool-augmented and bare performance is a key signal for agent capability.

1. Anthropic · Claude Mythos Preview · 93.2
1 model tested

GraphWalks BFS 256K-1M

reasoning

GraphWalks BFS 256K-1M · a long-context reasoning benchmark created by OpenAI that tests whether a model can perform breadth-first search (BFS) traversal across massive graphs encoded in 256,000 to 1,024,000 tokens of context. The model receives a graph represented as an edge list and must follow parent-child relationships across the entire extended context window. This is not simple retrieval · it requires true relational reasoning over hundreds of thousands of tokens. The dataset includes 100 problems at 256K context and 100 problems at 1,024K context. Claude Mythos Preview leads at 80.0%, more than doubling Opus 4.6 (38.7%) and far exceeding GPT-5.4 (21.4%). The massive performance gap between models makes this one of the most discriminating benchmarks for real long-context capability in 2026.
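The underlying task is ordinary breadth-first search; the difficulty is performing it over an edge list scattered across hundreds of thousands of tokens. Below is a small reference implementation of the traversal being asked for, usable as a ground-truth oracle; the edge-list format here is illustrative rather than the benchmark's exact prompt encoding.

```python
# Reference BFS over an edge list, as a ground-truth oracle for GraphWalks-style
# questions ("which nodes are reachable from X within k hops?").
# The edge-list format is illustrative; the benchmark's prompt encoding may differ.
from collections import deque

def bfs_within_k(edges: list[tuple[str, str]], start: str, k: int) -> set[str]:
    """Nodes reachable from `start` in at most k hops (directed parent->child edges)."""
    adj: dict[str, list[str]] = {}
    for parent, child in edges:
        adj.setdefault(parent, []).append(child)
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            continue
        for nxt in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return seen - {start}

# Example: bfs_within_k([("a", "b"), ("b", "c"), ("a", "d")], "a", 1) == {"b", "d"}
```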

1. Anthropic · Claude Mythos Preview · 80.0
1 model tested

HELM — WildBench

reasoning
1. OpenAI · GPT-5.1 · 86.3
2. moonshotai · Kimi K2 0711 · 86.2
3. OpenAI · o3 · 86.1
4. Google DeepMind · Gemini 3 Pro · 85.9
5. OpenAI · GPT-5 Chat · 85.7
34 models tested

HLE (with tools)

reasoning
1. Anthropic · Claude Mythos Preview · 64.7
1 model tested

LiveBench — Data Analysis

reasoning
1. OpenAI · GPT-5.2-Codex · 78.2
2. Alibaba Qwen · Qwen3.6 Plus · 69.9
3. z-ai · GLM 5 · 67.9
4. z-ai · GLM 5.1 · 63.2
5. OpenAI · GPT-5.1-Codex · 60.8
29 models tested

LiveBench — Reasoning

reasoning
1. OpenAI · GPT-5.1-Codex-Max · 84.6
2. OpenAI · GPT-5.1-Codex · 82.0
3. OpenAI · GPT-5.2-Codex · 77.7
4. Alibaba Qwen · Qwen3.6 Plus · 75.8
5. minimax · MiniMax M2.7 · 74.8
29 models tested

MUSR

reasoning
1. DeepSeek · DeepSeek R1 Distill Qwen 14B · 28.7
2. nousresearch · Hermes 3 70B Instruct · 23.4
3. Meta · Llama 3 8B Instruct · 19.9
4. Alibaba Qwen · Qwen2-72B · 19.7
5. Stable Beluga 2 · 18.6
73 models tested

SimpleBench

reasoning

SimpleBench · tests fundamental reasoning capabilities with straightforward problems designed to expose gaps in basic logical and spatial thinking.

1. Google DeepMind · Gemini 3.1 Pro Preview · 75.5
2. Google DeepMind · Gemini 3 Pro · 71.7
3. OpenAI · GPT-5.4 Pro · 68.9
4. Anthropic · Claude Opus 4.6 · 61.1
5. Google DeepMind · Gemini 2.5 Pro · 54.9
52 models tested

speed

Artificial Analysis — Agentic Index

speed

Artificial Analysis Agentic Index · a composite score measuring how well a model performs in agentic workflows · multi-step tool use, planning, error recovery, and autonomous task completion. Aggregates results from multiple agentic benchmarks including SWE-bench, tool-use tests, and planning evaluations. The canonical single-number metric for "how good is this model as an agent?"
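Artificial Analysis does not publish its exact weighting here, so the sketch below shows only the generic shape of such a composite: each contributing benchmark is put on a common 0-100 scale and combined as a weighted mean. The benchmark names and the equal weights are placeholders, not the published methodology.

```python
# Generic composite-index sketch (NOT Artificial Analysis's published methodology).
# Benchmark names and equal weighting are placeholders for illustration.
def composite_index(scores: dict[str, float], weights: dict[str, float] | None = None) -> float:
    """Weighted mean of per-benchmark scores, each already on a 0-100 scale."""
    if weights is None:
        weights = {name: 1.0 for name in scores}   # equal weighting by default
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight

# Example with placeholder benchmark scores:
# composite_index({"swe_bench": 74.4, "tool_use": 62.0, "planning": 58.3})  # -> 64.9
```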

1. OpenAI · GPT-5.4 · 69.4
2. Anthropic · Claude Opus 4.6 (Fast) · 67.6
3. z-ai · GLM 5.1 · 67.0
4. z-ai · GLM 5 Turbo · 63.1
5. Anthropic · Claude Sonnet 4.6 · 63.0
66 models tested

Artificial Analysis — Coding Index

speed

Artificial Analysis Coding Index · a composite score that aggregates performance across multiple coding benchmarks into a single index. Tracks code generation quality, debugging ability, multi-language competence, and real-world software engineering tasks. Used by Artificial Analysis to rank model coding capability in a normalized, comparable format. Useful for developers choosing between models for coding-heavy workloads.

1. OpenAI · GPT-5.4 · 57.3
2. Google DeepMind · Gemini 3.1 Pro Preview · 55.5
3. OpenAI · GPT-5.3-Codex · 53.1
4. OpenAI · GPT-5.4 Mini · 51.5
5. Anthropic · Claude Sonnet 4.6 · 50.9
67 models tested

Artificial Analysis — Quality Index

speed
1. Google DeepMind · Gemini 3.1 Pro Preview · 57.2
2. OpenAI · GPT-5.4 · 57.2
3. OpenAI · GPT-5.3-Codex · 54.0
4. Anthropic · Claude Opus 4.6 (Fast) · 53.0
5. Muse Spark · 52.1
68 models tested

general

BBH (HuggingFace)

general
1. Alibaba Qwen · Qwen2.5 72B Instruct · 61.9
2. Qwen2.5 72B Instruct Abliterated · 60.5
3. Meta · Llama 3.3 70B Instruct · 56.6
4. Alibaba · Qwen2.5 32B Instruct · 56.5
5. Meta · Llama 3.1 70B Instruct · 55.9
73 models tested

arena

Chatbot Arena Elo — Coding

arena
1. Anthropic · Claude Opus 4.6 (Fast) · 1546.2
2. Anthropic · Claude Opus 4.6 · 1542.9
3. Anthropic · Claude Sonnet 4.6 · 1521.0
4. Anthropic · Claude Opus 4.5 · 1465.2
5. Google DeepMind · Gemini 3.1 Pro Preview · 1455.7
27 models tested

Chatbot Arena Elo — Overall

arena
1. Anthropic · Claude Opus 4.6 (Fast) · 1502.8
2. Anthropic · Claude Opus 4.6 · 1496.6
3. Google DeepMind · Gemini 3.1 Pro Preview · 1492.6
4. Google DeepMind · Gemini 3 Pro · 1486.2
5. Google DeepMind · Gemini 3 Flash Preview · 1473.9
113 models tested

safety

Fortress

safety
1. Anthropic · Claude Opus 4.5 · 13.6
2. Anthropic · Claude 3.5 Sonnet · 13.0
3. OpenAI · gpt-oss-120b · 8.2
3 models tested

MASK

safety
1. Anthropic · Claude Opus 4.6 (Fast) · 96.3
2. Anthropic · Claude Sonnet 4 · 95.3
2 models tested

PropensityBench

safety
1. Alibaba · Qwen2.5 32B Instruct · 22.9
1 model tested

math

FrontierMath-2025-02-28-Private

math

FrontierMath (Feb 2025) · original research-level math problems created by mathematicians, testing capabilities at the boundary of current AI mathematical reasoning.

1. OpenAI · GPT-5.4 Pro · 50.0
2. OpenAI · GPT-5.4 · 47.6
3. Anthropic · Claude Opus 4.6 · 40.7
4. OpenAI · GPT-5.2 · 40.7
5. Muse Spark · 39.0
54 models tested

FrontierMath-Tier-4-2025-07-01-Private

math

FrontierMath Tier 4 (Jul 2025) · the most challenging tier of frontier mathematics, containing problems that push the absolute limits of AI mathematical reasoning.

1. OpenAI · GPT-5.4 Pro · 37.5
2. OpenAI · GPT-5.2 Pro · 31.3
3. OpenAI · GPT-5.4 · 27.1
4. Anthropic · Claude Opus 4.6 · 22.9
5. OpenAI · GPT-5.2 · 18.8
37 models tested

GSM8K

math

Grade School Math 8K · 8,500 linguistically diverse grade-school math word problems that require multi-step reasoning to solve.
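An invented problem in the GSM8K style (not a dataset item) shows what "multi-step" means in practice: the answer chains several elementary operations rather than one.

```python
# Invented GSM8K-style problem, worked step by step (not an actual dataset item):
# "A bakery makes 12 trays of 8 muffins each. It sells 70 muffins in the morning
#  and half of the rest in the afternoon. How many muffins are left?"
baked = 12 * 8                        # 96 muffins baked
after_morning = baked - 70            # 26 left after morning sales
sold_afternoon = after_morning // 2   # 13 sold in the afternoon
left = after_morning - sold_afternoon
print(left)  # 13
```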

1. OpenAI · GPT-4 (older v0314) · 92.0
2. OpenAI · GPT-4o-mini (2024-07-18) · 91.3
3. OpenAI · GPT-4o-mini · 91.3
4. Alibaba Qwen · Qwen2.5 Coder 32B Instruct · 91.1
5. OpenAI · GPT-4 Turbo · 90.0
32 models tested

HELM — Omni-MATH

math
1. OpenAI · GPT-5 Mini · 72.2
2. OpenAI · o4 Mini · 72.0
3. OpenAI · o3 · 71.4
4. OpenAI · gpt-oss-120b · 68.8
5. moonshotai · Kimi K2 0711 · 65.4
34 models tested

LiveBench — Mathematics

math
1. OpenAI · GPT-5.2-Codex · 88.8
2. z-ai · GLM 5.1 · 84.9
3. Alibaba Qwen · Qwen3.6 Plus · 83.7
4. OpenAI · GPT-5.1-Codex-Max · 83.7
5. z-ai · GLM 5 · 83.5
29 models tested

MATH level 5

math

MATH Level 5 · the hardest tier of the MATH benchmark, featuring competition-level problems from AMC, AIME, and Olympiad-style mathematics.

1. OpenAI · GPT-5 · 98.1
2. OpenAI · GPT-5 Mini · 97.8
3. OpenAI · o4 Mini · 97.8
4. OpenAI · o3 · 97.8
5. Anthropic · Claude Sonnet 4.5 · 97.7
72 models tested

MATH Level 5

math
1. Alibaba · Qwen2.5 32B Instruct · 62.5
2. Qwen2.5 72B Instruct Abliterated · 60.1
3. Alibaba Qwen · Qwen2.5 72B Instruct · 59.8
4. DeepSeek · DeepSeek R1 Distill Qwen 14B · 57.0
5. Alibaba · Qwen2.5 14B Instruct · 55.3
73 models tested

OpenCompass — AIME2025

math
1. DeepSeek · DeepSeek V3.2 Speciale · 96.0
2. z-ai · GLM 5 · 95.8
3. stepfun · Step 3.5 Flash · 95.7
4. z-ai · GLM 4.7 · 95.4
5. moonshotai · Kimi K2 Thinking · 94.1
32 models tested

OTIS Mock AIME 2024-2025

math

OTIS Mock AIME 2024-2025 · simulated American Invitational Mathematics Examination problems testing advanced problem-solving skills.

1. OpenAI · GPT-5.2 · 96.1
2. Google DeepMind · Gemini 3.1 Pro Preview · 95.6
3. OpenAI · GPT-5.4 · 95.3
4. Anthropic · Claude Opus 4.6 · 94.4
5. Google DeepMind · Gemini 3 Flash Preview · 92.8
86 models tested

USAMO

math
1. Anthropic · Claude Mythos Preview · 97.6
1 model tested

language

HELM — IFEval

language
1. xAI · Grok 3 Mini Beta · 95.1
2. xAI · Grok 4 · 94.9
3. OpenAI · GPT-5.1 · 93.5
4. OpenAI · GPT-5 Nano · 93.2
5. OpenAI · o4 Mini · 92.9
34 models tested

IFEval

language
1. Meta · Llama 3.3 70B Instruct · 90.0
2. Meta · Llama 3.1 70B Instruct · 86.7
3. Alibaba Qwen · Qwen2.5 72B Instruct · 86.4
4. Qwen2.5 72B Instruct Abliterated · 85.9
5. Alibaba · Qwen2.5 32B Instruct · 83.5
73 models tested

JCommonsenseQA

language
1. DeepSeek · DeepSeek R1 Distill Qwen 14B · 93.7
2. Alibaba · Qwen2 7B Instruct · 89.1
3. Alibaba · Qwen2 VL 7B Instruct · 87.8
4. Meta · Meta Llama 3 8B Instruct · 87.7
5. Meta · Meta Llama 3 8B · 82.9
11 models tested

JHumanEval

language
0 models tested

JMMLU

language
1. DeepSeek · DeepSeek R1 Distill Qwen 14B · 63.4
2. Alibaba · Qwen2 7B Instruct · 56.5
3. Alibaba · Qwen2 VL 7B Instruct · 56.3
4. Meta · Meta Llama 3 8B Instruct · 46.7
5. Meta · Meta Llama 3 8B · 44.7
11 models tested

JNLI

language
1. DeepSeek · DeepSeek R1 Distill Qwen 14B · 82.4
2. Alibaba · Qwen2 7B Instruct · 81.3
3. Alibaba · Qwen2 VL 7B Instruct · 74.4
4. DeepSeek · DeepSeek R1 Distill Llama 8B · 69.4
5. Meta · Meta Llama 3 8B Instruct · 61.1
11 models tested

JSQuAD

language
1. Alibaba · Qwen2 VL 7B Instruct · 89.9
2. DeepSeek · DeepSeek R1 Distill Qwen 14B · 89.8
3. Alibaba · Qwen2 7B Instruct · 89.6
4. Meta · Meta Llama 3 8B Instruct · 89.5
5. Meta · Meta Llama 3 8B · 88.9
11 models tested

LiveBench — IF

language
1. z-ai · GLM 5.1 · 68.5
2. Google DeepMind · Gemma 4 31B · 67.6
3. OpenAI · GPT-5.1-Codex-Max · 67.1
4. OpenAI · GPT-5.2-Codex · 66.5
5. OpenAI · GPT-5 Mini · 64.2
29 models tested

LiveBench — Language

language
1. z-ai · GLM 5 · 77.5
2. OpenAI · GPT-5.1-Codex-Max · 75.4
3. Alibaba Qwen · Qwen3.6 Plus · 75.0
4. OpenAI · GPT-5.2-Codex · 73.7
5. z-ai · GLM 5.1 · 71.8
29 models tested

LLM-JP — Overall

language
1. DeepSeek · DeepSeek R1 Distill Qwen 14B · 56.8
2. Alibaba · Qwen2 VL 7B Instruct · 53.0
3. Alibaba · Qwen2 7B Instruct · 51.7
4. Meta · Meta Llama 3 8B Instruct · 49.6
5. Meta · Meta Llama 3 8B · 48.9
11 models tested

MMMLU — Arabic

language
1. Alibaba · Qwen2 7B Instruct · 50.7
2. Meta · Meta Llama 3 8B Instruct · 40.5
2 models tested

MMMLU — Bengali

language
1. Alibaba · Qwen2 7B Instruct · 43.4
2. Meta · Meta Llama 3 8B Instruct · 36.4
2 models tested

MMMLU — Chinese

language
1. Alibaba · Qwen2 7B Instruct · 61.8
2. Meta · Meta Llama 3 8B Instruct · 51.4
2 models tested

MMMLU — French

language
1. Alibaba · Qwen2 7B Instruct · 60.8
2. Meta · Meta Llama 3 8B Instruct · 55.8
2 models tested

MMMLU — German

language
1. Alibaba · Qwen2 7B Instruct · 57.1
2. Meta · Meta Llama 3 8B Instruct · 53.5
2 models tested

MMMLU — Hindi

language
1. Alibaba · Qwen2 7B Instruct · 45.1
2. Meta · Meta Llama 3 8B Instruct · 41.4
2 models tested

MMMLU — Indonesian

language
1. Alibaba · Qwen2 7B Instruct · 54.1
2. Meta · Meta Llama 3 8B Instruct · 51.0
2 models tested

MMMLU — Italian

language
1. Alibaba · Qwen2 7B Instruct · 59.0
2. Meta · Meta Llama 3 8B Instruct · 53.3
2 models tested

MMMLU — Japanese

language
1. Alibaba · Qwen2 7B Instruct · 56.6
2. Meta · Meta Llama 3 8B Instruct · 42.3
2 models tested

MMMLU — Korean

language
1. Alibaba · Qwen2 7B Instruct · 54.0
2. Meta · Meta Llama 3 8B Instruct · 46.5
2 models tested

MMMLU — Portuguese

language
1. Alibaba · Qwen2 7B Instruct · 60.1
2. Meta · Meta Llama 3 8B Instruct · 55.5
2 models tested

MMMLU — Spanish

language
1. Alibaba · Qwen2 7B Instruct · 60.2
2. Meta · Meta Llama 3 8B Instruct · 55.8
2 models tested

MMMLU — Swahili

language
1. Meta · Meta Llama 3 8B Instruct · 37.5
2. Alibaba · Qwen2 7B Instruct · 34.3
2 models tested

MMMLU — Yoruba

language
1. Meta · Meta Llama 3 8B Instruct · 31.0
2. Alibaba · Qwen2 7B Instruct · 30.2
2 models tested

OpenCompass — IFEval

language
1. moonshotai · Kimi K2.5 · 93.9
2. z-ai · GLM 5 · 93.2
3. stepfun · Step 3.5 Flash · 93.2
4. moonshotai · Kimi K2 Thinking · 92.4
5. DeepSeek · DeepSeek V3.2 Speciale · 91.7
32 models tested

multimodal

VideoMME

multimodal

VideoMME · multimodal benchmark testing video understanding across diverse domains, requiring temporal reasoning and cross-frame comprehension.

1. Google DeepMind · Gemini 1.5 Pro (Feb 2024) · 66.7
2. Alibaba Qwen · Qwen2.5 72B Instruct · 64.7
3. OpenAI · GPT-4o (2024-11-20) · 62.5
4. OpenAI · GPT-4o (2024-08-06) · 62.5
5. Google DeepMind · Gemini 1.5 Flash (May 2024) · 60.4
8 models tested