
Benchmarks

40 benchmarks across 6 categories. Click a benchmark to see its full leaderboard.

knowledge

ARC AI2

knowledge

AI2 Reasoning Challenge — tests grade-school level science knowledge with multiple-choice questions requiring reasoning beyond simple retrieval.

1. DeepSeek - DeepSeek V3: 93.7
2. Meta - Llama 3.1-405B: 93.7
3. Alibaba Qwen - Qwen2.5 72B Instruct: 92.7
4. DeepSeek - DeepSeek-V2 (MoE-236B, May 2024): 89.6
5. Microsoft - phi-3-medium 14B: 88.8
48 models tested

HellaSwag

knowledge

HellaSwag — tests commonsense reasoning by asking models to predict the most plausible continuation of everyday scenarios.

1. Meta - Llama 3.1-405B: 85.6
2. TII - Falcon-180B: 85.3
3. DeepSeek - DeepSeek V3: 85.2
4. DeepSeek - DeepSeek-V2 (MoE-236B, May 2024): 82.8
5. Mistral AI - Mixtral 8x7B Instruct: 82.3
37 models tested

LAMBADA

knowledge

LAMBADA — measures the ability to predict the final word of a passage, requiring broad contextual understanding across long text spans.

1. TII - Falcon-180B: 79.8
2. Meta - Llama 2-70B: 78.9
3. Meta - LLaMA-65B: 77.7
4. TII - Falcon-40B: 77.3
5. Meta - LLaMA-33B: 77.2
16 models tested
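Since LAMBADA is scored as last-word accuracy, evaluation is essentially exact match on a single held-out word. A minimal sketch in Python, assuming a hypothetical `generate_continuation` callable that stands in for whatever model API is being tested (tokenization and punctuation handling are glossed over):

```python
from typing import Callable, Sequence

def lambada_accuracy(
    passages: Sequence[str],
    generate_continuation: Callable[[str], str],
) -> float:
    """Fraction of passages whose held-out final word is predicted exactly."""
    correct = 0
    for text in passages:
        # Hold out the last whitespace-separated word of the passage.
        context, _, target = text.rstrip().rpartition(" ")
        prediction = generate_continuation(context).strip().split()
        if prediction and prediction[0] == target:
            correct += 1
    return correct / len(passages)
```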

MMLU

knowledge

Massive Multitask Language Understanding — 57 subjects spanning STEM, humanities, social sciences, and more. The standard benchmark for broad knowledge.

1. OpenAI - GPT-4o (2024-11-20): 84.1
2. DeepSeek - DeepSeek V3: 82.9
3. Google - Gemini 1.5 Pro (Sept 2024): 82.5
4. Anthropic - Claude 3.5 Sonnet: 82.0
5. Meta - Llama 3.3 70B Instruct (free): 81.7
92 models tested
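MMLU (like ARC, HellaSwag, and the other multiple-choice sets above) is scored as plain accuracy over the chosen options. A minimal sketch, assuming a hypothetical `score_option` callable that returns the model's log-likelihood (or any comparable score) for a candidate answer; it is not tied to any particular harness:

```python
from typing import Callable, Sequence

def multiple_choice_accuracy(
    items: Sequence[dict],
    score_option: Callable[[str, str], float],
) -> float:
    """Accuracy over items shaped like:
    {"question": "...", "choices": ["...", "..."], "answer": 1}
    """
    correct = 0
    for item in items:
        scores = [score_option(item["question"], c) for c in item["choices"]]
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += int(predicted == item["answer"])
    return correct / len(items)

# Toy usage with a placeholder scorer (real scorers query a model):
items = [{"question": "2 + 2 = ?", "choices": ["3", "4"], "answer": 1}]
print(multiple_choice_accuracy(items, lambda q, c: float(c == "4")))  # 1.0
```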

GPQA diamond

knowledge

Graduate-Level Google-Proof QA (Diamond set) — expert-crafted questions in physics, biology, and chemistry that are difficult even for domain PhDs.

1. Google DeepMind - Gemini 3.1 Pro Preview: 92.1
2. OpenAI - GPT-5.4: 91.1
3. Google - Gemini 3 Pro: 90.2
4. OpenAI - GPT-5.2 Chat: 88.5
5. OpenAI - GPT-5.2: 88.5
115 models tested

Winogrande

knowledge

WinoGrande — large-scale commonsense reasoning benchmark where models must resolve ambiguous pronouns in carefully constructed sentence pairs.

1. Meta - Llama 3.1-405B: 78.4
2. Anthropic - Claude 3 Opus: 77.0
3. TII - Falcon-180B: 74.2
4. DeepSeek - DeepSeek-V2 (MoE-236B, May 2024): 72.6
5. DeepSeek - DeepSeek V3: 70.4
47 models tested

Lech Mazur Writing

knowledge

Lech Mazur Writing — evaluates creative writing ability, assessing prose quality, narrative coherence, and stylistic sophistication.

1. Moonshot AI - Kimi K2 0905: 87.3
2. OpenAI - GPT-5 Chat: 87.2
3. OpenAI - GPT-5: 87.2
4. Alibaba Qwen - Qwen3 Max: 87.1
5. Moonshot AI - Kimi K2 0711: 86.9
49 models tested

Fiction.LiveBench

knowledge

Fiction.LiveBench — a continuously updated benchmark using recently published fiction to test reading comprehension and reasoning, preventing data contamination.

1. OpenAI - GPT-5 Chat: 97.2
2. OpenAI - GPT-5: 97.2
3. OpenAI - o3 Pro: 97.2
4. xAI - Grok 4 Fast: 94.4
5. xAI - Grok 4: 94.4
53 models tested

SimpleQA Verified

knowledge

SimpleQA Verified — short factual questions with verified answers, measuring factual accuracy and the tendency to hallucinate or provide incorrect information.

1. Google DeepMind - Gemini 3.1 Pro Preview: 77.3
2. Google - Gemini 3 Pro: 72.9
3. Alibaba Qwen - Qwen3 Max: 67.5
4. Google DeepMind - Gemini 3 Flash Preview: 67.4
5. Google DeepMind - Gemini 2.5 Pro: 56.0
36 models tested
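Benchmarks like SimpleQA are typically graded by a separate judge model that marks each answer correct, incorrect, or not attempted; as a simplified stand-in, a normalized string match captures the shape of the check. A sketch (the normalization rules here are assumptions, not the official grader):

```python
import re
import string

def normalize(answer: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    answer = answer.lower()
    answer = answer.translate(str.maketrans("", "", string.punctuation))
    answer = re.sub(r"\b(a|an|the)\b", " ", answer)
    return " ".join(answer.split())

def exact_match(prediction: str, gold: str) -> bool:
    return normalize(prediction) == normalize(gold)

print(exact_match("The Eiffel Tower.", "eiffel tower"))  # True
```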

Chess Puzzles

knowledge

Chess Puzzles — tests strategic and tactical reasoning by having models solve chess puzzle positions, evaluating lookahead and pattern recognition abilities.

1. Google DeepMind - Gemini 3.1 Pro Preview: 55.0
2. OpenAI - GPT-5.2 Chat: 49.0
3. OpenAI - GPT-5.2: 49.0
4. Google DeepMind - Gemini 3 Flash Preview: 38.0
5. OpenAI - GPT-5 Chat: 37.0
29 models tested
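A common way to score chess-puzzle sets is to check the model's proposed move against the puzzle's known best move for a given FEN position. A sketch using the python-chess library; accepting either SAN or UCI answers is an assumption about the answer format, not this benchmark's actual grader:

```python
import chess  # pip install python-chess

def move_is_correct(fen: str, model_move: str, best_move_uci: str) -> bool:
    """True if the model's move (SAN or UCI text) equals the puzzle solution."""
    board = chess.Board(fen)
    try:
        move = board.parse_san(model_move)          # SAN, e.g. "Qxf7#"
    except ValueError:
        try:
            move = chess.Move.from_uci(model_move.strip().lower())
        except ValueError:
            return False
        if move not in board.legal_moves:
            return False
    return move.uci() == best_move_uci

# Scholar's-mate position; the mating move is Qxf7#.
fen = "r1bqkbnr/pppp1ppp/2n5/4p2Q/2B1P3/8/PPPP1PPP/RNB1K1NR w KQkq - 0 4"
print(move_is_correct(fen, "Qxf7#", "h5f7"))  # True
```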

HLE

knowledge

HLE (Humanity's Last Exam) — crowdsourced expert-level questions designed to be among the hardest possible challenges for AI systems across all domains.

1. Google - Gemini 3 Pro: 34.4
2. Anthropic - Claude Opus 4.6: 31.1
3. OpenAI - GPT-5 Pro: 28.2
4. OpenAI - GPT-5.2 Chat: 24.2
5. OpenAI - GPT-5.2: 24.2
27 models tested

TriviaQA

knowledge

TriviaQA — reading comprehension benchmark with trivia questions, requiring models to find and reason over evidence from provided documents.

1. Meta - Llama 2-70B: 87.6
2. Anthropic - Claude 2: 87.5
3. Meta - LLaMA-65B: 86.0
4. OpenAI - GPT-4 Turbo: 84.8
5. OpenAI - GPT-4 Turbo (older v1106): 84.8
31 models tested

ScienceQA

knowledge

ScienceQA — multimodal science questions spanning natural science, social science, and language science with diverse question formats and image context.

1. Anthropic - Claude 3 Haiku: 62.7
2. Meta - Llama 2-13B: 41.0
3. Meta - LLaMA-13B: 24.4
4. Meta - Llama 2-7B: 24.1
5. Meta - LLaMA-7B: 14.9
5 models tested

PIQA

knowledge

PIQA (Physical Interaction QA) — tests intuitive physical reasoning by asking models to select the correct approach for everyday physical tasks.

1. OpenAI - GPT-4o-mini (2024-07-18): 77.4
2. OpenAI - GPT-4o-mini: 77.4
3. Google - Gemini 1.5 Flash (Sep 2024): 75.0
4. Meta - Llama 3.1-405B: 71.8
5. TII - Falcon-180B: 69.8
36 models tested

OpenBookQA

knowledge

OpenBookQA — science questions that require combining a given core fact with broad common knowledge, mimicking an open-book exam setting.

1. Microsoft - phi-3-mini 3.8B: 84.0
2. Microsoft - phi-3-small 7.4B: 84.0
3. Microsoft - phi-3-medium 14B: 83.2
4. Mistral AI - Mixtral 8x7B Instruct: 81.1
5. Meta - Llama 3 8B Instruct: 76.8
27 models tested

Balrog

knowledge

Balrog — benchmarks AI agents on text-based adventure games, testing language understanding, strategic planning, and long-horizon reasoning.

1. Google DeepMind - Gemini 3 Flash Preview: 48.1
2. xAI - Grok 4: 43.6
3. DeepSeek - DeepSeek-R1: 34.9
4. OpenAI - GPT-5 Chat: 32.8
5. OpenAI - GPT-5: 32.8
20 models tested

GeoBench

knowledge

GeoBench — tests geographic knowledge and spatial reasoning across countries, landmarks, coordinates, and geopolitical understanding.

1. Google DeepMind - Gemini 3 Flash Preview: 88.0
2. Google - Gemini 3 Pro: 84.0
3. OpenAI - GPT-5 Chat: 81.0
4. OpenAI - GPT-5: 81.0
5. OpenAI - o1: 80.0
29 models tested

ANLI

knowledge

ANLI (Adversarial NLI) — adversarially constructed natural language inference dataset where each round targets weaknesses found in previous model generations.

1. Microsoft - phi-3-small 7.4B: 37.1
2. Meta - Llama 3 8B Instruct: 36.0
3. Microsoft - phi-3-medium 14B: 33.7
4. Mistral AI - Mixtral 8x7B Instruct: 32.8
5. Microsoft - phi-3-mini 3.8B: 29.2
8 models tested

DeepResearch Bench

knowledge

DeepResearch Bench — evaluates AI on complex multi-step research tasks requiring information gathering, synthesis, and producing comprehensive analyses.

1. Anthropic - Claude Sonnet 4.5: 52.6
2. OpenAI - GPT-5 Chat: 51.0
3. OpenAI - GPT-5: 51.0
4. Anthropic - Claude Opus 4.1: 49.7
5. Anthropic - Claude Opus 4: 49.0
12 models tested

VPCT

knowledge

VPCT (Visual Pattern Completion Test) — tests visual reasoning and pattern recognition by having models complete visual sequences and transformations.

1. Google - Gemini 3 Pro: 86.5
2. OpenAI - GPT-5.2 Chat: 76.0
3. OpenAI - GPT-5.2: 76.0
4. Google DeepMind - Gemini 3 Flash Preview: 58.9
5. OpenAI - GPT-5 Chat: 49.0
26 models tested

reasoning

BBH

reasoning

BIG-Bench Hard — a curated subset of 23 challenging tasks from BIG-Bench where language models previously failed to outperform average humans.

1. DeepSeek - DeepSeek V3: 83.3
2. Meta - Llama 3.1-405B: 77.2
3. Microsoft - phi-3-medium 14B: 75.2
4. Alibaba Qwen - Qwen2.5 72B Instruct: 73.1
5. Microsoft - phi-3-small 7.4B: 72.1
37 models tested

SimpleBench

reasoning

SimpleBench — tests fundamental reasoning capabilities with straightforward problems designed to expose gaps in basic logical and spatial thinking.

1. Google DeepMind - Gemini 3.1 Pro Preview: 75.5
2. Google - Gemini 3 Pro: 71.7
3. OpenAI - GPT-5.4 Pro: 68.9
4. Anthropic - Claude Opus 4.6: 61.1
5. Google DeepMind - Gemini 2.5 Pro: 54.9
61 models tested

ARC-AGI-2

reasoning

ARC-AGI-2 — the second iteration of the Abstraction and Reasoning Corpus, testing novel pattern recognition and abstract reasoning without prior training data.

1. OpenAI - GPT-5.4 Pro: 83.3
2. Google DeepMind - Gemini 3.1 Pro Preview: 77.1
3. OpenAI - GPT-5.4: 74.0
4. Anthropic - Claude Opus 4.6: 69.2
5. Anthropic - Claude Sonnet 4.6: 60.4
52 models tested

ARC-AGI

reasoning

ARC-AGI — the original Abstraction and Reasoning Corpus, testing whether AI can solve novel visual pattern recognition tasks without memorization.

1. Google DeepMind - Gemini 3.1 Pro Preview: 98.0
2. Anthropic - Claude Opus 4.6: 94.0
3. OpenAI - GPT-5.2 Chat: 86.2
4. OpenAI - GPT-5.2: 86.2
5. Anthropic - Claude Opus 4.5: 80.0
37 models tested
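ARC-AGI tasks are graded by exact match on the predicted output grid: every cell must equal the target. A minimal sketch assuming the public ARC JSON format (grids as lists of lists of small ints) and a single attempt per task, which simplifies the official scoring:

```python
from typing import List

Grid = List[List[int]]

def grid_exact_match(predicted: Grid, target: Grid) -> bool:
    """True only if shape and every cell value match."""
    return len(predicted) == len(target) and all(
        p_row == t_row for p_row, t_row in zip(predicted, target)
    )

def arc_score(predictions: List[Grid], targets: List[Grid]) -> float:
    """Fraction of tasks solved exactly."""
    solved = sum(grid_exact_match(p, t) for p, t in zip(predictions, targets))
    return solved / len(targets)
```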

math

GSM8K

math

Grade School Math 8K — 8,500 linguistically diverse grade-school math word problems that require multi-step reasoning to solve.

1. OpenAI - GPT-4o-mini (2024-07-18): 91.3
2. OpenAI - GPT-4o-mini: 91.3
3. Alibaba Qwen - Qwen2.5 Coder 32B Instruct: 91.1
4. OpenAI - GPT-4 Turbo: 90.0
5. OpenAI - GPT-4 Turbo (older v1106): 90.0
48 models tested
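GSM8K answers are single numbers, so scoring usually extracts the final number from the model's worked solution and compares it with the reference. A sketch of that extraction (the regex and normalization are assumptions, not the official grader):

```python
import re
from typing import Optional

def extract_final_number(text: str) -> Optional[str]:
    """Return the last number in the model's output, commas stripped."""
    matches = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    return matches[-1].replace(",", "") if matches else None

def gsm8k_correct(model_output: str, gold_answer: str) -> bool:
    predicted = extract_final_number(model_output)
    return predicted is not None and float(predicted) == float(gold_answer)

print(gsm8k_correct("She pays 3 * 4 = 12 dollars. The answer is 12.", "12"))  # True
```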

MATH level 5

math

MATH Level 5 — the hardest tier of the MATH benchmark, featuring competition-level problems from AMC, AIME, and Olympiad-style mathematics.

1. OpenAI - GPT-5 Chat: 98.1
2. OpenAI - GPT-5: 98.1
3. OpenAI - GPT-5 Mini: 97.8
4. OpenAI - o4 Mini: 97.8
5. OpenAI - o3: 97.8
89 models tested
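Competition-math answers are often expressions rather than bare numbers, so graders tend to check symbolic equivalence instead of string equality. A rough sketch using SymPy; real MATH graders also normalize LaTeX and many answer formats, so this is illustrative only:

```python
from sympy import simplify, sympify
from sympy.core.sympify import SympifyError

def answers_equivalent(predicted: str, gold: str) -> bool:
    """True if both strings parse and simplify to the same expression.

    Inputs must be plain SymPy-parseable expressions such as "3/4" or
    "sqrt(2)/2"; LaTeX handling is deliberately out of scope here.
    """
    try:
        difference = simplify(sympify(predicted) - sympify(gold))
    except (SympifyError, TypeError):
        return False
    return difference == 0

print(answers_equivalent("sqrt(8)", "2*sqrt(2)"))  # True
```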

OTIS Mock AIME 2024-2025

math

OTIS Mock AIME 2024–2025 — simulated American Invitational Mathematics Examination problems testing advanced problem-solving skills.

1. OpenAI - GPT-5.2 Chat: 96.1
2. OpenAI - GPT-5.2: 96.1
3. Google DeepMind - Gemini 3.1 Pro Preview: 95.6
4. OpenAI - GPT-5.4: 95.3
5. Anthropic - Claude Opus 4.6: 94.4
105 models tested

FrontierMath-2025-02-28-Private

math

FrontierMath (Feb 2025) — original research-level math problems created by mathematicians, testing capabilities at the boundary of current AI mathematical reasoning.

1. OpenAI - GPT-5.4 Pro: 50.0
2. OpenAI - GPT-5.4: 47.6
3. Anthropic - Claude Opus 4.6: 40.7
4. OpenAI - GPT-5.2 Chat: 40.7
5. OpenAI - GPT-5.2: 40.7
60 models tested

FrontierMath-Tier-4-2025-07-01-Private

math

FrontierMath Tier 4 (Jul 2025) — the most challenging tier of frontier mathematics, containing problems that push the absolute limits of AI mathematical reasoning.

1. OpenAI - GPT-5.4 Pro: 37.5
2. OpenAI - GPT-5.4: 27.1
3. Anthropic - Claude Opus 4.6: 22.9
4. OpenAI - GPT-5.2 Chat: 18.8
5. OpenAI - GPT-5.2: 18.8
39 models tested

coding

WeirdML

coding

WeirdML — tests models on unusual and adversarial machine learning tasks that require creative problem-solving beyond standard patterns.

1. Anthropic - Claude Opus 4.6: 77.9
2. OpenAI - GPT-5.2 Chat: 72.2
3. OpenAI - GPT-5.2: 72.2
4. Google DeepMind - Gemini 3.1 Pro Preview: 72.1
5. Google - Gemini 3 Pro: 69.9
87 models tested

Aider polyglot

coding

Aider Polyglot — measures how well AI models can edit code across multiple programming languages using the Aider coding assistant framework.

1. OpenAI - GPT-5 Chat: 88.0
2. OpenAI - GPT-5: 88.0
3. OpenAI - o3 Pro: 84.9
4. Google DeepMind - Gemini 2.5 Pro: 83.1
5. OpenAI - o3: 81.3
55 models tested

GSO-Bench

coding

GSO-Bench — evaluates AI models on real-world open-source software engineering tasks, testing the ability to understand and resolve actual GitHub issues.

1. Anthropic - Claude Opus 4.6: 33.3
2. OpenAI - GPT-5.2 Chat: 27.4
3. OpenAI - GPT-5.2: 27.4
4. Anthropic - Claude Opus 4.5: 26.5
5. Google - Gemini 3 Pro: 18.6
23 models tested

SWE-Bench Verified (Bash Only)

coding

SWE-Bench Verified (Bash Only) — a curated subset of SWE-bench where models fix real Python repository bugs using only bash commands, no agent frameworks.

1. Anthropic - Claude Opus 4.5: 74.4
2. Google - Gemini 3 Pro: 74.2
3. OpenAI - GPT-5.2 Chat: 71.8
4. OpenAI - GPT-5.2: 71.8
5. Anthropic - Claude Sonnet 4.5: 70.6
32 models tested
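SWE-bench-style scoring reduces to: apply the model's patch to the repository, run the issue's designated tests, and count the instance as resolved only if they pass. A heavily simplified sketch using git and pytest through subprocess; the real harness runs each instance in its own container with FAIL_TO_PASS and PASS_TO_PASS test lists, so this only shows the shape of the check:

```python
import subprocess

def run(cmd: list, cwd: str) -> bool:
    """Run a command in the repository; True on exit code 0."""
    return subprocess.run(cmd, cwd=cwd, capture_output=True).returncode == 0

def instance_resolved(repo_dir: str, patch_file: str, test_ids: list) -> bool:
    """Apply the model-generated patch, then require the target tests to pass."""
    if not run(["git", "apply", patch_file], cwd=repo_dir):
        return False  # patch does not apply cleanly
    return run(["python", "-m", "pytest", "-q", *test_ids], cwd=repo_dir)

# Hypothetical usage (paths and test id are illustrative only):
# instance_resolved("/tmp/repo", "model.patch",
#                   ["tests/test_io.py::test_roundtrip"])
```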

Terminal Bench

coding

Terminal Bench — tests the ability to accomplish real-world tasks using terminal commands, evaluating shell scripting and CLI tool proficiency.

1. Google DeepMind - Gemini 3.1 Pro Preview: 78.4
2. Anthropic - Claude Opus 4.6: 69.9
3. OpenAI - GPT-5.2 Chat: 64.9
4. OpenAI - GPT-5.2: 64.9
5. Google DeepMind - Gemini 3 Flash Preview: 64.3
27 models tested

CadEval

coding

CadEval — evaluates the ability to generate and reason about Computer-Aided Design code, testing spatial reasoning and engineering knowledge.

1. OpenAI - o3: 74.0
2. OpenAI - o4 Mini: 62.0
3. OpenAI - o1: 56.0
4. Anthropic - Claude 3.7 Sonnet: 54.0
5. Anthropic - Claude 3.7 Sonnet (thinking): 54.0
15 models tested

Cybench

coding

Cybench — evaluates AI on real Capture-The-Flag cybersecurity challenges, testing vulnerability analysis, exploitation, and security reasoning.

1. Anthropic - Claude Sonnet 4.5: 55.0
2. Anthropic - Claude Opus 4.1: 38.0
3. Anthropic - Claude Opus 4: 38.0
4. Anthropic - Claude Sonnet 4: 35.0
5. OpenAI - o3 Mini: 22.5
17 models tested

agentic

APEX-Agents

agentic

APEX-Agents — evaluates AI agents on complex, multi-step tasks requiring planning, tool use, and autonomous decision-making in realistic environments.

1. OpenAI - GPT-5.4: 35.9
2. OpenAI - GPT-5.2 Chat: 34.3
3. OpenAI - GPT-5.2: 34.3
4. Google DeepMind - Gemini 3.1 Pro Preview: 33.5
5. Anthropic - Claude Opus 4.6: 31.7
21 models tested

OSWorld

agentic

OSWorld — tests AI agents on real-world computer tasks across operating systems, including web browsing, file management, and application use.

1. Anthropic - Claude Opus 4.5: 66.3
2. Moonshot AI - Kimi K2.5: 63.3
3. Anthropic - Claude Sonnet 4.5: 62.9
4. Anthropic - Claude Sonnet 4: 43.9
5. Anthropic - Claude 3.7 Sonnet: 35.8
8 models tested

The Agent Company

agentic

The Agent Company — tests AI agents on realistic corporate tasks like email management, code review, data analysis, and cross-tool workflows.

1. DeepSeek - DeepSeek V3.2 Exp: 42.9
2. Anthropic - Claude Sonnet 4: 33.1
3. Anthropic - Claude 3.7 Sonnet: 30.9
4. Anthropic - Claude 3.7 Sonnet (thinking): 30.9
5. OpenAI - GPT-4o (2024-11-20): 8.6
10 models tested

multimodal

VideoMME

multimodal

VideoMME — multimodal benchmark testing video understanding across diverse domains, requiring temporal reasoning and cross-frame comprehension.

1. Google - Gemini 1.5 Pro (Feb 2024): 66.7
2. Alibaba Qwen - Qwen2.5 72B Instruct: 64.7
3. OpenAI - GPT-4o (2024-11-20): 62.5
4. OpenAI - GPT-4o (2024-08-06): 62.5
5. OpenAI - GPT-4o (2024-05-13): 62.5
11 models tested