
Benchmarks

40 benchmarks across 6 categories. Click a benchmark to see its full leaderboard.

knowledge

ARC AI2

knowledge

AI2 Reasoning Challenge — tests grade-school level science knowledge with multiple-choice questions requiring reasoning beyond simple retrieval.

1. DeepSeek - DeepSeek V3: 93.7
2. Meta - Llama 3.1-405B: 93.7
3. Alibaba Qwen - Qwen2.5 72B Instruct: 92.7
4. DeepSeek - DeepSeek-V2 (MoE-236B, May 2024): 89.6
5. Microsoft - phi-3-medium 14B: 88.8
48 models tested

HellaSwag

knowledge

HellaSwag — tests commonsense reasoning by asking models to predict the most plausible continuation of everyday scenarios.

1. Meta - Llama 3.1-405B: 85.6
2. TII - Falcon-180B: 85.3
3. DeepSeek - DeepSeek V3: 85.2
4. DeepSeek - DeepSeek-V2 (MoE-236B, May 2024): 82.8
5. Mistral AI - Mixtral 8x7B Instruct: 82.3
37 models tested

LAMBADA

knowledge

LAMBADA — measures the ability to predict the final word of a passage, requiring broad contextual understanding across long text spans.

1. TII - Falcon-180B: 79.8
2. Meta - Llama 2-70B: 78.9
3. Meta - LLaMA-65B: 77.7
4. TII - Falcon-40B: 77.3
5. Meta - LLaMA-33B: 77.2
16 models tested
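Since LAMBADA is scored as last-word accuracy, evaluation is essentially exact match on a single held-out word. A minimal sketch in Python, assuming a hypothetical `generate_continuation` callable that stands in for whatever model API is being tested (tokenization and punctuation handling are glossed over):

```python
from typing import Callable, Sequence

def lambada_accuracy(
    passages: Sequence[str],
    generate_continuation: Callable[[str], str],
) -> float:
    """Fraction of passages whose held-out final word is predicted exactly."""
    correct = 0
    for text in passages:
        # Hold out the last whitespace-separated word of the passage.
        context, _, target = text.rstrip().rpartition(" ")
        prediction = generate_continuation(context).strip().split()
        if prediction and prediction[0] == target:
            correct += 1
    return correct / len(passages)
```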

MMLU

knowledge

Massive Multitask Language Understanding — 57 subjects spanning STEM, humanities, social sciences, and more. The standard benchmark for broad knowledge.

1. OpenAI - GPT-4o (2024-11-20): 84.1
2. DeepSeek - DeepSeek V3: 82.9
3. Google - Gemini 1.5 Pro (Sept 2024): 82.5
4. Anthropic - Claude 3.5 Sonnet: 82.0
5. Meta - Llama 3.3 70B Instruct (free): 81.7
92 models tested
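MMLU (like ARC, HellaSwag, and the other multiple-choice sets above) is scored as plain accuracy over the chosen options. A minimal sketch, assuming a hypothetical `score_option` callable that returns the model's log-likelihood (or any comparable score) for a candidate answer; it is not tied to any particular harness:

```python
from typing import Callable, Sequence

def multiple_choice_accuracy(
    items: Sequence[dict],
    score_option: Callable[[str, str], float],
) -> float:
    """Accuracy over items shaped like:
    {"question": "...", "choices": ["...", "..."], "answer": 1}
    """
    correct = 0
    for item in items:
        scores = [score_option(item["question"], c) for c in item["choices"]]
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += int(predicted == item["answer"])
    return correct / len(items)

# Toy usage with a placeholder scorer (real scorers query a model):
items = [{"question": "2 + 2 = ?", "choices": ["3", "4"], "answer": 1}]
print(multiple_choice_accuracy(items, lambda q, c: float(c == "4")))  # 1.0
```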

GPQA diamond

knowledge

Graduate-Level Google-Proof QA (Diamond set) — expert-crafted questions in physics, biology, and chemistry that are difficult even for domain PhDs.

1. Google DeepMind - Gemini 3.1 Pro Preview: 92.1
2. OpenAI - GPT-5.4: 91.1
3. Google - Gemini 3 Pro: 90.2
4. OpenAI - GPT-5.2 Chat: 88.5
5. OpenAI - GPT-5.2: 88.5
115 models tested

Winogrande

knowledge

WinoGrande — large-scale commonsense reasoning benchmark where models must resolve ambiguous pronouns in carefully constructed sentence pairs.

1. Meta - Llama 3.1-405B: 78.4
2. Anthropic - Claude 3 Opus: 77.0
3. TII - Falcon-180B: 74.2
4. DeepSeek - DeepSeek-V2 (MoE-236B, May 2024): 72.6
5. DeepSeek - DeepSeek V3: 70.4
47 models tested

Lech Mazur Writing

knowledge

Lech Mazur Writing — evaluates creative writing ability, assessing prose quality, narrative coherence, and stylistic sophistication.

1. Moonshot AI - Kimi K2 0905: 87.3
2. OpenAI - GPT-5 Chat: 87.2
3. OpenAI - GPT-5: 87.2
4. Alibaba Qwen - Qwen3 Max: 87.1
5. Moonshot AI - Kimi K2 0711: 86.9
49 models tested

Fiction.LiveBench

knowledge

Fiction.LiveBench — a continuously updated benchmark using recently published fiction to test reading comprehension and reasoning, preventing data contamination.

1. OpenAI - GPT-5 Chat: 97.2
2. OpenAI - GPT-5: 97.2
3. OpenAI - o3 Pro: 97.2
4. xAI - Grok 4 Fast: 94.4
5. xAI - Grok 4: 94.4
53 models tested

SimpleQA Verified

knowledge

SimpleQA Verified — short factual questions with verified answers, measuring factual accuracy and the tendency to hallucinate or provide incorrect information.

1. Google DeepMind - Gemini 3.1 Pro Preview: 77.3
2. Google - Gemini 3 Pro: 72.9
3. Alibaba Qwen - Qwen3 Max: 67.5
4. Google DeepMind - Gemini 3 Flash Preview: 67.4
5. Google DeepMind - Gemini 2.5 Pro: 56.0
36 models tested
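Benchmarks like SimpleQA are typically graded by a separate judge model that marks each answer correct, incorrect, or not attempted; as a simplified stand-in, a normalized string match captures the shape of the check. A sketch (the normalization rules here are assumptions, not the official grader):

```python
import re
import string

def normalize(answer: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    answer = answer.lower()
    answer = answer.translate(str.maketrans("", "", string.punctuation))
    answer = re.sub(r"\b(a|an|the)\b", " ", answer)
    return " ".join(answer.split())

def exact_match(prediction: str, gold: str) -> bool:
    return normalize(prediction) == normalize(gold)

print(exact_match("The Eiffel Tower.", "eiffel tower"))  # True
```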

Chess Puzzles

knowledge

Chess Puzzles — tests strategic and tactical reasoning by having models solve chess puzzle positions, evaluating lookahead and pattern recognition abilities.

1. Google DeepMind - Gemini 3.1 Pro Preview: 55.0
2. OpenAI - GPT-5.2 Chat: 49.0
3. OpenAI - GPT-5.2: 49.0
4. Google DeepMind - Gemini 3 Flash Preview: 38.0
5. OpenAI - GPT-5 Chat: 37.0
29 models tested
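A common way to score chess-puzzle sets is to check the model's proposed move against the puzzle's known best move for a given FEN position. A sketch using the python-chess library; accepting either SAN or UCI answers is an assumption about the answer format, not this benchmark's actual grader:

```python
import chess  # pip install python-chess

def move_is_correct(fen: str, model_move: str, best_move_uci: str) -> bool:
    """True if the model's move (SAN or UCI text) equals the puzzle solution."""
    board = chess.Board(fen)
    try:
        move = board.parse_san(model_move)          # SAN, e.g. "Qxf7#"
    except ValueError:
        try:
            move = chess.Move.from_uci(model_move.strip().lower())
        except ValueError:
            return False
        if move not in board.legal_moves:
            return False
    return move.uci() == best_move_uci

# Scholar's-mate position; the mating move is Qxf7#.
fen = "r1bqkbnr/pppp1ppp/2n5/4p2Q/2B1P3/8/PPPP1PPP/RNB1K1NR w KQkq - 0 4"
print(move_is_correct(fen, "Qxf7#", "h5f7"))  # True
```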

HLE

knowledge

HLE (Humanity's Last Exam) — crowdsourced expert-level questions designed to be among the hardest possible challenges for AI systems across all domains.

1. Google - Gemini 3 Pro: 34.4
2. Anthropic - Claude Opus 4.6: 31.1
3. OpenAI - GPT-5 Pro: 28.2
4. OpenAI - GPT-5.2 Chat: 24.2
5. OpenAI - GPT-5.2: 24.2
27 models tested

TriviaQA

knowledge

TriviaQA — reading comprehension benchmark with trivia questions, requiring models to find and reason over evidence from provided documents.

1. Meta - Llama 2-70B: 87.6
2. Anthropic - Claude 2: 87.5
3. Meta - LLaMA-65B: 86.0
4. OpenAI - GPT-4 Turbo: 84.8
5. OpenAI - GPT-4 Turbo (older v1106): 84.8
31 models tested

ScienceQA

knowledge

ScienceQA — multimodal science questions spanning natural science, social science, and language science with diverse question formats and image context.

1. Anthropic - Claude 3 Haiku: 62.7
2. Meta - Llama 2-13B: 41.0
3. Meta - LLaMA-13B: 24.4
4. Meta - Llama 2-7B: 24.1
5. Meta - LLaMA-7B: 14.9
5 models tested

PIQA

knowledge

PIQA (Physical Interaction QA) — tests intuitive physical reasoning by asking models to select the correct approach for everyday physical tasks.

1. OpenAI - GPT-4o-mini (2024-07-18): 77.4
2. OpenAI - GPT-4o-mini: 77.4
3. Google - Gemini 1.5 Flash (Sep 2024): 75.0
4. Meta - Llama 3.1-405B: 71.8
5. TII - Falcon-180B: 69.8
36 models tested

OpenBookQA

knowledge

OpenBookQA — science questions that require combining a given core fact with broad common knowledge, mimicking an open-book exam setting.

1. Microsoft - phi-3-mini 3.8B: 84.0
2. Microsoft - phi-3-small 7.4B: 84.0
3. Microsoft - phi-3-medium 14B: 83.2
4. Mistral AI - Mixtral 8x7B Instruct: 81.1
5. Meta - Llama 3 8B Instruct: 76.8
27 models tested

Balrog

knowledge

Balrog — benchmarks AI agents on text-based adventure games, testing language understanding, strategic planning, and long-horizon reasoning.

1. Google DeepMind - Gemini 3 Flash Preview: 48.1
2. xAI - Grok 4: 43.6
3. DeepSeek - DeepSeek-R1: 34.9
4. OpenAI - GPT-5 Chat: 32.8
5. OpenAI - GPT-5: 32.8
20 models tested

GeoBench

knowledge

GeoBench — tests geographic knowledge and spatial reasoning across countries, landmarks, coordinates, and geopolitical understanding.

1. Google DeepMind - Gemini 3 Flash Preview: 88.0
2. Google - Gemini 3 Pro: 84.0
3. OpenAI - GPT-5 Chat: 81.0
4. OpenAI - GPT-5: 81.0
5. OpenAI - o1: 80.0
29 models tested

ANLI

knowledge

ANLI (Adversarial NLI) — adversarially constructed natural language inference dataset where each round targets weaknesses found in previous model generations.

1. Microsoft - phi-3-small 7.4B: 37.1
2. Meta - Llama 3 8B Instruct: 36.0
3. Microsoft - phi-3-medium 14B: 33.7
4. Mistral AI - Mixtral 8x7B Instruct: 32.8
5. Microsoft - phi-3-mini 3.8B: 29.2
8 models tested

DeepResearch Bench

knowledge

DeepResearch Bench — evaluates AI on complex multi-step research tasks requiring information gathering, synthesis, and producing comprehensive analyses.

1. Anthropic - Claude Sonnet 4.5: 52.6
2. OpenAI - GPT-5 Chat: 51.0
3. OpenAI - GPT-5: 51.0
4. Anthropic - Claude Opus 4.1: 49.7
5. Anthropic - Claude Opus 4: 49.0
12 models tested

VPCT

knowledge

VPCT (Visual Pattern Completion Test) — tests visual reasoning and pattern recognition by having models complete visual sequences and transformations.

1. Google - Gemini 3 Pro: 86.5
2. OpenAI - GPT-5.2 Chat: 76.0
3. OpenAI - GPT-5.2: 76.0
4. Google DeepMind - Gemini 3 Flash Preview: 58.9
5. OpenAI - GPT-5 Chat: 49.0
26 models tested

reasoning

BBH

reasoning

BIG-Bench Hard — a curated subset of 23 challenging tasks from BIG-Bench where language models previously failed to outperform average humans.

1. DeepSeek - DeepSeek V3: 83.3
2. Meta - Llama 3.1-405B: 77.2
3. Microsoft - phi-3-medium 14B: 75.2
4. Alibaba Qwen - Qwen2.5 72B Instruct: 73.1
5. Microsoft - phi-3-small 7.4B: 72.1
37 models tested

SimpleBench

reasoning

SimpleBench — tests fundamental reasoning capabilities with straightforward problems designed to expose gaps in basic logical and spatial thinking.

1. Google DeepMind - Gemini 3.1 Pro Preview: 75.5
2. Google - Gemini 3 Pro: 71.7
3. OpenAI - GPT-5.4 Pro: 68.9
4. Anthropic - Claude Opus 4.6: 61.1
5. Google DeepMind - Gemini 2.5 Pro: 54.9
61 models tested

ARC-AGI-2

reasoning

ARC-AGI-2 — the second iteration of the Abstraction and Reasoning Corpus, testing novel pattern recognition and abstract reasoning without prior training data.

1. OpenAI - GPT-5.4 Pro: 83.3
2. Google DeepMind - Gemini 3.1 Pro Preview: 77.1
3. OpenAI - GPT-5.4: 74.0
4. Anthropic - Claude Opus 4.6: 69.2
5. Anthropic - Claude Sonnet 4.6: 60.4
52 models tested

ARC-AGI

reasoning

ARC-AGI — the original Abstraction and Reasoning Corpus, testing whether AI can solve novel visual pattern recognition tasks without memorization.

1. Google DeepMind - Gemini 3.1 Pro Preview: 98.0
2. Anthropic - Claude Opus 4.6: 94.0
3. OpenAI - GPT-5.2 Chat: 86.2
4. OpenAI - GPT-5.2: 86.2
5. Anthropic - Claude Opus 4.5: 80.0
37 models tested
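ARC-AGI tasks are graded by exact match on the predicted output grid: every cell must equal the target. A minimal sketch assuming the public ARC JSON format (grids as lists of lists of small ints) and a single attempt per task, which simplifies the official scoring:

```python
from typing import List

Grid = List[List[int]]

def grid_exact_match(predicted: Grid, target: Grid) -> bool:
    """True only if shape and every cell value match."""
    return len(predicted) == len(target) and all(
        p_row == t_row for p_row, t_row in zip(predicted, target)
    )

def arc_score(predictions: List[Grid], targets: List[Grid]) -> float:
    """Fraction of tasks solved exactly."""
    solved = sum(grid_exact_match(p, t) for p, t in zip(predictions, targets))
    return solved / len(targets)
```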

math

GSM8K

math

Grade School Math 8K — 8,500 linguistically diverse grade-school math word problems that require multi-step reasoning to solve.

1. OpenAI - GPT-4o-mini (2024-07-18): 91.3
2. OpenAI - GPT-4o-mini: 91.3
3. Alibaba Qwen - Qwen2.5 Coder 32B Instruct: 91.1
4. OpenAI - GPT-4 Turbo: 90.0
5. OpenAI - GPT-4 Turbo (older v1106): 90.0
48 models tested
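GSM8K answers are single numbers, so scoring usually extracts the final number from the model's worked solution and compares it with the reference. A sketch of that extraction (the regex and normalization are assumptions, not the official grader):

```python
import re
from typing import Optional

def extract_final_number(text: str) -> Optional[str]:
    """Return the last number in the model's output, commas stripped."""
    matches = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    return matches[-1].replace(",", "") if matches else None

def gsm8k_correct(model_output: str, gold_answer: str) -> bool:
    predicted = extract_final_number(model_output)
    return predicted is not None and float(predicted) == float(gold_answer)

print(gsm8k_correct("She pays 3 * 4 = 12 dollars. The answer is 12.", "12"))  # True
```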

MATH level 5

math

MATH Level 5 — the hardest tier of the MATH benchmark, featuring competition-level problems from AMC, AIME, and Olympiad-style mathematics.

1. OpenAI - GPT-5 Chat: 98.1
2. OpenAI - GPT-5: 98.1
3. OpenAI - GPT-5 Mini: 97.8
4. OpenAI - o4 Mini: 97.8
5. OpenAI - o3: 97.8
89 models tested
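Competition-math answers are often expressions rather than bare numbers, so graders tend to check symbolic equivalence instead of string equality. A rough sketch using SymPy; real MATH graders also normalize LaTeX and many answer formats, so this is illustrative only:

```python
from sympy import simplify, sympify
from sympy.core.sympify import SympifyError

def answers_equivalent(predicted: str, gold: str) -> bool:
    """True if both strings parse and simplify to the same expression.

    Inputs must be plain SymPy-parseable expressions such as "3/4" or
    "sqrt(2)/2"; LaTeX handling is deliberately out of scope here.
    """
    try:
        difference = simplify(sympify(predicted) - sympify(gold))
    except (SympifyError, TypeError):
        return False
    return difference == 0

print(answers_equivalent("sqrt(8)", "2*sqrt(2)"))  # True
```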

OTIS Mock AIME 2024-2025

math

OTIS Mock AIME 2024–2025 — simulated American Invitational Mathematics Examination problems testing advanced problem-solving skills.

1. OpenAI - GPT-5.2 Chat: 96.1
2. OpenAI - GPT-5.2: 96.1
3. Google DeepMind - Gemini 3.1 Pro Preview: 95.6
4. OpenAI - GPT-5.4: 95.3
5. Anthropic - Claude Opus 4.6: 94.4
105 models tested

FrontierMath-2025-02-28-Private

math

FrontierMath (Feb 2025) — original research-level math problems created by mathematicians, testing capabilities at the boundary of current AI mathematical reasoning.

1. OpenAI - GPT-5.4 Pro: 50.0
2. OpenAI - GPT-5.4: 47.6
3. Anthropic - Claude Opus 4.6: 40.7
4. OpenAI - GPT-5.2 Chat: 40.7
5. OpenAI - GPT-5.2: 40.7
60 models tested

FrontierMath-Tier-4-2025-07-01-Private

math

FrontierMath Tier 4 (Jul 2025) — the most challenging tier of frontier mathematics, containing problems that push the absolute limits of AI mathematical reasoning.

1. OpenAI - GPT-5.4 Pro: 37.5
2. OpenAI - GPT-5.4: 27.1
3. Anthropic - Claude Opus 4.6: 22.9
4. OpenAI - GPT-5.2 Chat: 18.8
5. OpenAI - GPT-5.2: 18.8
39 models tested

coding

WeirdML

coding

WeirdML — tests models on unusual and adversarial machine learning tasks that require creative problem-solving beyond standard patterns.

1. Anthropic - Claude Opus 4.6: 77.9
2. OpenAI - GPT-5.2 Chat: 72.2
3. OpenAI - GPT-5.2: 72.2
4. Google DeepMind - Gemini 3.1 Pro Preview: 72.1
5. Google - Gemini 3 Pro: 69.9
87 models tested

Aider polyglot

coding

Aider Polyglot — measures how well AI models can edit code across multiple programming languages using the Aider coding assistant framework.

1. OpenAI - GPT-5 Chat: 88.0
2. OpenAI - GPT-5: 88.0
3. OpenAI - o3 Pro: 84.9
4. Google DeepMind - Gemini 2.5 Pro: 83.1
5. OpenAI - o3: 81.3
55 models tested

GSO-Bench

coding

GSO-Bench — evaluates AI models on real-world open-source software engineering tasks, testing the ability to understand and resolve actual GitHub issues.

1. Anthropic - Claude Opus 4.6: 33.3
2. OpenAI - GPT-5.2 Chat: 27.4
3. OpenAI - GPT-5.2: 27.4
4. Anthropic - Claude Opus 4.5: 26.5
5. Google - Gemini 3 Pro: 18.6
23 models tested

SWE-Bench Verified (Bash Only)

coding

SWE-Bench Verified (Bash Only) — a curated subset of SWE-bench where models fix real Python repository bugs using only bash commands, no agent frameworks.

1. Anthropic - Claude Opus 4.5: 74.4
2. Google - Gemini 3 Pro: 74.2
3. OpenAI - GPT-5.2 Chat: 71.8
4. OpenAI - GPT-5.2: 71.8
5. Anthropic - Claude Sonnet 4.5: 70.6
32 models tested
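SWE-bench-style scoring reduces to: apply the model's patch to the repository, run the issue's designated tests, and count the instance as resolved only if they pass. A heavily simplified sketch using git and pytest through subprocess; the real harness runs each instance in its own container with FAIL_TO_PASS and PASS_TO_PASS test lists, so this only shows the shape of the check:

```python
import subprocess

def run(cmd: list, cwd: str) -> bool:
    """Run a command in the repository; True on exit code 0."""
    return subprocess.run(cmd, cwd=cwd, capture_output=True).returncode == 0

def instance_resolved(repo_dir: str, patch_file: str, test_ids: list) -> bool:
    """Apply the model-generated patch, then require the target tests to pass."""
    if not run(["git", "apply", patch_file], cwd=repo_dir):
        return False  # patch does not apply cleanly
    return run(["python", "-m", "pytest", "-q", *test_ids], cwd=repo_dir)

# Hypothetical usage (paths and test id are illustrative only):
# instance_resolved("/tmp/repo", "model.patch",
#                   ["tests/test_io.py::test_roundtrip"])
```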

Terminal Bench

coding

Terminal Bench — tests the ability to accomplish real-world tasks using terminal commands, evaluating shell scripting and CLI tool proficiency.

1. Google DeepMind - Gemini 3.1 Pro Preview: 78.4
2. Anthropic - Claude Opus 4.6: 69.9
3. OpenAI - GPT-5.2 Chat: 64.9
4. OpenAI - GPT-5.2: 64.9
5. Google DeepMind - Gemini 3 Flash Preview: 64.3
27 models tested

CadEval

coding

CadEval — evaluates the ability to generate and reason about Computer-Aided Design code, testing spatial reasoning and engineering knowledge.

1. OpenAI - o3: 74.0
2. OpenAI - o4 Mini: 62.0
3. OpenAI - o1: 56.0
4. Anthropic - Claude 3.7 Sonnet: 54.0
5. Anthropic - Claude 3.7 Sonnet (thinking): 54.0
15 models tested

Cybench

coding

Cybench — evaluates AI on real Capture-The-Flag cybersecurity challenges, testing vulnerability analysis, exploitation, and security reasoning.

1. Anthropic - Claude Sonnet 4.5: 55.0
2. Anthropic - Claude Opus 4.1: 38.0
3. Anthropic - Claude Opus 4: 38.0
4. Anthropic - Claude Sonnet 4: 35.0
5. OpenAI - o3 Mini: 22.5
17 models tested

agentic

APEX-Agents

agentic

APEX-Agents — evaluates AI agents on complex, multi-step tasks requiring planning, tool use, and autonomous decision-making in realistic environments.

1. OpenAI - GPT-5.4: 35.9
2. OpenAI - GPT-5.2 Chat: 34.3
3. OpenAI - GPT-5.2: 34.3
4. Google DeepMind - Gemini 3.1 Pro Preview: 33.5
5. Anthropic - Claude Opus 4.6: 31.7
21 models tested

OSWorld

agentic

OSWorld — tests AI agents on real-world computer tasks across operating systems, including web browsing, file management, and application use.

1. Anthropic - Claude Opus 4.5: 66.3
2. Moonshot AI - Kimi K2.5: 63.3
3. Anthropic - Claude Sonnet 4.5: 62.9
4. Anthropic - Claude Sonnet 4: 43.9
5. Anthropic - Claude 3.7 Sonnet: 35.8
8 models tested

The Agent Company

agentic

The Agent Company — tests AI agents on realistic corporate tasks like email management, code review, data analysis, and cross-tool workflows.

1. DeepSeek - DeepSeek V3.2 Exp: 42.9
2. Anthropic - Claude Sonnet 4: 33.1
3. Anthropic - Claude 3.7 Sonnet: 30.9
4. Anthropic - Claude 3.7 Sonnet (thinking): 30.9
5. OpenAI - GPT-4o (2024-11-20): 8.6
10 models tested

multimodal

VideoMME

multimodal

VideoMME — multimodal benchmark testing video understanding across diverse domains, requiring temporal reasoning and cross-frame comprehension.

1. Google - Gemini 1.5 Pro (Feb 2024): 66.7
2. Alibaba Qwen - Qwen2.5 72B Instruct: 64.7
3. OpenAI - GPT-4o (2024-11-20): 62.5
4. OpenAI - GPT-4o (2024-08-06): 62.5
5. OpenAI - GPT-4o (2024-05-13): 62.5
11 models tested