Beta

Benchmark

128 benchmarks across 11 categories. Select a benchmark to view the full leaderboard.

coding

Aider — Code Editing

coding
1. OpenAI · o1 · 84.2
2. Anthropic · Claude 3.5 Sonnet · 84.2
3. OpenAI · o1-preview · 79.7
4. OpenAI · GPT-4o (2024-05-13) · 72.9
5. OpenAI · GPT-4o (2024-11-20) · 71.4
27 models tested

Aider polyglot

coding

Aider Polyglot · measures how well AI models can edit code across multiple programming languages using the Aider coding assistant framework.

1. OpenAI · GPT-5 Chat · 88.0
2. OpenAI · GPT-5 · 88.0
3. OpenAI · o3 Pro · 84.9
4. Google DeepMind · Gemini 2.5 Pro · 83.1
5. Google DeepMind · Gemini 2.5 Pro Preview 06-05 · 83.1
53 models tested

CadEval

coding

CadEval · evaluates the ability to generate and reason about Computer-Aided Design code, testing spatial reasoning and engineering knowledge.

1. OpenAI · o3 · 74.0
2. Google DeepMind · Gemini 2.5 Pro · 64.0
3. OpenAI · o4 Mini · 62.0
4. OpenAI · o1 · 56.0
5. OpenAI · o3 Mini · 54.0
15 models tested

Cybench

coding

Cybench · evaluates AI on real Capture-The-Flag cybersecurity challenges, testing vulnerability analysis, exploitation, and security reasoning.

1. Anthropic · Claude Opus 4.6 · 93.0
2. Anthropic · Claude Opus 4.5 · 82.0
3. Anthropic · Claude Sonnet 4.5 · 60.0
4. xAI · Grok 4 · 43.0
5. Anthropic · Claude Opus 4.1 · 42.0
20 models tested

GSO-Bench

coding

GSO-Bench · evaluates AI models on real-world open-source software engineering tasks, testing the ability to understand and resolve actual GitHub issues.

1. Anthropic · Claude Opus 4.6 · 33.3
2. OpenAI · GPT-5.2 · 27.4
3. Anthropic · Claude Opus 4.5 · 26.5
4. Google DeepMind · Gemini 3 Pro · 18.6
5. Anthropic · Claude Sonnet 4.5 · 14.7
18 models tested

LiveBench — Agentic Coding

coding
1. OpenAI · GPT-5.1-Codex-Max · 56.7
2. z-ai · GLM 5.1 · 55.0
3. Alibaba Qwen · Qwen3.6 Plus · 55.0
4. z-ai · GLM 5 · 55.0
5. OpenAI · GPT-5.1-Codex · 53.3
29 models tested

LiveBench — Coding

coding
1. OpenAI · GPT-5.2-Codex · 83.6
2. OpenAI · GPT-5.1-Codex-Max · 81.4
3. Alibaba Qwen · Qwen3.6 Plus · 78.2
4. OpenAI · GPT-5 Mini · 76.1
5. DeepSeek · DeepSeek V3.2 · 75.7
29 models tested

OpenCompass — LiveCodeBenchV6

coding
1. z-ai · GLM 5 · 86.2
2. stepfun · Step 3.5 Flash · 83.9
3. z-ai · GLM 4.7 · 83.8
4. Alibaba Qwen · Qwen3.5 397B A17B · 83.0
5. DeepSeek · DeepSeek V3.2 Speciale · 80.9
32 models tested

SWE-bench Multilingual

coding
1. Anthropic · Claude Mythos Preview · 87.3
1 model tested

SWE-bench Multimodal

coding
1. Anthropic · Claude Mythos Preview · 59.0
1 model tested

SWE-bench Pro

coding
1. Anthropic · Claude Mythos Preview · 77.8
1 model tested

SWE-bench Verified

coding

SWE-bench Verified · 500 human-validated tasks from 12 real Python repositories (Django, Flask, scikit-learn, sympy, and others). Each task requires the model to produce a git patch that resolves a real GitHub issue and passes the test suite. The verified subset eliminates ambiguous tasks from the original SWE-bench. Claude Mythos Preview leads at 93.9%, crossing 90% for the first time in 2026. Opus 4.6 scores 78.7%. The benchmark remains the most-cited evaluation for code-generation capability.
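Grading is mechanical: apply the model's git patch to the checked-out repository and re-run the issue's tests. The sketch below shows that loop in Python; it is not the official SWE-bench harness, and the repository path, patch file, and test path in the usage note are hypothetical.

```python
# Minimal sketch of a SWE-bench-style grading step (not the official harness).
# Assumptions: the repo is already checked out at the issue's base commit,
# `patch.diff` is the model-generated git patch, and the issue's tests run via pytest.
import subprocess

def grade_patch(repo_dir: str, patch_file: str, test_path: str) -> bool:
    """Return True if the patch applies cleanly and the target tests pass."""
    apply = subprocess.run(
        ["git", "apply", patch_file],
        cwd=repo_dir, capture_output=True, text=True,
    )
    if apply.returncode != 0:
        return False  # a malformed or non-applying patch counts as a failure
    tests = subprocess.run(
        ["python", "-m", "pytest", test_path, "-q"],
        cwd=repo_dir, capture_output=True, text=True,
    )
    return tests.returncode == 0

# Example with hypothetical paths:
# resolved = grade_patch("django", "patch.diff", "tests/queries/test_qs_combinators.py")
```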

1. Anthropic · Claude Mythos Preview · 93.9
2. Anthropic · Claude Opus 4.6 · 78.7
3. OpenAI · GPT-5.4 · 76.9
4. Anthropic · Claude Opus 4.5 · 76.7
5. Google DeepMind · Gemini 3.1 Pro Preview · 75.6
23 models tested

SWE-Bench Verified (Bash Only)

coding

SWE-Bench Verified (Bash Only) · a curated subset of SWE-bench where models fix real Python repository bugs using only bash commands, no agent frameworks.

1. Anthropic · Claude Opus 4.5 · 74.4
2. OpenAI · GPT-5.2 · 71.8
3. Anthropic · Claude Sonnet 4.5 · 70.6
4. Anthropic · Claude Opus 4 · 67.6
5. OpenAI · GPT-5.1 · 66.0
19 models tested

Terminal Bench

coding

Terminal-Bench 2.0 · evaluates AI agents on real terminal-based coding tasks · writing scripts, debugging, running tests, and managing projects entirely through command-line interaction. Tests both code quality and terminal fluency. Claude Mythos Preview leads at 82.0%, demonstrating significant agentic terminal competence.

1. Anthropic · Claude Mythos Preview · 82.0
2. Google DeepMind · Gemini 3.1 Pro Preview · 78.4
3. OpenAI · GPT-5.3-Codex · 77.3
4. Anthropic · Claude Opus 4.6 · 74.7
5. Google DeepMind · Gemini 3 Pro · 69.4
27 models tested

WeirdML

coding

WeirdML · tests models on unusual and adversarial machine learning tasks that require creative problem-solving beyond standard patterns.

1. OpenAI · GPT-5.3-Codex · 79.3
2. Anthropic · Claude Opus 4.6 · 77.9
3. OpenAI · GPT-5.2 · 72.2
4. Google DeepMind · Gemini 3.1 Pro Preview · 72.1
5. Google DeepMind · Gemini 3 Pro · 69.9
70 models tested

knowledge

ANLI

knowledge

ANLI (Adversarial NLI) · adversarially constructed natural language inference dataset where each round targets weaknesses found in previous model generations.

1. Microsoft · phi-3-small 7.4B · 37.1
2. OpenAI · GPT-3.5 Turbo (older v0613) · 37.1
3. Meta · Llama 3 8B Instruct · 36.0
4. Microsoft · phi-3-medium 14B · 33.7
5. Mistral AI · Mixtral 8x7B Instruct · 32.8
9 models tested

ARC AI2

knowledge

AI2 Reasoning Challenge · tests grade-school level science knowledge with multiple-choice questions requiring reasoning beyond simple retrieval.

1. Meta · Llama 3.1 405B · 93.7
2. DeepSeek · DeepSeek V3 · 93.7
3. Alibaba Qwen · Qwen2.5 72B Instruct · 92.7
4. DeepSeek · DeepSeek-V2 (MoE-236B, May 2024) · 89.6
5. Microsoft · phi-3-medium 14B · 88.8
35 models tested

AudioMultiChallenge

knowledge
1. Google DeepMind · Gemini 2.5 Pro · 46.9
2. Google DeepMind · Gemini 2.5 Flash · 40.0
2 models tested

AudioMultiChallenge — Audio Output

knowledge
0 models tested

AudioMultiChallenge — Text Output

knowledge
1. Google DeepMind · Gemini 2.5 Pro · 46.9
2. Google DeepMind · Gemini 2.5 Flash · 40.0
3. Mistral AI · Voxtral Small 24B 2507 · 26.3
3 models tested

Balrog

knowledge

Balrog · benchmarks AI agents on text-based adventure games, testing language understanding, strategic planning, and long-horizon reasoning.

1. Google DeepMind · Gemini 3 Flash Preview · 48.1
2. xAI · Grok 4 · 43.6
3. Google DeepMind · Gemini 2.5 Pro · 43.3
4. DeepSeek · R1 · 34.9
5. Google DeepMind · Gemini 2.5 Flash · 33.5
22 models tested

C-Eval

knowledge
1. OpenAI · GPT-4 · 68.7
2. Meta · LLaMA-13B · 38.8
2 models tested

Chess Puzzles

knowledge

Chess Puzzles · tests strategic and tactical reasoning by having models solve chess puzzle positions, evaluating lookahead and pattern recognition abilities.
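Grading a puzzle is usually a mechanical check that the model's move matches the unique solution. A minimal sketch with the python-chess library, assuming each puzzle is stored as a FEN position plus the best move in UCI notation; the mate-in-one position shown is illustrative, not an item from the benchmark.

```python
# Sketch of mechanical puzzle grading with python-chess (pip install chess).
# Assumes each puzzle is a FEN plus the unique best move in UCI notation;
# the position below is an illustrative back-rank mate, not a dataset item.
import chess

def grade_puzzle(fen: str, solution_uci: str, model_answer_uci: str) -> bool:
    """True if the model's move is legal in the position and matches the solution."""
    board = chess.Board(fen)
    try:
        move = chess.Move.from_uci(model_answer_uci)
    except ValueError:
        return False                      # unparseable answer
    return move in board.legal_moves and model_answer_uci == solution_uci

# Example: mate in one (Re8#)
print(grade_puzzle("6k1/5ppp/8/8/8/8/5PPP/4R1K1 w - - 0 1", "e1e8", "e1e8"))  # True
```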

1. OpenAI · GPT-5.4 Pro · 58.6
2. Google DeepMind · Gemini 3.1 Pro Preview · 55.0
3. OpenAI · GPT-5.2 · 49.0
4. OpenAI · GPT-5.4 · 44.0
5. Google DeepMind · Gemini 3 Flash Preview · 38.0
24 models tested

CMMLU

knowledge
1. Alibaba Qwen · Qwen2-72B · 89.7
2. Alibaba Qwen · Qwen2.5 72B Instruct · 85.7
3. OpenAI · GPT-4 Turbo · 71.0
4. Meta · Llama 3.1 70B Instruct · 64.4
5. Alibaba Qwen · Qwen-14B · 58.7
8 models tested

CSQA2

knowledge
1. OpenAI · GPT-3.5 Turbo (older v0613) · 14.0
2. Meta · Llama 2-13B · 0.1
2 models tested

DeepResearch Bench

knowledge

DeepResearch Bench · evaluates AI on complex multi-step research tasks requiring information gathering, synthesis, and producing comprehensive analyses.

1. OpenAI · GPT-5 · 55.1
2. Anthropic · Claude Sonnet 4.5 · 52.6
3. Google DeepMind · Gemini 2.5 Pro · 49.7
4. Anthropic · Claude Opus 4.1 · 49.7
5. Anthropic · Claude Opus 4 · 49.0
13 models tested

EnigmaEval

knowledge
1. Google DeepMind · Gemini 3.1 Pro Preview · 19.8
2. OpenAI · o3 · 13.1
2 models tested

Fiction.LiveBench

knowledge

Fiction.LiveBench · a continuously updated benchmark using recently published fiction to test reading comprehension and reasoning, preventing data contamination.

1. OpenAI · GPT-5 · 97.2
2. OpenAI · o3 Pro · 97.2
3. xAI · Grok 4 Fast · 94.4
4. xAI · Grok 4 · 94.4
5. Google DeepMind · Gemini 2.5 Pro · 91.7
41 models tested

GeoBench

knowledge

GeoBench · tests geographic knowledge and spatial reasoning across countries, landmarks, coordinates, and geopolitical understanding.

1. Google DeepMind · Gemini 3 Flash Preview · 88.0
2. Google DeepMind · Gemini 3 Pro · 84.0
3. OpenAI · GPT-5 · 81.0
4. Google DeepMind · Gemini 2.5 Pro · 81.0
5. OpenAI · o1 · 80.0
26 models tested

GPQA

knowledge
1. Meta · Meta Llama 3 8B · 19.7
2. Qwen2.5 72B Instruct Abliterated · 19.4
3. Alibaba Qwen · Qwen2-72B · 19.2
4. DeepSeek · DeepSeek R1 Distill Qwen 14B · 18.3
5. Microsoft · WizardLM-2 8x22B · 17.6
73 models tested

GPQA diamond

knowledge

Graduate-Level Google-Proof QA (Diamond set) · expert-crafted questions in physics, biology, and chemistry that are difficult even for domain PhDs.

1. Anthropic · Claude Mythos Preview · 94.5
2. OpenAI · GPT-5.4 Pro · 92.8
3. Google DeepMind · Gemini 3.1 Pro Preview · 92.1
4. OpenAI · GPT-5.4 · 91.1
5. Google DeepMind · Gemini 3 Pro · 90.2
96 models tested

HellaSwag

knowledge

HellaSwag · tests commonsense reasoning by asking models to predict the most plausible continuation of everyday scenarios.

1. OpenAI · GPT-4 Turbo · 93.7
2. Meta · Llama 3.1 405B · 85.6
3. TII · Falcon-180B · 85.3
4. DeepSeek · DeepSeek V3 · 85.2
5. DeepSeek · DeepSeek-V2 (MoE-236B, May 2024) · 82.8
29 models tested

HELM — GPQA

knowledge
1. Google DeepMind · Gemini 3 Pro · 80.3
2. OpenAI · GPT-5 Chat · 79.1
3. OpenAI · GPT-5 Mini · 75.6
4. OpenAI · o3 · 75.3
5. Google DeepMind · Gemini 2.5 Pro · 74.9
34 models tested

HELM — MMLU-Pro

knowledge
1. Google DeepMind · Gemini 3 Pro · 90.3
2. OpenAI · GPT-5 Chat · 86.3
3. Google DeepMind · Gemini 2.5 Pro · 86.3
4. OpenAI · o3 · 85.9
5. xAI · Grok 4 · 85.1
34 models tested

HLE

knowledge

HLE (Humanity's Last Exam) · a reasoning benchmark designed to be the hardest public evaluation of AI. Questions span mathematics, physics, philosophy, and logic · curated to be at or beyond the frontier of human expert capability. Tested with and without tool augmentation. Claude Mythos Preview scores 56.8% without tools and 64.7% with tools · making it one of the few benchmarks where every other tested model stays below 35%.

1. Anthropic · Claude Mythos Preview · 56.8
2. Google DeepMind · Gemini 3 Pro · 34.4
3. Anthropic · Claude Opus 4.6 · 31.1
4. OpenAI · GPT-5 Pro · 28.2
5. OpenAI · GPT-5.2 · 24.2
23 models tested

Humanity's Last Exam

knowledge
0 models tested

Humanity's Last Exam (Text Only)

knowledge
0 models tested

LAMBADA

knowledge

LAMBADA · measures the ability to predict the final word of a passage, requiring broad contextual understanding across long text spans.
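Scoring is last-word accuracy: the model sees the passage with its final word removed and must produce that word. Below is a minimal sketch with a small Hugging Face causal LM; real LAMBADA scoring also has to handle target words that span multiple tokens, which this single-token check ignores.

```python
# Minimal LAMBADA-style scorer: greedy next-token prediction of the final word.
# Simplification: real targets can span multiple tokens; this checks one token only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def last_word_correct(context: str, target_word: str) -> bool:
    """True if greedy decoding of the next token recovers the target word."""
    ids = tok(context, return_tensors="pt").input_ids
    with torch.no_grad():
        next_logits = model(ids).logits[0, -1]     # distribution over the next token
    predicted = tok.decode(int(next_logits.argmax())).strip()
    return predicted == target_word

# Example (toy passage, not a LAMBADA item):
# last_word_correct("She poured the coffee into her favorite", "mug")
```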

1. TII · Falcon-180B · 79.8
2. Meta · Llama 2-13B · 76.5
3. Meta · LLaMA-13B · 75.2
4. Baichuan 2-7B · 73.3
5. Stable Beluga 2 · 71.3
7 models tested

Lech Mazur Writing

knowledge

Lech Mazur Writing · evaluates creative writing ability, assessing prose quality, narrative coherence, and stylistic sophistication.

1. OpenAI · GPT-5 · 87.2
2. Alibaba Qwen · Qwen3 Max · 87.1
3. moonshotai · Kimi K2 0711 · 86.9
4. OpenAI · o3 Pro · 86.3
5. Google DeepMind · Gemini 2.5 Pro · 86.0
39 models tested

LiveBench — Overall

knowledge
1. OpenAI · GPT-5.2-Codex · 74.3
2. OpenAI · GPT-5.1-Codex-Max · 72.0
3. Alibaba Qwen · Qwen3.6 Plus · 70.8
4. z-ai · GLM 5.1 · 70.2
5. z-ai · GLM 5 · 68.8
29 models tested

MMLU

knowledge

Massive Multitask Language Understanding · 57 subjects spanning STEM, humanities, social sciences, and more. The standard benchmark for broad knowledge.
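MMLU is scored as plain multiple-choice accuracy over four options per question. A sketch of the per-item grading follows; `ask_model` is a stand-in for whatever model call is being evaluated, and the prompt format is only one common convention, not a fixed specification.

```python
# Multiple-choice grading sketch for an MMLU-style item.
# `ask_model` is a placeholder for the model under test; any sample item is invented.
def grade_item(question: str, choices: list[str], gold: str, ask_model) -> bool:
    letters = "ABCD"
    prompt = question + "\n" + "\n".join(
        f"{letter}. {choice}" for letter, choice in zip(letters, choices)
    ) + "\nAnswer with a single letter."
    answer = ask_model(prompt).strip().upper()[:1]   # keep only the leading letter
    return answer == gold

# Overall score = items answered correctly / total items, reported per subject and averaged.
```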

1. DeepSeek · DeepSeek V3 · 82.9
2. Anthropic · Claude 3.5 Sonnet · 82.0
3. OpenAI · GPT-4 (older v0314) · 81.9
4. Meta · Llama 3.3 70B Instruct (free) · 81.7
5. Alibaba Qwen · Qwen2.5 72B Instruct · 80.4
67 models tested

MMLU-PRO

knowledge
1. Alibaba Qwen · Qwen2-72B · 52.6
2. Alibaba · Qwen2.5 32B Instruct · 51.9
3. Alibaba Qwen · Qwen2.5 72B Instruct · 51.4
4. Qwen2.5 72B Instruct Abliterated · 50.4
5. Microsoft · Phi 4 · 48.6
73 models tested

MMMLU

knowledge
1. Anthropic · Claude Mythos Preview · 92.7
1 model tested

MultiChallenge

knowledge
1. Google DeepMind · Gemini 3.1 Pro Preview · 71.4
1 model tested

MultiNRC

knowledge
1. Google DeepMind · Gemini 3.1 Pro Preview · 64.7
1 model tested

OpenBookQA

knowledge

OpenBookQA · science questions that require combining a given core fact with broad common knowledge, mimicking an open-book exam setting.

1. Microsoft · phi-3-mini 3.8B · 84.0
2. Microsoft · phi-3-small 7.4B · 84.0
3. Microsoft · phi-3-medium 14B · 83.2
4. OpenAI · GPT-3.5 Turbo (older v0613) · 81.3
5. Mistral AI · Mixtral 8x7B Instruct · 81.1
19 models tested

OpenCompass — GPQA-Diamond

knowledge
1. Alibaba Qwen · Qwen3.5 397B A17B · 88.4
2. moonshotai · Kimi K2.5 · 88.1
3. z-ai · GLM 4.7 · 86.9
4. DeepSeek · DeepSeek V3.2 Speciale · 86.7
5. z-ai · GLM 5 · 85.3
32 models tested

OpenCompass — HLE

knowledge
1. DeepSeek · DeepSeek V3.2 Speciale · 28.6
2. moonshotai · Kimi K2.5 · 28.6
3. z-ai · GLM 5 · 28.1
4. Alibaba Qwen · Qwen3.5 397B A17B · 27.5
5. z-ai · GLM 4.7 · 25.4
32 models tested

OpenCompass — MMLU-Pro

knowledge
1. Alibaba Qwen · Qwen3.5 397B A17B · 87.6
2. moonshotai · Kimi K2.5 · 86.2
3. DeepSeek · DeepSeek V3.2 · 85.8
4. Google DeepMind · Gemini 2.5 Pro · 85.8
5. DeepSeek · DeepSeek V3.2 Speciale · 85.5
32 models tested

PIQA

knowledge

PIQA (Physical Interaction QA) · tests intuitive physical reasoning by asking models to select the correct approach for everyday physical tasks.

1. OpenAI · GPT-4o-mini (2024-07-18) · 77.4
2. OpenAI · GPT-4o-mini · 77.4
3. Google DeepMind · Gemini 1.5 Flash (May 2024) · 75.0
4. Meta · Llama 3.1 405B · 71.8
5. TII · Falcon-180B · 69.8
25 models tested

PostTrainBench

knowledge
1. Anthropic · Claude Opus 4.6 · 23.2
2. Google DeepMind · Gemini 3.1 Pro Preview · 21.6
3. OpenAI · GPT-5.2 · 21.4
4. OpenAI · GPT-5.4 · 20.2
5. Google DeepMind · Gemini 3 Pro · 18.1
15 models tested

Professional Reasoning — Finance

knowledge
1. Anthropic · Claude Opus 4.6 (Fast) · 53.3
2. OpenAI · GPT-5 · 51.3
3. OpenAI · GPT-5 Pro · 51.1
4. OpenAI · o3 Pro · 49.1
5. OpenAI · o3 · 47.7
5 models tested

Professional Reasoning — Legal

knowledge
1. Anthropic · Claude Opus 4.6 (Fast) · 52.3
2. OpenAI · GPT-5 Pro · 49.9
3. OpenAI · o3 Pro · 49.7
4. OpenAI · GPT-5 · 49.0
5. OpenAI · o3 · 48.6
5 models tested

ScienceQA

knowledge

ScienceQA · multimodal science questions spanning natural science, social science, and language science with diverse question formats and image context.

1. OpenAI · GPT-4o (2024-05-13) · 84.7
2. OpenAI · GPT-4o (2024-11-20) · 84.7
3. Anthropic · Claude 3 Haiku · 62.7
4. Meta · Llama 2-13B · 41.0
5. Meta · LLaMA-13B · 24.4
5 models tested

SciPredict

knowledge
1. Anthropic · Claude Opus 4.5 · 23.1
2. Google DeepMind · Gemini 3 Flash Preview · 22.2
2 models tested

SimpleQA Verified

knowledge

SimpleQA Verified · short factual questions with verified answers, measuring factual accuracy and the tendency to hallucinate or provide incorrect information.

1. Google DeepMind · Gemini 3.1 Pro Preview · 77.3
2. Google DeepMind · Gemini 3 Pro · 72.9
3. Alibaba Qwen · Qwen3 Max · 67.5
4. Google DeepMind · Gemini 3 Flash Preview · 67.4
5. Muse Spark · 66.3
32 models tested

TriviaQA

knowledge

TriviaQA · reading comprehension benchmark with trivia questions, requiring models to find and reason over evidence from provided documents.

1. Anthropic · Claude 2 · 87.5
2. OpenAI · GPT-3.5 Turbo (older v0613) · 85.8
3. OpenAI · GPT-4 Turbo · 84.8
4. DeepSeek · DeepSeek V3 · 82.9
5. Meta · Llama 3.1 405B · 82.7
20 models tested

TutorBench

knowledge
1. Google DeepMind · Gemini 2.5 Pro Preview 06-05 · 55.6
2. moonshotai · Kimi K2.5 · 54.6
2 models tested

VISTA

knowledge
1. Google DeepMind · Gemini 2.5 Pro Preview 06-05 · 54.6
2. OpenAI · o4 Mini · 51.8
2 models tested

VisualToolBench (VTB)

knowledge
1. Google DeepMind · Gemini 3.1 Pro Preview · 29.0
2. Anthropic · Claude Opus 4.6 (Fast) · 27.5
2 models tested

VPCT

knowledge

VPCT (Visual Pattern Completion Test) · tests visual reasoning and pattern recognition by having models complete visual sequences and transformations.

1. Google DeepMind · Gemini 3 Pro · 86.5
2. OpenAI · GPT-5.2 · 76.0
3. Google DeepMind · Gemini 3 Flash Preview · 58.9
4. OpenAI · GPT-5 · 49.0
5. OpenAI · GPT-5.1 · 38.0
22 models tested

Winogrande

knowledge

WinoGrande · large-scale commonsense reasoning benchmark where models must resolve ambiguous pronouns in carefully constructed sentence pairs.

1. Meta · Llama 3.1 405B · 78.4
2. Anthropic · Claude 3 Opus · 77.0
3. OpenAI · GPT-4 (older v0314) · 75.0
4. OpenAI · GPT-4 Turbo · 75.0
5. TII · Falcon-180B · 74.2
38 models tested

agentic

APEX-Agents

agentic

APEX-Agents · evaluates AI agents on complex, multi-step tasks requiring planning, tool use, and autonomous decision-making in realistic environments.

1. OpenAI · GPT-5.4 · 35.9
2. OpenAI · GPT-5.2 · 34.3
3. Google DeepMind · Gemini 3.1 Pro Preview · 33.5
4. OpenAI · GPT-5.3-Codex · 31.7
5. Anthropic · Claude Opus 4.6 · 31.7
17 models tested

MCP Atlas

agentic
1. Anthropic · Claude Opus 4.5 · 62.3
2. Google DeepMind · Gemini 3 Flash Preview · 57.4
2 models tested

OSWorld

agentic

OSWorld · tests AI agents on real-world computer tasks across operating systems, including web browsing, file management, and application use.

1. Anthropic · Claude Mythos Preview · 79.6
2. Anthropic · Claude Opus 4.5 · 66.3
3. moonshotai · Kimi K2.5 · 63.3
4. Anthropic · Claude Sonnet 4.5 · 62.9
5. Anthropic · Claude Sonnet 4 · 43.9
8 models tested

Remote Labor Index (RLI)

agentic
1. Anthropic · Claude Opus 4.6 (Fast) · 4.2
1 model tested

SWE Atlas — Codebase QnA

agentic
1. Anthropic · Claude Opus 4.6 (Fast) · 33.3
2. OpenAI · GPT-5.3-Codex · 32.6
3. Anthropic · Claude Sonnet 4.6 · 31.2
3 models tested

SWE Atlas — Test Writing

agentic
1. Anthropic · Claude Opus 4.6 (Fast) · 36.7
2. Anthropic · Claude Sonnet 4.6 · 31.8
2 models tested

SWE-Bench Pro (Private)

agentic
1. Anthropic · Claude Opus 4.5 · 23.4
2. Google DeepMind · Gemini 2.5 Pro Preview 06-05 · 10.1
2 models tested

SWE-Bench Pro (Public)

agentic
1. Anthropic · Claude Opus 4.5 · 45.9
2. OpenAI · GPT-5.2-Codex · 41.0
2 models tested

The Agent Company

agentic

The Agent Company · tests AI agents on realistic corporate tasks like email management, code review, data analysis, and cross-tool workflows.

1. DeepSeek · DeepSeek V3.2 Exp · 42.9
2. Google DeepMind · Gemini 2.5 Flash · 41.1
3. Anthropic · Claude Sonnet 4 · 33.1
4. Anthropic · Claude 3.7 Sonnet · 30.9
5. Google DeepMind · Gemini 2.5 Pro · 30.3
13 models tested

reasoning

ARC-AGI

reasoning

ARC-AGI · the original Abstraction and Reasoning Corpus, testing whether AI can solve novel visual pattern recognition tasks without memorization.

1. Google DeepMind · Gemini 3.1 Pro Preview · 98.0
2. OpenAI · GPT-5.4 Pro · 94.5
3. Anthropic · Claude Opus 4.6 · 94.0
4. OpenAI · GPT-5.4 · 93.7
5. OpenAI · GPT-5.2 Pro · 90.5
48 models tested

ARC-AGI-2

reasoning

ARC-AGI-2 · the second iteration of the Abstraction and Reasoning Corpus, testing novel pattern recognition and abstract reasoning without prior training data.

1. OpenAI · GPT-5.4 Pro · 83.3
2. Google DeepMind · Gemini 3.1 Pro Preview · 77.1
3. OpenAI · GPT-5.4 · 74.0
4. Anthropic · Claude Opus 4.6 · 69.2
5. Anthropic · Claude Sonnet 4.6 · 60.4
50 models tested

BBH

reasoning

BIG-Bench Hard · a curated subset of 23 challenging tasks from BIG-Bench where language models previously failed to outperform average humans.

1. DeepSeek · DeepSeek V3 · 83.3
2. Google DeepMind · Gemini 1.5 Pro (Feb 2024) · 78.7
3. Meta · Llama 3.1 405B · 77.2
4. Microsoft · phi-3-medium 14B · 75.2
5. Alibaba Qwen · Qwen2.5 72B Instruct · 73.1
24 models tested

CharXiv Reasoning

reasoning

CharXiv Reasoning · tests a model's ability to understand and reason about charts, plots, and figures extracted from real arXiv scientific papers. The model must answer questions that require reading axes, comparing data series, identifying trends, and drawing conclusions from visual scientific data. This benchmark specifically measures the intersection of visual understanding and scientific reasoning · a critical capability for research assistants and document analysis. Performance varies dramatically with and without tool use, making it a key differentiator for multimodal and agentic AI systems.

1. Anthropic · Claude Mythos Preview · 86.1
1 model tested

CharXiv Reasoning (with tools)

reasoning

CharXiv Reasoning (with tools) · the tool-augmented variant of CharXiv Reasoning. Models can use code execution, image processing, and other tools to analyze charts from arXiv papers. Claude Mythos Preview scores 93.2% with tools vs 86.1% without · demonstrating how tool use dramatically improves visual scientific reasoning. The gap between tool-augmented and bare performance is a key signal for agent capability.

1. Anthropic · Claude Mythos Preview · 93.2
1 model tested

GraphWalks BFS 256K-1M

reasoning

GraphWalks BFS 256K-1M · a long-context reasoning benchmark created by OpenAI that tests whether a model can perform breadth-first search (BFS) traversal across massive graphs encoded in 256,000 to 1,024,000 tokens of context. The model receives a graph represented as an edge list and must follow parent-child relationships across the entire extended context window. This is not simple retrieval · it requires true relational reasoning over hundreds of thousands of tokens. The dataset includes 100 problems at 256K context and 100 problems at 1,024K context. Claude Mythos Preview leads at 80.0%, more than doubling Opus 4.6 (38.7%) and far exceeding GPT-5.4 (21.4%). The massive performance gap between models makes this one of the most discriminating benchmarks for real long-context capability in 2026.
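The underlying task is ordinary breadth-first search; the difficulty is performing it over an edge list scattered across hundreds of thousands of tokens. Below is a small reference implementation of the traversal being asked for, usable as a ground-truth oracle; the edge-list format here is illustrative rather than the benchmark's exact prompt encoding.

```python
# Reference BFS over an edge list, as a ground-truth oracle for GraphWalks-style
# questions ("which nodes are reachable from X within k hops?").
# The edge-list format is illustrative; the benchmark's prompt encoding may differ.
from collections import deque

def bfs_within_k(edges: list[tuple[str, str]], start: str, k: int) -> set[str]:
    """Nodes reachable from `start` in at most k hops (directed parent->child edges)."""
    adj: dict[str, list[str]] = {}
    for parent, child in edges:
        adj.setdefault(parent, []).append(child)
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            continue
        for nxt in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return seen - {start}

# Example: bfs_within_k([("a", "b"), ("b", "c"), ("a", "d")], "a", 1) == {"b", "d"}
```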

1. Anthropic · Claude Mythos Preview · 80.0
1 model tested

HELM — WildBench

reasoning
1. OpenAI · GPT-5.1 · 86.3
2. moonshotai · Kimi K2 0711 · 86.2
3. OpenAI · o3 · 86.1
4. Google DeepMind · Gemini 3 Pro · 85.9
5. OpenAI · GPT-5 Chat · 85.7
34 models tested

HLE (with tools)

reasoning
1. Anthropic · Claude Mythos Preview · 64.7
1 model tested

LiveBench — Data Analysis

reasoning
1. OpenAI · GPT-5.2-Codex · 78.2
2. Alibaba Qwen · Qwen3.6 Plus · 69.9
3. z-ai · GLM 5 · 67.9
4. z-ai · GLM 5.1 · 63.2
5. OpenAI · GPT-5.1-Codex · 60.8
29 models tested

LiveBench — Reasoning

reasoning
1. OpenAI · GPT-5.1-Codex-Max · 84.6
2. OpenAI · GPT-5.1-Codex · 82.0
3. OpenAI · GPT-5.2-Codex · 77.7
4. Alibaba Qwen · Qwen3.6 Plus · 75.8
5. minimax · MiniMax M2.7 · 74.8
29 models tested

MUSR

reasoning
1. DeepSeek · DeepSeek R1 Distill Qwen 14B · 28.7
2. nousresearch · Hermes 3 70B Instruct · 23.4
3. Meta · Llama 3 8B Instruct · 19.9
4. Alibaba Qwen · Qwen2-72B · 19.7
5. Stable Beluga 2 · 18.6
73 models tested

SimpleBench

reasoning

SimpleBench · tests fundamental reasoning capabilities with straightforward problems designed to expose gaps in basic logical and spatial thinking.

1. Google DeepMind · Gemini 3.1 Pro Preview · 75.5
2. Google DeepMind · Gemini 3 Pro · 71.7
3. OpenAI · GPT-5.4 Pro · 68.9
4. Anthropic · Claude Opus 4.6 · 61.1
5. Google DeepMind · Gemini 2.5 Pro · 54.9
52 models tested

speed

Artificial Analysis — Agentic Index

speed

Artificial Analysis Agentic Index · a composite score measuring how well a model performs in agentic workflows · multi-step tool use, planning, error recovery, and autonomous task completion. Aggregates results from multiple agentic benchmarks including SWE-bench, tool-use tests, and planning evaluations. The canonical single-number metric for "how good is this model as an agent?"
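Artificial Analysis does not publish its exact weighting here, so the sketch below shows only the generic shape of such a composite: each contributing benchmark is put on a common 0-100 scale and combined as a weighted mean. The benchmark names and the equal weights are placeholders, not the published methodology.

```python
# Generic composite-index sketch (NOT Artificial Analysis's published methodology).
# Benchmark names and equal weighting are placeholders for illustration.
def composite_index(scores: dict[str, float], weights: dict[str, float] | None = None) -> float:
    """Weighted mean of per-benchmark scores, each already on a 0-100 scale."""
    if weights is None:
        weights = {name: 1.0 for name in scores}   # equal weighting by default
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight

# Example with placeholder benchmark scores:
# composite_index({"swe_bench": 74.4, "tool_use": 62.0, "planning": 58.3})  # -> 64.9
```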

1. OpenAI · GPT-5.4 · 69.4
2. Anthropic · Claude Opus 4.6 (Fast) · 67.6
3. z-ai · GLM 5.1 · 67.0
4. z-ai · GLM 5 Turbo · 63.1
5. Anthropic · Claude Sonnet 4.6 · 63.0
66 models tested

Artificial Analysis — Coding Index

speed

Artificial Analysis Coding Index · a composite score that aggregates performance across multiple coding benchmarks into a single index. Tracks code generation quality, debugging ability, multi-language competence, and real-world software engineering tasks. Used by Artificial Analysis to rank model coding capability in a normalized, comparable format. Useful for developers choosing between models for coding-heavy workloads.

1. OpenAI · GPT-5.4 · 57.3
2. Google DeepMind · Gemini 3.1 Pro Preview · 55.5
3. OpenAI · GPT-5.3-Codex · 53.1
4. OpenAI · GPT-5.4 Mini · 51.5
5. Anthropic · Claude Sonnet 4.6 · 50.9
67 models tested

Artificial Analysis — Quality Index

speed
1. Google DeepMind · Gemini 3.1 Pro Preview · 57.2
2. OpenAI · GPT-5.4 · 57.2
3. OpenAI · GPT-5.3-Codex · 54.0
4. Anthropic · Claude Opus 4.6 (Fast) · 53.0
5. Muse Spark · 52.1
68 models tested

general

BBH (HuggingFace)

general
1. Alibaba Qwen · Qwen2.5 72B Instruct · 61.9
2. Qwen2.5 72B Instruct Abliterated · 60.5
3. Meta · Llama 3.3 70B Instruct · 56.6
4. Alibaba · Qwen2.5 32B Instruct · 56.5
5. Meta · Llama 3.1 70B Instruct · 55.9
73 models tested

arena

Chatbot Arena Elo — Coding

arena
1. Anthropic · Claude Opus 4.6 (Fast) · 1546.2
2. Anthropic · Claude Opus 4.6 · 1542.9
3. Anthropic · Claude Sonnet 4.6 · 1521.0
4. Anthropic · Claude Opus 4.5 · 1465.2
5. Google DeepMind · Gemini 3.1 Pro Preview · 1455.7
27 models tested

Chatbot Arena Elo — Overall

arena
1. Anthropic · Claude Opus 4.6 (Fast) · 1502.8
2. Anthropic · Claude Opus 4.6 · 1496.6
3. Google DeepMind · Gemini 3.1 Pro Preview · 1492.6
4. Google DeepMind · Gemini 3 Pro · 1486.2
5. Google DeepMind · Gemini 3 Flash Preview · 1473.9
113 models tested

safety

Fortress

safety
1. Anthropic · Claude Opus 4.5 · 13.6
2. Anthropic · Claude 3.5 Sonnet · 13.0
3. OpenAI · gpt-oss-120b · 8.2
3 models tested

MASK

safety
1. Anthropic · Claude Opus 4.6 (Fast) · 96.3
2. Anthropic · Claude Sonnet 4 · 95.3
2 models tested

PropensityBench

safety
1. Alibaba · Qwen2.5 32B Instruct · 22.9
1 model tested

math

FrontierMath-2025-02-28-Private

math

FrontierMath (Feb 2025) · original research-level math problems created by mathematicians, testing capabilities at the boundary of current AI mathematical reasoning.

1. OpenAI · GPT-5.4 Pro · 50.0
2. OpenAI · GPT-5.4 · 47.6
3. Anthropic · Claude Opus 4.6 · 40.7
4. OpenAI · GPT-5.2 · 40.7
5. Muse Spark · 39.0
54 models tested

FrontierMath-Tier-4-2025-07-01-Private

math

FrontierMath Tier 4 (Jul 2025) · the most challenging tier of frontier mathematics, containing problems that push the absolute limits of AI mathematical reasoning.

1. OpenAI · GPT-5.4 Pro · 37.5
2. OpenAI · GPT-5.2 Pro · 31.3
3. OpenAI · GPT-5.4 · 27.1
4. Anthropic · Claude Opus 4.6 · 22.9
5. OpenAI · GPT-5.2 · 18.8
37 models tested

GSM8K

math

Grade School Math 8K · 8,500 linguistically diverse grade-school math word problems that require multi-step reasoning to solve.
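An invented problem in the GSM8K style (not a dataset item) shows what "multi-step" means in practice: the answer chains several elementary operations rather than one.

```python
# Invented GSM8K-style problem, worked step by step (not an actual dataset item):
# "A bakery makes 12 trays of 8 muffins each. It sells 70 muffins in the morning
#  and half of the rest in the afternoon. How many muffins are left?"
baked = 12 * 8                        # 96 muffins baked
after_morning = baked - 70            # 26 left after morning sales
sold_afternoon = after_morning // 2   # 13 sold in the afternoon
left = after_morning - sold_afternoon
print(left)  # 13
```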

1. OpenAI · GPT-4 (older v0314) · 92.0
2. OpenAI · GPT-4o-mini (2024-07-18) · 91.3
3. OpenAI · GPT-4o-mini · 91.3
4. Alibaba Qwen · Qwen2.5 Coder 32B Instruct · 91.1
5. OpenAI · GPT-4 Turbo · 90.0
32 models tested

HELM — Omni-MATH

math
1. OpenAI · GPT-5 Mini · 72.2
2. OpenAI · o4 Mini · 72.0
3. OpenAI · o3 · 71.4
4. OpenAI · gpt-oss-120b · 68.8
5. moonshotai · Kimi K2 0711 · 65.4
34 models tested

LiveBench — Mathematics

math
1. OpenAI · GPT-5.2-Codex · 88.8
2. z-ai · GLM 5.1 · 84.9
3. Alibaba Qwen · Qwen3.6 Plus · 83.7
4. OpenAI · GPT-5.1-Codex-Max · 83.7
5. z-ai · GLM 5 · 83.5
29 models tested

MATH level 5

math

MATH Level 5 · the hardest tier of the MATH benchmark, featuring competition-level problems from AMC, AIME, and Olympiad-style mathematics.

1. OpenAI · GPT-5 · 98.1
2. OpenAI · GPT-5 Mini · 97.8
3. OpenAI · o4 Mini · 97.8
4. OpenAI · o3 · 97.8
5. Anthropic · Claude Sonnet 4.5 · 97.7
72 models tested

MATH Level 5

math
1. Alibaba · Qwen2.5 32B Instruct · 62.5
2. Qwen2.5 72B Instruct Abliterated · 60.1
3. Alibaba Qwen · Qwen2.5 72B Instruct · 59.8
4. DeepSeek · DeepSeek R1 Distill Qwen 14B · 57.0
5. Alibaba · Qwen2.5 14B Instruct · 55.3
73 models tested

OpenCompass — AIME2025

math
1. DeepSeek · DeepSeek V3.2 Speciale · 96.0
2. z-ai · GLM 5 · 95.8
3. stepfun · Step 3.5 Flash · 95.7
4. z-ai · GLM 4.7 · 95.4
5. moonshotai · Kimi K2 Thinking · 94.1
32 models tested

OTIS Mock AIME 2024-2025

math

OTIS Mock AIME 2024-2025 · simulated American Invitational Mathematics Examination problems testing advanced problem-solving skills.

1. OpenAI · GPT-5.2 · 96.1
2. Google DeepMind · Gemini 3.1 Pro Preview · 95.6
3. OpenAI · GPT-5.4 · 95.3
4. Anthropic · Claude Opus 4.6 · 94.4
5. Google DeepMind · Gemini 3 Flash Preview · 92.8
86 models tested

USAMO

math
1. Anthropic · Claude Mythos Preview · 97.6
1 model tested

language

HELM — IFEval

language
1. xAI · Grok 3 Mini Beta · 95.1
2. xAI · Grok 4 · 94.9
3. OpenAI · GPT-5.1 · 93.5
4. OpenAI · GPT-5 Nano · 93.2
5. OpenAI · o4 Mini · 92.9
34 models tested

IFEval

language
1. Meta · Llama 3.3 70B Instruct · 90.0
2. Meta · Llama 3.1 70B Instruct · 86.7
3. Alibaba Qwen · Qwen2.5 72B Instruct · 86.4
4. Qwen2.5 72B Instruct Abliterated · 85.9
5. Alibaba · Qwen2.5 32B Instruct · 83.5
73 models tested

JCommonsenseQA

language
1. DeepSeek · DeepSeek R1 Distill Qwen 14B · 93.7
2. Alibaba · Qwen2 7B Instruct · 89.1
3. Alibaba · Qwen2 VL 7B Instruct · 87.8
4. Meta · Meta Llama 3 8B Instruct · 87.7
5. Meta · Meta Llama 3 8B · 82.9
11 models tested

JHumanEval

language
0 models tested

JMMLU

language
1. DeepSeek · DeepSeek R1 Distill Qwen 14B · 63.4
2. Alibaba · Qwen2 7B Instruct · 56.5
3. Alibaba · Qwen2 VL 7B Instruct · 56.3
4. Meta · Meta Llama 3 8B Instruct · 46.7
5. Meta · Meta Llama 3 8B · 44.7
11 models tested

JNLI

language
1. DeepSeek · DeepSeek R1 Distill Qwen 14B · 82.4
2. Alibaba · Qwen2 7B Instruct · 81.3
3. Alibaba · Qwen2 VL 7B Instruct · 74.4
4. DeepSeek · DeepSeek R1 Distill Llama 8B · 69.4
5. Meta · Meta Llama 3 8B Instruct · 61.1
11 models tested

JSQuAD

language
1. Alibaba · Qwen2 VL 7B Instruct · 89.9
2. DeepSeek · DeepSeek R1 Distill Qwen 14B · 89.8
3. Alibaba · Qwen2 7B Instruct · 89.6
4. Meta · Meta Llama 3 8B Instruct · 89.5
5. Meta · Meta Llama 3 8B · 88.9
11 models tested

LiveBench — IF

language
1. z-ai · GLM 5.1 · 68.5
2. Google DeepMind · Gemma 4 31B · 67.6
3. OpenAI · GPT-5.1-Codex-Max · 67.1
4. OpenAI · GPT-5.2-Codex · 66.5
5. OpenAI · GPT-5 Mini · 64.2
29 models tested

LiveBench — Language

language
1. z-ai · GLM 5 · 77.5
2. OpenAI · GPT-5.1-Codex-Max · 75.4
3. Alibaba Qwen · Qwen3.6 Plus · 75.0
4. OpenAI · GPT-5.2-Codex · 73.7
5. z-ai · GLM 5.1 · 71.8
29 models tested

LLM-JP — Overall

language
1. DeepSeek · DeepSeek R1 Distill Qwen 14B · 56.8
2. Alibaba · Qwen2 VL 7B Instruct · 53.0
3. Alibaba · Qwen2 7B Instruct · 51.7
4. Meta · Meta Llama 3 8B Instruct · 49.6
5. Meta · Meta Llama 3 8B · 48.9
11 models tested

MMMLU — Arabic

language
1. Alibaba · Qwen2 7B Instruct · 50.7
2. Meta · Meta Llama 3 8B Instruct · 40.5
2 models tested

MMMLU — Bengali

language
1. Alibaba · Qwen2 7B Instruct · 43.4
2. Meta · Meta Llama 3 8B Instruct · 36.4
2 models tested

MMMLU — Chinese

language
1. Alibaba · Qwen2 7B Instruct · 61.8
2. Meta · Meta Llama 3 8B Instruct · 51.4
2 models tested

MMMLU — French

language
1. Alibaba · Qwen2 7B Instruct · 60.8
2. Meta · Meta Llama 3 8B Instruct · 55.8
2 models tested

MMMLU — German

language
1. Alibaba · Qwen2 7B Instruct · 57.1
2. Meta · Meta Llama 3 8B Instruct · 53.5
2 models tested

MMMLU — Hindi

language
1. Alibaba · Qwen2 7B Instruct · 45.1
2. Meta · Meta Llama 3 8B Instruct · 41.4
2 models tested

MMMLU — Indonesian

language
1. Alibaba · Qwen2 7B Instruct · 54.1
2. Meta · Meta Llama 3 8B Instruct · 51.0
2 models tested

MMMLU — Italian

language
1. Alibaba · Qwen2 7B Instruct · 59.0
2. Meta · Meta Llama 3 8B Instruct · 53.3
2 models tested

MMMLU — Japanese

language
1. Alibaba · Qwen2 7B Instruct · 56.6
2. Meta · Meta Llama 3 8B Instruct · 42.3
2 models tested

MMMLU — Korean

language
1. Alibaba · Qwen2 7B Instruct · 54.0
2. Meta · Meta Llama 3 8B Instruct · 46.5
2 models tested

MMMLU — Portuguese

language
1. Alibaba · Qwen2 7B Instruct · 60.1
2. Meta · Meta Llama 3 8B Instruct · 55.5
2 models tested

MMMLU — Spanish

language
1. Alibaba · Qwen2 7B Instruct · 60.2
2. Meta · Meta Llama 3 8B Instruct · 55.8
2 models tested

MMMLU — Swahili

language
1. Meta · Meta Llama 3 8B Instruct · 37.5
2. Alibaba · Qwen2 7B Instruct · 34.3
2 models tested

MMMLU — Yoruba

language
1. Meta · Meta Llama 3 8B Instruct · 31.0
2. Alibaba · Qwen2 7B Instruct · 30.2
2 models tested

OpenCompass — IFEval

language
1. moonshotai · Kimi K2.5 · 93.9
2. z-ai · GLM 5 · 93.2
3. stepfun · Step 3.5 Flash · 93.2
4. moonshotai · Kimi K2 Thinking · 92.4
5. DeepSeek · DeepSeek V3.2 Speciale · 91.7
32 models tested

multimodal

VideoMME

multimodal

VideoMME · multimodal benchmark testing video understanding across diverse domains, requiring temporal reasoning and cross-frame comprehension.

1. Google DeepMind · Gemini 1.5 Pro (Feb 2024) · 66.7
2. Alibaba Qwen · Qwen2.5 72B Instruct · 64.7
3. OpenAI · GPT-4o (2024-11-20) · 62.5
4. OpenAI · GPT-4o (2024-08-06) · 62.5
5. Google DeepMind · Gemini 1.5 Flash (May 2024) · 60.4
8 models tested