Benchmarks
40 benchmarks across 6 categories. Click to see the full rankings.
knowledge
ARC AI2
AI2 Reasoning Challenge — tests grade-school-level science knowledge with multiple-choice questions requiring reasoning beyond simple retrieval.
HellaSwag
HellaSwag — tests commonsense reasoning by asking models to predict the most plausible continuation of everyday scenarios.
LAMBADA
LAMBADA — measures the ability to predict the final word of a passage, requiring broad contextual understanding across long text spans.
MMLU
Massive Multitask Language Understanding — 57 subjects spanning STEM, humanities, social sciences, and more. The standard benchmark for broad knowledge; a multiple-choice scoring sketch follows at the end of this category.
GPQA diamond
Graduate-Level Google-Proof QA (Diamond set) — expert-crafted questions in physics, biology, and chemistry that are difficult even for domain PhDs.
Winogrande
WinoGrande — large-scale commonsense reasoning benchmark where models must resolve ambiguous pronouns in carefully constructed sentence pairs.
Lech Mazur Writing
Lech Mazur Writing — evaluates creative writing ability, assessing prose quality, narrative coherence, and stylistic sophistication.
Fiction.LiveBench
Fiction.LiveBench — a continuously updated benchmark using recently published fiction to test reading comprehension and reasoning, preventing data contamination.
SimpleQA Verified
SimpleQA Verified — short factual questions with verified answers, measuring factual accuracy and the tendency to hallucinate or provide incorrect information.
Chess Puzzles
Chess Puzzles — tests strategic and tactical reasoning by having models solve chess puzzle positions, evaluating lookahead and pattern recognition abilities.
HLE
HLE (Humanity's Last Exam) — crowdsourced expert-level questions designed to be among the hardest possible challenges for AI systems across all domains.
TriviaQA
TriviaQA — reading comprehension benchmark with trivia questions, requiring models to find and reason over evidence from provided documents.
ScienceQA
ScienceQA — multimodal science questions spanning natural science, social science, and language science with diverse question formats and image context.
PIQA
PIQA (Physical Interaction QA) — tests intuitive physical reasoning by asking models to select the correct approach for everyday physical tasks.
OpenBookQA
OpenBookQA — science questions that require combining a given core fact with broad common knowledge, mimicking an open-book exam setting.
Balrog
Balrog — benchmarks AI agents on text-based adventure games, testing language understanding, strategic planning, and long-horizon reasoning.
GeoBench
GeoBench — tests geographic knowledge and spatial reasoning across countries, landmarks, coordinates, and geopolitical understanding.
ANLI
ANLI (Adversarial NLI) — adversarially constructed natural language inference dataset where each round targets weaknesses found in previous model generations.
DeepResearch Bench
DeepResearch Bench — evaluates AI on complex multi-step research tasks requiring information gathering, synthesis, and producing comprehensive analyses.
VPCT
VPCT (Visual Pattern Completion Test) — tests visual reasoning and pattern recognition by having models complete visual sequences and transformations.
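Most of the knowledge benchmarks above (MMLU, ARC, HellaSwag, PIQA, WinoGrande, and others) are reported as multiple-choice accuracy: the model picks one option per question and the score is the fraction of gold answers it matches. Here is a minimal sketch of that scoring loop, with a hypothetical ask_model function standing in for whatever model is being evaluated:

```python
# Minimal multiple-choice accuracy scoring, as used by benchmarks such as MMLU
# or ARC. `ask_model` is a hypothetical stand-in for the model under evaluation.

def ask_model(question: str, choices: list[str]) -> int:
    """Hypothetical model call: returns the index of the chosen answer."""
    raise NotImplementedError

def multiple_choice_accuracy(items: list[dict]) -> float:
    """Fraction of items where the model picks the gold answer index.

    Each item is a dict with "question", "choices", and "answer" keys,
    where "answer" is the index of the correct choice.
    """
    correct = 0
    for item in items:
        prediction = ask_model(item["question"], item["choices"])
        if prediction == item["answer"]:
            correct += 1
    return correct / len(items)

# Toy item for illustration only (not a real benchmark question):
# {"question": "Which planet is closest to the Sun?",
#  "choices": ["Venus", "Mercury", "Earth", "Mars"], "answer": 1}
```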
reasoning
BBH
BIG-Bench Hard — a curated subset of 23 challenging tasks from BIG-Bench where language models previously failed to outperform average humans.
SimpleBench
SimpleBench — tests fundamental reasoning capabilities with straightforward problems designed to expose gaps in basic logical and spatial thinking.
ARC-AGI-2
ARC-AGI-2 — the second iteration of the Abstraction and Reasoning Corpus, testing novel pattern recognition and abstract reasoning without prior training data.
ARC-AGI
ARC-AGI — the original Abstraction and Reasoning Corpus, testing whether AI can solve novel visual pattern recognition tasks without memorization.
math
GSM8K
Grade School Math 8K — 8,500 linguistically diverse grade-school math word problems that require multi-step reasoning to solve; an answer-checking sketch follows at the end of this category.
MATH level 5
MATH Level 5 — the hardest tier of the MATH benchmark, featuring competition-level problems from AMC, AIME, and Olympiad-style mathematics.
OTIS Mock AIME 2024-2025
OTIS Mock AIME 2024–2025 — simulated American Invitational Mathematics Examination problems testing advanced problem-solving skills.
FrontierMath-2025-02-28-Private
FrontierMath (Feb 2025) — original research-level math problems created by mathematicians, testing capabilities at the boundary of current AI mathematical reasoning.
FrontierMath-Tier-4-2025-07-01-Private
FrontierMath Tier 4 (Jul 2025) — the most challenging tier of frontier mathematics, containing problems that push the absolute limits of AI mathematical reasoning.
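Math benchmarks such as GSM8K and MATH are usually scored by matching the model's final answer against the reference answer rather than by grading the reasoning steps. Below is a minimal sketch under a simplified convention, assuming the final answer can be extracted as the last number in the response (real harnesses normalize answers more carefully):

```python
import re

def extract_final_number(response: str) -> float | None:
    """Pull the last number out of a model response (simplified convention)."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", response.replace(",", ""))
    return float(matches[-1]) if matches else None

def exact_match_score(responses: list[str], gold_answers: list[float]) -> float:
    """Fraction of responses whose final number equals the gold answer."""
    correct = sum(
        1
        for response, gold in zip(responses, gold_answers)
        if (pred := extract_final_number(response)) is not None and pred == gold
    )
    return correct / len(gold_answers)

# Toy example, illustrative only (not a real GSM8K item):
print(exact_match_score(["3 packs of 12 is 36, minus 5 gives 31."], [31.0]))  # 1.0
```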
coding
WeirdML
WeirdML — tests models on unusual and adversarial machine learning tasks that require creative problem-solving beyond standard patterns.
Aider polyglot
Aider Polyglot — measures how well AI models can edit code across multiple programming languages using the Aider coding assistant framework.
GSO-Bench
GSO-Bench — evaluates AI models on real-world open-source software engineering tasks, testing the ability to understand and resolve actual GitHub issues.
SWE-Bench Verified (Bash Only)
SWE-Bench Verified (Bash Only) — a curated subset of SWE-bench where models fix real Python repository bugs using only bash commands, with no agent frameworks; a test-based scoring sketch follows at the end of this category.
Terminal Bench
Terminal Bench — tests the ability to accomplish real-world tasks using terminal commands, evaluating shell scripting and CLI tool proficiency.
CadEval
CadEval — evaluates the ability to generate and reason about Computer-Aided Design code, testing spatial reasoning and engineering knowledge.
Cybench
Cybench — evaluates AI on real Capture-The-Flag cybersecurity challenges, testing vulnerability analysis, exploitation, and security reasoning.
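Several of the coding benchmarks above, SWE-Bench Verified in particular, judge a model by whether the project's tests pass after its edits are applied, not by how the code looks. A minimal sketch of that check, assuming a hypothetical layout in which each task directory already contains the patched repository and a pytest suite:

```python
# Test-based scoring for code-editing tasks: run each task's test suite on the
# repository after the model's edits and count the task as solved only if the
# tests pass. Paths and layout are hypothetical, for illustration only.
import subprocess
from pathlib import Path

def run_tests(repo_dir: Path) -> bool:
    """Run the repo's pytest suite; the task counts as solved only on a clean exit."""
    result = subprocess.run(
        ["python", "-m", "pytest", "-q"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        timeout=600,
    )
    return result.returncode == 0

def resolve_rate(task_dirs: list[Path]) -> float:
    """Fraction of tasks whose tests pass after the model's edits were applied."""
    solved = sum(1 for repo in task_dirs if run_tests(repo))
    return solved / len(task_dirs)

# Usage (hypothetical directory layout):
# tasks = sorted(Path("patched_repos").iterdir())
# print(resolve_rate(tasks))
```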
agentic
APEX-Agents
APEX-Agents — evaluates AI agents on complex, multi-step tasks requiring planning, tool use, and autonomous decision-making in realistic environments.
OSWorld
OSWorld — tests AI agents on real-world computer tasks across operating systems, including web browsing, file management, and application use.
The Agent Company
The Agent Company — tests AI agents on realistic corporate tasks like email management, code review, data analysis, and cross-tool workflows.
multimodal
VideoMME
VideoMME — multimodal benchmark testing video understanding across diverse domains, requiring temporal reasoning and cross-frame comprehension.