Every AI Benchmark · Tracked
The most complete catalog of AI evaluations · every benchmark categorized, ranked across every model, and plotted against time on the frontier. Correlation analysis and live shortlists included.
The Frontier
Best score over time · one chart, every benchmark
Best for the job
Top 5 models for common tasks · rankings refresh whenever the data does · no editorial picks.
From benchmark to deployment
Follow benchmark winners into pricing, provider, model, and comparison views
By category
6 hand-curated buckets · every benchmark assigned
Reasoning
· 11 benchmarks · Multi-step logic, abstract pattern recognition, and adversarial inference tasks.
Graduate-Level Google-Proof QA (Diamond set) · expert-crafted questions in physics, biology, and chemistry that are difficult even for domain PhDs.
SimpleBench · tests fundamental reasoning capabilities with straightforward problems designed to expose gaps in basic logical and spatial thinking.
ARC-AGI-2 · the second iteration of the Abstraction and Reasoning Corpus, testing novel pattern recognition and abstract reasoning without prior training data.
ARC-AGI · the original Abstraction and Reasoning Corpus, testing whether AI can solve novel visual pattern recognition tasks without memorization.
WinoGrande · large-scale commonsense reasoning benchmark where models must resolve ambiguous pronouns in carefully constructed sentence pairs.
HellaSwag · tests commonsense reasoning by asking models to predict the most plausible continuation of everyday scenarios.
BIG-Bench Hard · a curated subset of 23 challenging tasks from BIG-Bench where language models previously failed to outperform average humans.
Chess Puzzles · tests strategic and tactical reasoning by having models solve chess puzzle positions, evaluating lookahead and pattern recognition abilities.
HLE (Humanity's Last Exam) · a reasoning benchmark designed to be the hardest public evaluation of AI. Questions span mathematics, physics, philosophy, and logic · curated to be at or beyond the frontier of human expert capability. Tested with and without tool augmentation. Claude Opus 4.7 scores 46.9% without tools and 54.7% with tools · making it one of the few benchmarks where the top score is below 60%.
Code
· 8 benchmarks · Software engineering tasks · real bug fixes, multi-language code editing, terminal proficiency.
WeirdML · tests models on unusual and adversarial machine learning tasks that require creative problem-solving beyond standard patterns.
Aider Polyglot · measures how well AI models can edit code across multiple programming languages using the Aider coding assistant framework.
Terminal-Bench 2.0 · evaluates AI agents on real terminal-based coding tasks · writing scripts, debugging, running tests, and managing projects entirely through command-line interaction. Tests both code quality and terminal fluency. Claude Opus 4.7 scores 69.4%, demonstrating significant agentic terminal competence.
SWE-bench Verified · 500 human-validated tasks from 12 real Python repositories (Django, Flask, scikit-learn, sympy, and others). Each task requires the model to produce a git patch that resolves a real GitHub issue and passes the test suite. The verified subset eliminates ambiguous tasks from the original SWE-bench. Claude Mythos Preview leads at 93.9%, crossing 90% for the first time in 2026. Opus 4.6 scores 80.8%. The benchmark remains the most-cited evaluation for code-generation capability. The apply-patch-then-test scoring loop is sketched after this list.
Cybench · evaluates AI on real Capture-The-Flag cybersecurity challenges, testing vulnerability analysis, exploitation, and security reasoning.
SWE-bench Verified (Bash Only) · a curated subset of SWE-bench where models fix real Python repository bugs using only bash commands, no agent frameworks.
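For readers unfamiliar with how SWE-bench-style tasks are scored, the core loop is: apply the model's git patch to a repository checkout, then run the issue's tests. A minimal sketch in Python · the paths, commands, and function name here are illustrative, not the official harness:

```python
import subprocess
import tempfile


def score_swebench_task(repo_dir: str, model_patch: str, test_cmd: list[str]) -> bool:
    """Apply a model-generated git patch, then run the issue's tests.

    A task counts as resolved only if the patch applies cleanly AND the
    tests pass. Illustrative sketch, not the official SWE-bench harness.
    """
    # Write the model's patch to a temp file so git can apply it.
    with tempfile.NamedTemporaryFile("w", suffix=".patch", delete=False) as f:
        f.write(model_patch)
        patch_path = f.name

    applied = subprocess.run(
        ["git", "apply", patch_path], cwd=repo_dir, capture_output=True
    )
    if applied.returncode != 0:
        return False  # patch did not apply -> task failed

    # Run the repository's test suite, e.g. ["pytest", "tests/test_issue.py"].
    tests = subprocess.run(test_cmd, cwd=repo_dir, capture_output=True)
    return tests.returncode == 0
```

The verified subset matters precisely because this loop is unforgiving: an ambiguous or flaky test suite would mark correct patches as failures.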
Math
· 5 benchmarks · Grade-school arithmetic through competition math and frontier research problems.
OTIS Mock AIME 2024-2025 · simulated American Invitational Mathematics Examination problems testing advanced problem-solving skills.
MATH Level 5 · the hardest tier of the MATH benchmark, featuring competition-level problems from AMC, AIME, and Olympiad-style mathematics.
FrontierMath (Feb 2025) · original research-level math problems created by mathematicians, testing capabilities at the boundary of current AI mathematical reasoning.
FrontierMath Tier 4 (Jul 2025) · the most challenging tier of frontier mathematics, containing problems that push the absolute limits of AI mathematical reasoning.
Multimodal
· 5 benchmarks · Vision, video, and cross-modal tasks combining text with other input types.
VPCT (Visual Pattern Completion Test) · tests visual reasoning and pattern recognition by having models complete visual sequences and transformations.
VideoMME · multimodal benchmark testing video understanding across diverse domains, requiring temporal reasoning and cross-frame comprehension.
ScienceQA · multimodal science questions spanning natural science, social science, and language science with diverse question formats and image context.
CharXiv Reasoning · tests a model's ability to understand and reason about charts, plots, and figures extracted from real arXiv scientific papers. The model must answer questions that require reading axes, comparing data series, identifying trends, and drawing conclusions from visual scientific data. This benchmark specifically measures the intersection of visual understanding and scientific reasoning · a critical capability for research assistants and document analysis. Performance varies dramatically with and without tool use, making it a key differentiator for multimodal and agentic AI systems.
CharXiv Reasoning (with tools) · the tool-augmented variant of CharXiv Reasoning. Models can use code execution, image processing, and other tools to analyze charts from arXiv papers. Claude Mythos Preview scores 93.2% with tools vs 86.1% without · demonstrating how tool use dramatically improves visual scientific reasoning. The gap between tool-augmented and bare performance is a key signal for agent capability.
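An aside on reading tool gaps: raw percentage-point differences understate progress near the ceiling, so one illustrative normalization (not BenchGecko's methodology) is the share of remaining errors that tooling eliminates:

```python
def headroom_recovered(with_tools: float, without_tools: float) -> float:
    """Fraction of the bare model's remaining error closed by tool use.

    Scores are percentages in [0, 100]. A value of 0.5 means tools fixed
    half of the mistakes the bare model made. Illustrative metric only.
    """
    if without_tools >= 100.0:
        return 0.0  # nothing left to recover
    return (with_tools - without_tools) / (100.0 - without_tools)


# CharXiv Reasoning figures quoted above for Claude Mythos Preview:
print(round(headroom_recovered(93.2, 86.1), 3))  # ~0.511
```

By this reading, tools close roughly half of the bare model's remaining errors on CharXiv Reasoning.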
Agent
· 4 benchmarks · Long-horizon tool use · multi-step autonomous tasks in realistic environments.
APEX-Agents · evaluates AI agents on complex, multi-step tasks requiring planning, tool use, and autonomous decision-making in realistic environments.
The Agent Company · tests AI agents on realistic corporate tasks like email management, code review, data analysis, and cross-tool workflows.
OSWorld · tests AI agents on real-world computer tasks across operating systems, including web browsing, file management, and application use.
GraphWalks BFS 256K-1M · a long-context reasoning benchmark created by OpenAI that tests whether a model can perform breadth-first search (BFS) traversal across massive graphs encoded in 256,000 to 1,024,000 tokens of context. The model receives a graph represented as an edge list and must follow parent-child relationships across the entire extended context window. This is not simple retrieval · it requires true relational reasoning over hundreds of thousands of tokens. The dataset includes 100 problems at 256K context and 100 at 1M context. Claude Mythos Preview leads at 80.0%, more than doubling Opus 4.6 (38.7%) and far exceeding GPT-5.4 (21.4%). The massive performance gap between models makes this one of the most discriminating benchmarks for real long-context capability in 2026.
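To make the task concrete, here is the reference computation a GraphWalks-style grader would perform, shrunk to a toy edge list · the input format is illustrative; the real benchmark embeds hundreds of thousands of edges in the prompt:

```python
from collections import deque


def bfs_order(edges: list[tuple[str, str]], root: str) -> list[str]:
    """Breadth-first traversal of a directed graph given as an edge list.

    GraphWalks-style tasks embed a (much larger) edge list in the context
    and ask the model to walk parent-child links; this is the reference
    answer a grader would compute.
    """
    children: dict[str, list[str]] = {}
    for parent, child in edges:
        children.setdefault(parent, []).append(child)

    seen, order, queue = {root}, [], deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        for nxt in children.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order


print(bfs_order([("a", "b"), ("a", "c"), ("b", "d")], "a"))  # ['a', 'b', 'c', 'd']
```

The computation itself is trivial; the benchmark's difficulty comes entirely from keeping the edge list coherent across a million tokens of context.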
Knowledge
· 95 benchmarks · Factual recall, reading comprehension, creative writing, and game-playing tasks.
Massive Multitask Language Understanding (MMLU) · 57 subjects spanning STEM, humanities, social sciences, and more. The standard benchmark for broad knowledge.
Artificial Analysis Coding Index · a composite score that aggregates performance across multiple coding benchmarks into a single index. Tracks code generation quality, debugging ability, multi-language competence, and real-world software engineering tasks. Used by Artificial Analysis to rank model coding capability in a normalized, comparable format. Useful for developers choosing between models for coding-heavy workloads (a toy aggregation formula is sketched after this list).
Artificial Analysis Agentic Index · a composite score measuring how well a model performs in agentic workflows · multi-step tool use, planning, error recovery, and autonomous task completion. Aggregates results from multiple agentic benchmarks including SWE-bench, tool-use tests, and planning evaluations. The canonical single-number metric for "how good is this model as an agent?"
Fiction.LiveBench · a continuously updated benchmark using recently published fiction to test reading comprehension and reasoning, preventing data contamination.
Lech Mazur Writing · evaluates creative writing ability, assessing prose quality, narrative coherence, and stylistic sophistication.
AI2 Reasoning Challenge · tests grade-school level science knowledge with multiple-choice questions requiring reasoning beyond simple retrieval.
SimpleQA Verified · short factual questions with verified answers, measuring factual accuracy and the tendency to hallucinate or provide incorrect information.
GeoBench · tests geographic knowledge and spatial reasoning across countries, landmarks, coordinates, and geopolitical understanding.
PIQA (Physical Interaction QA) · tests intuitive physical reasoning by asking models to select the correct approach for everyday physical tasks.
TriviaQA · reading comprehension benchmark with trivia questions, requiring models to find and reason over evidence from provided documents.
OpenBookQA · science questions that require combining a given core fact with broad common knowledge, mimicking an open-book exam setting.
DeepResearch Bench · evaluates AI on complex multi-step research tasks requiring information gathering, synthesis, and producing comprehensive analyses.
LAMBADA · measures the ability to predict the final word of a passage, requiring broad contextual understanding across long text spans.
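The two Artificial Analysis indices above aggregate heterogeneous benchmarks into one number. Their exact formula isn't reproduced here, but a common construction is min-max normalization followed by a weighted mean · a toy sketch with hypothetical inputs:

```python
def composite_index(scores: dict[str, float],
                    ranges: dict[str, tuple[float, float]],
                    weights: dict[str, float] | None = None) -> float:
    """Illustrative composite: min-max normalize each benchmark score to
    [0, 1] against its observed score range, then take a weighted mean
    on a 0-100 scale. Not the actual Artificial Analysis formula.
    """
    weights = weights or {name: 1.0 for name in scores}
    total = sum(weights[name] for name in scores)
    normalized = 0.0
    for name, raw in scores.items():
        lo, hi = ranges[name]
        normalized += weights[name] * (raw - lo) / (hi - lo)
    return 100 * normalized / total


# Hypothetical inputs: two coding benchmarks, both scored 0-100.
print(composite_index(
    {"swe_bench_verified": 80.8, "terminal_bench": 69.4},
    {"swe_bench_verified": (0, 100), "terminal_bench": (0, 100)},
))  # 75.1
```

Normalizing first is what makes scores on different scales (pass rates, Elo, accuracy) comparable inside one index.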
Frequently asked
Answers pulled from the dataset · updated daily
How many benchmarks does BenchGecko track?
BenchGecko tracks 128 AI benchmarks across 6 categories · 95 knowledge, 11 reasoning, 8 code, 5 multimodal, 5 math, 4 agent. 2,893 total model-benchmark scores from 303 models.
Which benchmark is the hardest right now?
GPQA has the lowest mean score of the 128 benchmarks we track · 6.7 averaged across the 67 models tested on it · making it the hardest benchmark by that measure.
Which benchmark has the tightest race at the top?
Chatbot Arena Elo · Coding has the lowest standard deviation across its top 5 models (σ=38.4) · the leaders sit closest together, and the top spot swings between releases.
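The tightness statistic is simple: the standard deviation of a benchmark's five best scores, where lower means a more tightly packed leaderboard. A minimal sketch with hypothetical Elo values:

```python
import statistics


def top5_sigma(scores: list[float]) -> float:
    """Population std-dev of a benchmark's five best scores.

    Lower sigma = the top models are packed closer together = a tighter
    race. Illustrative metric; BenchGecko's exact computation may differ.
    """
    top5 = sorted(scores, reverse=True)[:5]
    return statistics.pstdev(top5)


# Hypothetical Elo scores for the top 5 models on one benchmark:
print(round(top5_sigma([1502.8, 1490.1, 1488.0, 1471.3, 1430.5]), 1))  # ~25.1
```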
Which benchmark has the most model coverage?
Chatbot Arena Elo · Overall has scores for 113 models · the most widely reported benchmark we track. The leader is Claude Opus 4.6 (Fast) at 1502.8.
How often is this data updated?
BenchGecko pulls benchmark scores from Epoch AI and cross-references against model release dates. The dataset is refreshed daily. Latest model release tracked · 2026-04-30.
Where do the benchmark scores come from?
Scores are sourced from Epoch AI benchmarks (CC-BY licensed), SWE-bench public leaderboards, and provider-published evals. Each benchmark detail page links to the original source.
See also
Keep exploring the BenchGecko graph