Benchmarks
128 benchmarks across 11 categories. Click a benchmark to see its full rankings.
coding
Aider — Code Editing
Aider Polyglot
Aider Polyglot · measures how well AI models can edit code across multiple programming languages using the Aider coding assistant framework.
CadEval
CadEval · evaluates the ability to generate and reason about Computer-Aided Design code, testing spatial reasoning and engineering knowledge.
Cybench
Cybench · evaluates AI on real Capture-The-Flag cybersecurity challenges, testing vulnerability analysis, exploitation, and security reasoning.
GSO-Bench
GSO-Bench · evaluates AI models on real-world open-source software engineering tasks, testing the ability to understand and resolve actual GitHub issues.
LiveBench — Agentic Coding
LiveBench — Coding
OpenCompass — LiveCodeBenchV6
SWE-bench Multilingual
SWE-bench Multimodal
SWE-bench Pro
SWE-bench Verified
SWE-bench Verified · 500 human-validated tasks from 12 real Python repositories (Django, Flask, scikit-learn, sympy, and others). Each task requires the model to produce a git patch that resolves a real GitHub issue and passes the test suite. The verified subset eliminates ambiguous tasks from the original SWE-bench. Claude Mythos Preview leads at 93.9%, crossing 90% for the first time in 2026. Opus 4.6 scores 80.8%. The benchmark remains the most-cited evaluation for code-generation capability.
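To make the task format concrete, here is a minimal sketch of the core check a SWE-bench-style harness performs: apply the model's git patch to a checkout of the repository, then run the tests tied to the issue. This is an illustration, not the official harness · the real evaluation pins dependencies and isolates each task in its own Docker image, and the function and path names below are hypothetical.

```python
import subprocess
from pathlib import Path

def evaluate_patch(repo_dir: Path, patch_file: Path, test_ids: list[str]) -> bool:
    """Hypothetical sketch of a SWE-bench-style pass/fail check."""
    # Apply the candidate patch; a patch that does not apply cleanly fails the task.
    apply = subprocess.run(
        ["git", "apply", str(patch_file)],
        cwd=repo_dir, capture_output=True, text=True,
    )
    if apply.returncode != 0:
        return False

    # Run only the tests associated with the issue (the FAIL_TO_PASS set).
    tests = subprocess.run(
        ["python", "-m", "pytest", "-q", *test_ids],
        cwd=repo_dir, capture_output=True, text=True,
    )
    return tests.returncode == 0
```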
SWE-Bench Verified (Bash Only)
SWE-Bench Verified (Bash Only) · a curated subset of SWE-bench where models fix real Python repository bugs using only bash commands, no agent frameworks.
Terminal-Bench 2.0
Terminal-Bench 2.0 · evaluates AI agents on real terminal-based coding tasks · writing scripts, debugging, running tests, and managing projects entirely through command-line interaction. Tests both code quality and terminal fluency. Claude Opus 4.7 scores 69.4%, demonstrating strong agentic terminal competence.
WeirdML
WeirdML · tests models on unusual and adversarial machine learning tasks that require creative problem-solving beyond standard patterns.
knowledge
ANLI
ANLI (Adversarial NLI) · adversarially constructed natural language inference dataset where each round targets weaknesses found in previous model generations.
ARC AI2
AI2 Reasoning Challenge · tests grade-school level science knowledge with multiple-choice questions requiring reasoning beyond simple retrieval.
AudioMultiChallenge
AudioMultiChallenge — Audio Output
AudioMultiChallenge — Text Output
Balrog
Balrog · benchmarks AI agents on text-based adventure games, testing language understanding, strategic planning, and long-horizon reasoning.
C-Eval
Chess Puzzles
Chess Puzzles · tests strategic and tactical reasoning by having models solve chess puzzle positions, evaluating lookahead and pattern recognition abilities.
CMMLU
knowledgeCSQA2
knowledgeDeepResearch Bench
knowledgeDeepResearch Bench · evaluates AI on complex multi-step research tasks requiring information gathering, synthesis, and producing comprehensive analyses.
EnigmaEval
Fiction.LiveBench
Fiction.LiveBench · a continuously updated benchmark using recently published fiction to test reading comprehension and reasoning, preventing data contamination.
GeoBench
GeoBench · tests geographic knowledge and spatial reasoning across countries, landmarks, coordinates, and geopolitical understanding.
GPQA
GPQA Diamond
Graduate-Level Google-Proof QA (Diamond set) · expert-crafted questions in physics, biology, and chemistry that are difficult even for domain PhDs.
HellaSwag
HellaSwag · tests commonsense reasoning by asking models to predict the most plausible continuation of everyday scenarios.
HELM — GPQA
HELM — MMLU-Pro
HLE
HLE (Humanity's Last Exam) · a reasoning benchmark designed to be the hardest public evaluation of AI. Questions span mathematics, physics, philosophy, and logic · curated to be at or beyond the frontier of human expert capability. Tested with and without tool augmentation. Claude Opus 4.7 scores 46.9% without tools and 54.7% with tools · making it one of the few benchmarks where the top score is below 60%.
Humanity's Last Exam
Humanity's Last Exam (Text Only)
LAMBADA
LAMBADA · measures the ability to predict the final word of a passage, requiring broad contextual understanding across long text spans.
Lech Mazur Writing
Lech Mazur Writing · evaluates creative writing ability, assessing prose quality, narrative coherence, and stylistic sophistication.
LiveBench — Overall
MMLU
Massive Multitask Language Understanding · 57 subjects spanning STEM, humanities, social sciences, and more. The standard benchmark for broad knowledge.
MMLU-Pro
MMMLU
MultiChallenge
MultiNRC
OpenBookQA
OpenBookQA · science questions that require combining a given core fact with broad common knowledge, mimicking an open-book exam setting.
OpenCompass — GPQA-Diamond
OpenCompass — HLE
OpenCompass — MMLU-Pro
PIQA
PIQA (Physical Interaction QA) · tests intuitive physical reasoning by asking models to select the correct approach for everyday physical tasks.
PostTrainBench
Professional Reasoning — Finance
Professional Reasoning — Legal
ScienceQA
ScienceQA · multimodal science questions spanning natural science, social science, and language science with diverse question formats and image context.
SciPredict
SimpleQA Verified
SimpleQA Verified · short factual questions with verified answers, measuring factual accuracy and the tendency to hallucinate or provide incorrect information.
TriviaQA
TriviaQA · reading comprehension benchmark with trivia questions, requiring models to find and reason over evidence from provided documents.
TutorBench
VISTA
VisualToolBench (VTB)
VPCT
VPCT (Visual Pattern Completion Test) · tests visual reasoning and pattern recognition by having models complete visual sequences and transformations.
WinoGrande
WinoGrande · large-scale commonsense reasoning benchmark where models must resolve ambiguous pronouns in carefully constructed sentence pairs.
agentic
APEX-Agents
APEX-Agents · evaluates AI agents on complex, multi-step tasks requiring planning, tool use, and autonomous decision-making in realistic environments.
MCP Atlas
OSWorld
OSWorld · tests AI agents on real-world computer tasks across operating systems, including web browsing, file management, and application use.
Remote Labor Index (RLI)
SWE Atlas — Codebase QnA
SWE Atlas — Test Writing
SWE-Bench Pro (Private)
SWE-Bench Pro (Public)
The Agent Company
The Agent Company · tests AI agents on realistic corporate tasks like email management, code review, data analysis, and cross-tool workflows.
reasoning
ARC-AGI
ARC-AGI · the original Abstraction and Reasoning Corpus, testing whether AI can solve novel visual pattern recognition tasks without memorization.
ARC-AGI-2
ARC-AGI-2 · the second iteration of the Abstraction and Reasoning Corpus, testing novel pattern recognition and abstract reasoning without prior training data.
BBH
BIG-Bench Hard · a curated subset of 23 challenging tasks from BIG-Bench where language models previously failed to outperform average humans.
CharXiv Reasoning
CharXiv Reasoning · tests a model's ability to understand and reason about charts, plots, and figures extracted from real arXiv scientific papers. The model must answer questions that require reading axes, comparing data series, identifying trends, and drawing conclusions from visual scientific data. This benchmark specifically measures the intersection of visual understanding and scientific reasoning · a critical capability for research assistants and document analysis. Performance varies dramatically with and without tool use, making it a key differentiator for multimodal and agentic AI systems.
CharXiv Reasoning (with tools)
CharXiv Reasoning (with tools) · the tool-augmented variant of CharXiv Reasoning. Models can use code execution, image processing, and other tools to analyze charts from arXiv papers. Claude Mythos Preview scores 93.2% with tools vs 86.1% without · demonstrating how tool use dramatically improves visual scientific reasoning. The gap between tool-augmented and bare performance is a key signal for agent capability.
GraphWalks BFS 256K-1M
GraphWalks BFS 256K-1M · a long-context reasoning benchmark created by OpenAI that tests whether a model can perform breadth-first search (BFS) traversal across massive graphs encoded in 256,000 to 1,024,000 tokens of context. The model receives a graph represented as an edge list and must follow parent-child relationships across the entire extended context window. This is not simple retrieval · it requires true relational reasoning over hundreds of thousands of tokens. The dataset includes 100 problems at 256K context and 100 problems at 1,024K context. Claude Mythos Preview leads at 80.0%, more than doubling Opus 4.6 (38.7%) and far exceeding GPT-5.4 (21.4%). The massive performance gap between models makes this one of the most discriminating benchmarks for real long-context capability in 2026.
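The sketch below is the reference computation a model must reproduce purely in-context · a plain breadth-first traversal over the edge list. The edge-list encoding shown is illustrative, not OpenAI's exact prompt format.

```python
from collections import deque

def bfs_reachable(edges: list[tuple[str, str]], root: str, depth: int) -> set[str]:
    """Return all nodes within `depth` hops of `root`.

    Illustrative only: a GraphWalks prompt encodes edges like these as plain
    text inside a 256K-1M token context, and the model must produce the same
    frontier without any executable scratchpad.
    """
    children: dict[str, list[str]] = {}
    for parent, child in edges:
        children.setdefault(parent, []).append(child)

    seen = {root}
    frontier = deque([(root, 0)])
    while frontier:
        node, d = frontier.popleft()
        if d == depth:
            continue  # do not expand past the requested depth
        for nxt in children.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, d + 1))
    return seen

# Example: nodes within two hops of "a".
print(bfs_reachable([("a", "b"), ("b", "c"), ("c", "d")], "a", 2))  # {'a', 'b', 'c'}
```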
HELM — WildBench
HLE (with tools)
LiveBench — Data Analysis
LiveBench — Reasoning
MUSR
SimpleBench
SimpleBench · tests fundamental reasoning capabilities with straightforward problems designed to expose gaps in basic logical and spatial thinking.
speed
Artificial Analysis — Agentic Index
Artificial Analysis Agentic Index · a composite score measuring how well a model performs in agentic workflows · multi-step tool use, planning, error recovery, and autonomous task completion. Aggregates results from multiple agentic benchmarks including SWE-bench, tool-use tests, and planning evaluations. The canonical single-number metric for "how good is this model as an agent?"
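Artificial Analysis's exact weighting is not given here; as a labeled assumption, a composite index of this kind typically reduces to a weighted mean of normalized component scores, roughly:

```python
def composite_index(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Hypothetical composite: weighted mean of per-benchmark scores in [0, 100].

    The component names, scores, and equal weights below are assumptions for
    illustration, not Artificial Analysis's published methodology.
    """
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight

scores = {"swe_bench": 80.8, "tool_use": 74.0, "planning": 68.5}  # illustrative numbers
weights = {"swe_bench": 1.0, "tool_use": 1.0, "planning": 1.0}
print(round(composite_index(scores, weights), 1))  # 74.4
```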
Artificial Analysis — Coding Index
Artificial Analysis Coding Index · a composite score that aggregates performance across multiple coding benchmarks into a single index. Tracks code generation quality, debugging ability, multi-language competence, and real-world software engineering tasks. Used by Artificial Analysis to rank model coding capability in a normalized, comparable format. Useful for developers choosing between models for coding-heavy workloads.
Artificial Analysis — Quality Index
general
BBH (HuggingFace)
arena
Chatbot Arena Elo — Coding
Chatbot Arena Elo — Overall
safety
Fortress
MASK
PropensityBench
math
FrontierMath-2025-02-28-Private
FrontierMath (Feb 2025) · original research-level math problems created by mathematicians, testing capabilities at the boundary of current AI mathematical reasoning.
FrontierMath-Tier-4-2025-07-01-Private
FrontierMath Tier 4 (Jul 2025) · the most challenging tier of frontier mathematics, containing problems that push the absolute limits of AI mathematical reasoning.
GSM8K
Grade School Math 8K · 8,500 linguistically diverse grade-school math word problems that require multi-step reasoning to solve.
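A made-up GSM8K-style item, worked step by step, shows what multi-step reasoning means in practice (this problem is illustrative, not from the dataset):

```python
# "A baker makes 3 trays of 12 muffins, sells 20, and splits the rest
#  evenly into 2 boxes. How many muffins are in each box?"
made = 3 * 12        # step 1: total muffins baked -> 36
left = made - 20     # step 2: remaining after sales -> 16
per_box = left // 2  # step 3: split across 2 boxes -> 8
print(per_box)       # GSM8K grades the final integer by exact match
```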
HELM — Omni-MATH
LiveBench — Mathematics
MATH Level 5
MATH Level 5 · the hardest tier of the MATH benchmark, featuring competition-level problems from AMC, AIME, and Olympiad-style mathematics.
OpenCompass — AIME2025
OTIS Mock AIME 2024-2025
OTIS Mock AIME 2024-2025 · simulated American Invitational Mathematics Examination problems testing advanced problem-solving skills.
USAMO
language
HELM — IFEval
IFEval
JCommonsenseQA
JHumanEval
JMMLU
JNLI
JSQuAD
LiveBench — IF
LiveBench — Language
LLM-JP — Overall
MMMLU — Arabic
MMMLU — Bengali
MMMLU — Chinese
MMMLU — French
MMMLU — German
MMMLU — Hindi
MMMLU — Indonesian
MMMLU — Italian
MMMLU — Japanese
MMMLU — Korean
MMMLU — Portuguese
MMMLU — Spanish
MMMLU — Swahili
MMMLU — Yoruba
OpenCompass — IFEval
multimodal
VideoMME
VideoMME · multimodal benchmark testing video understanding across diverse domains, requiring temporal reasoning and cross-frame comprehension.