Benchmarks
128 benchmarks across 11 categories. Click a benchmark to see its full rankings.
coding
Aider — Code Editing
Aider Polyglot
Aider Polyglot · measures how well AI models can edit code across multiple programming languages using the Aider coding assistant framework.
CadEval
CadEval · evaluates the ability to generate and reason about Computer-Aided Design code, testing spatial reasoning and engineering knowledge.
Cybench
Cybench · evaluates AI on real Capture-The-Flag cybersecurity challenges, testing vulnerability analysis, exploitation, and security reasoning.
GSO-Bench
GSO-Bench · evaluates AI models on real-world open-source software engineering tasks, testing the ability to understand and resolve actual GitHub issues.
LiveBench — Agentic Coding
LiveBench — Coding
OpenCompass — LiveCodeBenchV6
SWE-bench Multilingual
SWE-bench Multimodal
SWE-bench Pro
SWE-bench Verified
SWE-bench Verified · 500 human-validated tasks from 12 real Python repositories (Django, Flask, scikit-learn, sympy, and others). Each task requires the model to produce a git patch that resolves a real GitHub issue and passes the test suite. The verified subset eliminates ambiguous tasks from the original SWE-bench. Claude Mythos Preview leads at 93.9%, crossing 90% for the first time in 2026. Opus 4.6 scores 80.8%. The benchmark remains the most-cited evaluation for code-generation capability.
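To make the task format concrete, here is a minimal sketch of the core check a SWE-bench-style harness performs: apply the model's git patch to a checkout of the repository, then run the tests tied to the issue. This is an illustration, not the official harness · the real evaluation pins dependencies and isolates each task in its own Docker image, and the function and path names below are hypothetical.

```python
import subprocess
from pathlib import Path

def evaluate_patch(repo_dir: Path, patch_file: Path, test_ids: list[str]) -> bool:
    """Hypothetical sketch of a SWE-bench-style pass/fail check."""
    # Apply the candidate patch; a patch that does not apply cleanly fails the task.
    apply = subprocess.run(
        ["git", "apply", str(patch_file)],
        cwd=repo_dir, capture_output=True, text=True,
    )
    if apply.returncode != 0:
        return False

    # Run only the tests associated with the issue (the FAIL_TO_PASS set).
    tests = subprocess.run(
        ["python", "-m", "pytest", "-q", *test_ids],
        cwd=repo_dir, capture_output=True, text=True,
    )
    return tests.returncode == 0
```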
SWE-Bench Verified (Bash Only)
SWE-Bench Verified (Bash Only) · a curated subset of SWE-bench where models fix real Python repository bugs using only bash commands, no agent frameworks.
Terminal-Bench 2.0
Terminal-Bench 2.0 · evaluates AI agents on real terminal-based coding tasks · writing scripts, debugging, running tests, and managing projects entirely through command-line interaction. Tests both code quality and terminal fluency. Claude Opus 4.7 scores 69.4%, demonstrating strong agentic terminal competence.
WeirdML
WeirdML · tests models on unusual and adversarial machine learning tasks that require creative problem-solving beyond standard patterns.
knowledge
ANLI
ANLI (Adversarial NLI) · adversarially constructed natural language inference dataset where each round targets weaknesses found in previous model generations.
ARC AI2
AI2 Reasoning Challenge · tests grade-school level science knowledge with multiple-choice questions requiring reasoning beyond simple retrieval.
AudioMultiChallenge
AudioMultiChallenge — Audio Output
AudioMultiChallenge — Text Output
Balrog
Balrog · benchmarks AI agents on text-based adventure games, testing language understanding, strategic planning, and long-horizon reasoning.
C-Eval
Chess Puzzles
Chess Puzzles · tests strategic and tactical reasoning by having models solve chess puzzle positions, evaluating lookahead and pattern recognition abilities.
CMMLU
knowledgeCSQA2
knowledgeDeepResearch Bench
knowledgeDeepResearch Bench · evaluates AI on complex multi-step research tasks requiring information gathering, synthesis, and producing comprehensive analyses.
EnigmaEval
Fiction.LiveBench
Fiction.LiveBench · a continuously updated benchmark using recently published fiction to test reading comprehension and reasoning, preventing data contamination.
GeoBench
GeoBench · tests geographic knowledge and spatial reasoning across countries, landmarks, coordinates, and geopolitical understanding.
GPQA
GPQA Diamond
Graduate-Level Google-Proof QA (Diamond set) · expert-crafted questions in physics, biology, and chemistry that are difficult even for domain PhDs.
HellaSwag
HellaSwag · tests commonsense reasoning by asking models to predict the most plausible continuation of everyday scenarios.
HELM — GPQA
HELM — MMLU-Pro
HLE
HLE (Humanity's Last Exam) · a reasoning benchmark designed to be the hardest public evaluation of AI. Questions span mathematics, physics, philosophy, and logic · curated to be at or beyond the frontier of human expert capability. Tested with and without tool augmentation. Claude Opus 4.7 scores 46.9% without tools and 54.7% with tools · making it one of the few benchmarks where the top score is below 60%.
Humanity's Last Exam
Humanity's Last Exam (Text Only)
LAMBADA
LAMBADA · measures the ability to predict the final word of a passage, requiring broad contextual understanding across long text spans.
Lech Mazur Writing
Lech Mazur Writing · evaluates creative writing ability, assessing prose quality, narrative coherence, and stylistic sophistication.
LiveBench — Overall
MMLU
Massive Multitask Language Understanding · 57 subjects spanning STEM, humanities, social sciences, and more. The standard benchmark for broad knowledge.
MMLU-Pro
MMMLU
MultiChallenge
MultiNRC
OpenBookQA
OpenBookQA · science questions that require combining a given core fact with broad common knowledge, mimicking an open-book exam setting.
OpenCompass — GPQA-Diamond
OpenCompass — HLE
OpenCompass — MMLU-Pro
PIQA
PIQA (Physical Interaction QA) · tests intuitive physical reasoning by asking models to select the correct approach for everyday physical tasks.
PostTrainBench
Professional Reasoning — Finance
Professional Reasoning — Legal
ScienceQA
ScienceQA · multimodal science questions spanning natural science, social science, and language science with diverse question formats and image context.
SciPredict
SimpleQA Verified
SimpleQA Verified · short factual questions with verified answers, measuring factual accuracy and the tendency to hallucinate or provide incorrect information.
TriviaQA
TriviaQA · reading comprehension benchmark with trivia questions, requiring models to find and reason over evidence from provided documents.
TutorBench
VISTA
VisualToolBench (VTB)
VPCT
VPCT (Visual Pattern Completion Test) · tests visual reasoning and pattern recognition by having models complete visual sequences and transformations.
WinoGrande
WinoGrande · large-scale commonsense reasoning benchmark where models must resolve ambiguous pronouns in carefully constructed sentence pairs.
agentic
APEX-Agents
APEX-Agents · evaluates AI agents on complex, multi-step tasks requiring planning, tool use, and autonomous decision-making in realistic environments.
MCP Atlas
OSWorld
OSWorld · tests AI agents on real-world computer tasks across operating systems, including web browsing, file management, and application use.
Remote Labor Index (RLI)
SWE Atlas — Codebase QnA
SWE Atlas — Test Writing
SWE-Bench Pro (Private)
SWE-Bench Pro (Public)
The Agent Company
The Agent Company · tests AI agents on realistic corporate tasks like email management, code review, data analysis, and cross-tool workflows.
reasoning
ARC-AGI
ARC-AGI · the original Abstraction and Reasoning Corpus, testing whether AI can solve novel visual pattern recognition tasks without memorization.
ARC-AGI-2
ARC-AGI-2 · the second iteration of the Abstraction and Reasoning Corpus, testing novel pattern recognition and abstract reasoning without prior training data.
BBH
BIG-Bench Hard · a curated subset of 23 challenging tasks from BIG-Bench where language models previously failed to outperform average humans.
CharXiv Reasoning
CharXiv Reasoning · tests a model's ability to understand and reason about charts, plots, and figures extracted from real arXiv scientific papers. The model must answer questions that require reading axes, comparing data series, identifying trends, and drawing conclusions from visual scientific data. This benchmark specifically measures the intersection of visual understanding and scientific reasoning · a critical capability for research assistants and document analysis. Performance varies dramatically with and without tool use, making it a key differentiator for multimodal and agentic AI systems.
CharXiv Reasoning (with tools)
CharXiv Reasoning (with tools) · the tool-augmented variant of CharXiv Reasoning. Models can use code execution, image processing, and other tools to analyze charts from arXiv papers. Claude Mythos Preview scores 93.2% with tools vs 86.1% without · demonstrating how tool use dramatically improves visual scientific reasoning. The gap between tool-augmented and bare performance is a key signal for agent capability.
GraphWalks BFS 256K-1M
GraphWalks BFS 256K-1M · a long-context reasoning benchmark created by OpenAI that tests whether a model can perform breadth-first search (BFS) traversal across massive graphs encoded in 256,000 to 1,024,000 tokens of context. The model receives a graph represented as an edge list and must follow parent-child relationships across the entire extended context window. This is not simple retrieval · it requires true relational reasoning over hundreds of thousands of tokens. The dataset includes 100 problems at 256K context and 100 problems at 1,024K context. Claude Mythos Preview leads at 80.0%, more than doubling Opus 4.6 (38.7%) and far exceeding GPT-5.4 (21.4%). The massive performance gap between models makes this one of the most discriminating benchmarks for real long-context capability in 2026.
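The sketch below is the reference computation a model must reproduce purely in-context · a plain breadth-first traversal over the edge list. The edge-list encoding shown is illustrative, not OpenAI's exact prompt format.

```python
from collections import deque

def bfs_reachable(edges: list[tuple[str, str]], root: str, depth: int) -> set[str]:
    """Return all nodes within `depth` hops of `root`.

    Illustrative only: a GraphWalks prompt encodes edges like these as plain
    text inside a 256K-1M token context, and the model must produce the same
    frontier without any executable scratchpad.
    """
    children: dict[str, list[str]] = {}
    for parent, child in edges:
        children.setdefault(parent, []).append(child)

    seen = {root}
    frontier = deque([(root, 0)])
    while frontier:
        node, d = frontier.popleft()
        if d == depth:
            continue  # do not expand past the requested depth
        for nxt in children.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, d + 1))
    return seen

# Example: nodes within two hops of "a".
print(bfs_reachable([("a", "b"), ("b", "c"), ("c", "d")], "a", 2))  # {'a', 'b', 'c'}
```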
HELM — WildBench
HLE (with tools)
LiveBench — Data Analysis
LiveBench — Reasoning
MUSR
SimpleBench
SimpleBench · tests fundamental reasoning capabilities with straightforward problems designed to expose gaps in basic logical and spatial thinking.
speed
Artificial Analysis — Agentic Index
Artificial Analysis Agentic Index · a composite score measuring how well a model performs in agentic workflows · multi-step tool use, planning, error recovery, and autonomous task completion. Aggregates results from multiple agentic benchmarks including SWE-bench, tool-use tests, and planning evaluations. The canonical single-number metric for "how good is this model as an agent?"
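Artificial Analysis's exact weighting is not given here; as a labeled assumption, a composite index of this kind typically reduces to a weighted mean of normalized component scores, roughly:

```python
def composite_index(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Hypothetical composite: weighted mean of per-benchmark scores in [0, 100].

    The component names, scores, and equal weights below are assumptions for
    illustration, not Artificial Analysis's published methodology.
    """
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight

scores = {"swe_bench": 80.8, "tool_use": 74.0, "planning": 68.5}  # illustrative numbers
weights = {"swe_bench": 1.0, "tool_use": 1.0, "planning": 1.0}
print(round(composite_index(scores, weights), 1))  # 74.4
```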
Artificial Analysis — Coding Index
Artificial Analysis Coding Index · a composite score that aggregates performance across multiple coding benchmarks into a single index. Tracks code generation quality, debugging ability, multi-language competence, and real-world software engineering tasks. Used by Artificial Analysis to rank model coding capability in a normalized, comparable format. Useful for developers choosing between models for coding-heavy workloads.
Artificial Analysis — Quality Index
general
BBH (HuggingFace)
arena
Chatbot Arena Elo — Coding
Chatbot Arena Elo — Overall
safety
Fortress
MASK
PropensityBench
math
FrontierMath-2025-02-28-Private
FrontierMath (Feb 2025) · original research-level math problems created by mathematicians, testing capabilities at the boundary of current AI mathematical reasoning.
FrontierMath-Tier-4-2025-07-01-Private
FrontierMath Tier 4 (Jul 2025) · the most challenging tier of frontier mathematics, containing problems that push the absolute limits of AI mathematical reasoning.
GSM8K
Grade School Math 8K · 8,500 linguistically diverse grade-school math word problems that require multi-step reasoning to solve.
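A made-up GSM8K-style item, worked step by step, shows what multi-step reasoning means in practice (this problem is illustrative, not from the dataset):

```python
# "A baker makes 3 trays of 12 muffins, sells 20, and splits the rest
#  evenly into 2 boxes. How many muffins are in each box?"
made = 3 * 12        # step 1: total muffins baked -> 36
left = made - 20     # step 2: remaining after sales -> 16
per_box = left // 2  # step 3: split across 2 boxes -> 8
print(per_box)       # GSM8K grades the final integer by exact match
```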
HELM — Omni-MATH
LiveBench — Mathematics
MATH Level 5
MATH Level 5 · the hardest tier of the MATH benchmark, featuring competition-level problems from AMC, AIME, and Olympiad-style mathematics.
OpenCompass — AIME2025
OTIS Mock AIME 2024-2025
OTIS Mock AIME 2024-2025 · simulated American Invitational Mathematics Examination problems testing advanced problem-solving skills.
USAMO
language
HELM — IFEval
IFEval
JCommonsenseQA
JHumanEval
JMMLU
JNLI
JSQuAD
LiveBench — IF
LiveBench — Language
LLM-JP — Overall
MMMLU — Arabic
MMMLU — Bengali
MMMLU — Chinese
MMMLU — French
MMMLU — German
MMMLU — Hindi
MMMLU — Indonesian
MMMLU — Italian
MMMLU — Japanese
MMMLU — Korean
MMMLU — Portuguese
MMMLU — Spanish
MMMLU — Swahili
MMMLU — Yoruba
OpenCompass — IFEval
multimodal
VideoMME
VideoMME · multimodal benchmark testing video understanding across diverse domains, requiring temporal reasoning and cross-frame comprehension.