Benchmarks
40 benchmarks across 6 categories. Click to see the full rankings.
knowledge
ARC AI2
AI2 Reasoning Challenge — tests grade-school-level science knowledge with multiple-choice questions requiring reasoning beyond simple retrieval.
HellaSwag
HellaSwag — tests commonsense reasoning by asking models to predict the most plausible continuation of everyday scenarios.
LAMBADA
LAMBADA — measures the ability to predict the final word of a passage, requiring broad contextual understanding across long text spans.
MMLU
Massive Multitask Language Understanding — 57 subjects spanning STEM, humanities, social sciences, and more. The standard benchmark for broad knowledge; a multiple-choice scoring sketch follows at the end of this category.
GPQA diamond
Graduate-Level Google-Proof QA (Diamond set) — expert-crafted questions in physics, biology, and chemistry that are difficult even for domain PhDs.
Winogrande
WinoGrande — large-scale commonsense reasoning benchmark where models must resolve ambiguous pronouns in carefully constructed sentence pairs.
Lech Mazur Writing
Lech Mazur Writing — evaluates creative writing ability, assessing prose quality, narrative coherence, and stylistic sophistication.
Fiction.LiveBench
Fiction.LiveBench — a continuously updated benchmark using recently published fiction to test reading comprehension and reasoning, preventing data contamination.
SimpleQA Verified
SimpleQA Verified — short factual questions with verified answers, measuring factual accuracy and the tendency to hallucinate or provide incorrect information.
Chess Puzzles
Chess Puzzles — tests strategic and tactical reasoning by having models solve chess puzzle positions, evaluating lookahead and pattern recognition abilities.
HLE
HLE (Humanity's Last Exam) — crowdsourced expert-level questions designed to be among the hardest possible challenges for AI systems across all domains.
TriviaQA
TriviaQA — reading comprehension benchmark with trivia questions, requiring models to find and reason over evidence from provided documents.
ScienceQA
ScienceQA — multimodal science questions spanning natural science, social science, and language science with diverse question formats and image context.
PIQA
PIQA (Physical Interaction QA) — tests intuitive physical reasoning by asking models to select the correct approach for everyday physical tasks.
OpenBookQA
OpenBookQA — science questions that require combining a given core fact with broad common knowledge, mimicking an open-book exam setting.
Balrog
Balrog — benchmarks AI agents on text-based adventure games, testing language understanding, strategic planning, and long-horizon reasoning.
GeoBench
GeoBench — tests geographic knowledge and spatial reasoning across countries, landmarks, coordinates, and geopolitical understanding.
ANLI
ANLI (Adversarial NLI) — adversarially constructed natural language inference dataset where each round targets weaknesses found in previous model generations.
DeepResearch Bench
DeepResearch Bench — evaluates AI on complex multi-step research tasks requiring information gathering, synthesis, and producing comprehensive analyses.
VPCT
VPCT (Visual Pattern Completion Test) — tests visual reasoning and pattern recognition by having models complete visual sequences and transformations.
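Most of the knowledge benchmarks above (MMLU, ARC, HellaSwag, PIQA, WinoGrande, and others) are reported as multiple-choice accuracy: the model picks one option per question and the score is the fraction of gold answers it matches. Here is a minimal sketch of that scoring loop, with a hypothetical ask_model function standing in for whatever model is being evaluated:

```python
# Minimal multiple-choice accuracy scoring, as used by benchmarks such as MMLU
# or ARC. `ask_model` is a hypothetical stand-in for the model under evaluation.

def ask_model(question: str, choices: list[str]) -> int:
    """Hypothetical model call: returns the index of the chosen answer."""
    raise NotImplementedError

def multiple_choice_accuracy(items: list[dict]) -> float:
    """Fraction of items where the model picks the gold answer index.

    Each item is a dict with "question", "choices", and "answer" keys,
    where "answer" is the index of the correct choice.
    """
    correct = 0
    for item in items:
        prediction = ask_model(item["question"], item["choices"])
        if prediction == item["answer"]:
            correct += 1
    return correct / len(items)

# Toy item for illustration only (not a real benchmark question):
# {"question": "Which planet is closest to the Sun?",
#  "choices": ["Venus", "Mercury", "Earth", "Mars"], "answer": 1}
```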
reasoning
BBH
BIG-Bench Hard — a curated subset of 23 challenging tasks from BIG-Bench where language models previously failed to outperform average humans.
SimpleBench
SimpleBench — tests fundamental reasoning capabilities with straightforward problems designed to expose gaps in basic logical and spatial thinking.
ARC-AGI-2
ARC-AGI-2 — the second iteration of the Abstraction and Reasoning Corpus, testing novel pattern recognition and abstract reasoning without prior training data.
ARC-AGI
ARC-AGI — the original Abstraction and Reasoning Corpus, testing whether AI can solve novel visual pattern recognition tasks without memorization.
math
GSM8K
Grade School Math 8K — 8,500 linguistically diverse grade-school math word problems that require multi-step reasoning to solve; an answer-checking sketch follows at the end of this category.
MATH level 5
MATH Level 5 — the hardest tier of the MATH benchmark, featuring competition-level problems from AMC, AIME, and Olympiad-style mathematics.
OTIS Mock AIME 2024-2025
OTIS Mock AIME 2024–2025 — simulated American Invitational Mathematics Examination problems testing advanced problem-solving skills.
FrontierMath-2025-02-28-Private
FrontierMath (Feb 2025) — original research-level math problems created by mathematicians, testing capabilities at the boundary of current AI mathematical reasoning.
FrontierMath-Tier-4-2025-07-01-Private
FrontierMath Tier 4 (Jul 2025) — the most challenging tier of frontier mathematics, containing problems that push the absolute limits of AI mathematical reasoning.
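Math benchmarks such as GSM8K and MATH are usually scored by matching the model's final answer against the reference answer rather than by grading the reasoning steps. Below is a minimal sketch under a simplified convention, assuming the final answer can be extracted as the last number in the response (real harnesses normalize answers more carefully):

```python
import re

def extract_final_number(response: str) -> float | None:
    """Pull the last number out of a model response (simplified convention)."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", response.replace(",", ""))
    return float(matches[-1]) if matches else None

def exact_match_score(responses: list[str], gold_answers: list[float]) -> float:
    """Fraction of responses whose final number equals the gold answer."""
    correct = sum(
        1
        for response, gold in zip(responses, gold_answers)
        if (pred := extract_final_number(response)) is not None and pred == gold
    )
    return correct / len(gold_answers)

# Toy example, illustrative only (not a real GSM8K item):
print(exact_match_score(["3 packs of 12 is 36, minus 5 gives 31."], [31.0]))  # 1.0
```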
coding
WeirdML
WeirdML — tests models on unusual and adversarial machine learning tasks that require creative problem-solving beyond standard patterns.
Aider polyglot
Aider Polyglot — measures how well AI models can edit code across multiple programming languages using the Aider coding assistant framework.
GSO-Bench
GSO-Bench — evaluates AI models on real-world open-source software engineering tasks, testing the ability to understand and resolve actual GitHub issues.
SWE-Bench Verified (Bash Only)
SWE-Bench Verified (Bash Only) — a curated subset of SWE-bench where models fix real Python repository bugs using only bash commands, with no agent frameworks; a test-based scoring sketch follows at the end of this category.
Terminal Bench
Terminal Bench — tests the ability to accomplish real-world tasks using terminal commands, evaluating shell scripting and CLI tool proficiency.
CadEval
CadEval — evaluates the ability to generate and reason about Computer-Aided Design code, testing spatial reasoning and engineering knowledge.
Cybench
Cybench — evaluates AI on real Capture-The-Flag cybersecurity challenges, testing vulnerability analysis, exploitation, and security reasoning.
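Several of the coding benchmarks above, SWE-Bench Verified in particular, judge a model by whether the project's tests pass after its edits are applied, not by how the code looks. A minimal sketch of that check, assuming a hypothetical layout in which each task directory already contains the patched repository and a pytest suite:

```python
# Test-based scoring for code-editing tasks: run each task's test suite on the
# repository after the model's edits and count the task as solved only if the
# tests pass. Paths and layout are hypothetical, for illustration only.
import subprocess
from pathlib import Path

def run_tests(repo_dir: Path) -> bool:
    """Run the repo's pytest suite; the task counts as solved only on a clean exit."""
    result = subprocess.run(
        ["python", "-m", "pytest", "-q"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        timeout=600,
    )
    return result.returncode == 0

def resolve_rate(task_dirs: list[Path]) -> float:
    """Fraction of tasks whose tests pass after the model's edits were applied."""
    solved = sum(1 for repo in task_dirs if run_tests(repo))
    return solved / len(task_dirs)

# Usage (hypothetical directory layout):
# tasks = sorted(Path("patched_repos").iterdir())
# print(resolve_rate(tasks))
```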
agentic
APEX-Agents
APEX-Agents — evaluates AI agents on complex, multi-step tasks requiring planning, tool use, and autonomous decision-making in realistic environments.
OSWorld
OSWorld — tests AI agents on real-world computer tasks across operating systems, including web browsing, file management, and application use.
The Agent Company
The Agent Company — tests AI agents on realistic corporate tasks like email management, code review, data analysis, and cross-tool workflows.
multimodal
VideoMME
VideoMME — multimodal benchmark testing video understanding across diverse domains, requiring temporal reasoning and cross-frame comprehension.