Benchmarks
How models are measured · SWE-bench, GPQA, MMLU.
Top 12 terms
SWE-bench · The real-world coding benchmark where AI resolves actual GitHub issues in open-source Python repos (a patch-and-test sketch follows this list).
HumanEval · A 164-problem Python benchmark where the model writes a function from its docstring and must pass its unit tests (an illustrative task follows this list).
LiveBench · A contamination-resistant benchmark that refreshes its tasks monthly to prevent models from memorizing the answers.
Chatbot Arena · Crowdsourced head-to-head AI model comparison: humans vote on anonymous outputs and Elo ratings rank the models (the rating update is sketched after this list).
MMLU-Pro · A harder version of MMLU with 10 answer choices per question, filtered noise, and more reasoning-heavy questions.
AI2 ARC (AI2 Reasoning Challenge) is a grade-school science question benchmark tracked by BenchGecko across every frontier and open-weight model.
BBH (BIG-Bench Hard) is a reasoning benchmark of 23 challenging BIG-Bench tasks, tracked by BenchGecko across every frontier and open-weight model.
GSM8K is a grade-school math word-problem benchmark tracked by BenchGecko across every frontier and open-weight model.
HellaSwag is a commonsense sentence-completion benchmark tracked by BenchGecko across every frontier and open-weight model.
LAMBADA is a long-range language-modeling benchmark (predict a passage's final word) tracked by BenchGecko across every frontier and open-weight model.
MMLU (Massive Multitask Language Understanding) is a 57-subject knowledge benchmark tracked by BenchGecko across every frontier and open-weight model.
GPQA Diamond is a graduate-level, "Google-proof" science question benchmark (the hardest GPQA subset) tracked by BenchGecko across every frontier and open-weight model.
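To make the SWE-bench mechanics concrete, here is a minimal sketch of the resolve-and-verify loop, assuming a checked-out repo, a model-generated patch file, and a known failing-test command. The function name, paths, and test selection are illustrative, not the official harness.

```python
import subprocess

def resolves_issue(repo_dir: str, patch_file: str, test_cmd: list[str]) -> bool:
    """Illustrative SWE-bench-style check: apply a model-generated patch
    to a repo checkout, then rerun the tests tied to the GitHub issue."""
    # Apply the candidate patch (hypothetical file layout, not the real harness).
    applied = subprocess.run(
        ["git", "apply", patch_file], cwd=repo_dir, capture_output=True
    )
    if applied.returncode != 0:
        return False  # patch does not apply cleanly
    # The issue counts as resolved only if the previously failing tests now pass.
    tests = subprocess.run(test_cmd, cwd=repo_dir, capture_output=True)
    return tests.returncode == 0

# Example usage (repo, patch name, and test path are made up for illustration):
# resolves_issue("astropy", "model.patch", ["pytest", "astropy/io/tests", "-q"])
```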
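The HumanEval setup is easiest to see in miniature: the model receives a signature plus docstring and must produce a body that passes hidden unit tests (scored as pass@k over samples). The toy problem below is illustrative, not one of the actual 164 tasks.

```python
# Prompt given to the model: a function signature plus docstring.
def running_max(numbers: list[int]) -> list[int]:
    """Return a list where element i is the maximum of numbers[0..i]."""
    # --- model-generated completion starts here ---
    result, current = [], float("-inf")
    for n in numbers:
        current = max(current, n)
        result.append(current)
    return result

# Hidden unit tests then decide pass/fail for each sampled completion.
assert running_max([3, 1, 4, 1, 5]) == [3, 3, 4, 4, 5]
assert running_max([]) == []
```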
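For the arena-style ranking, a single vote nudges ratings via the standard Elo update shown below; the K-factor and starting ratings are illustrative, and published arena leaderboards may fit ratings statistically over all votes rather than updating them one at a time.

```python
def elo_update(r_winner: float, r_loser: float, k: float = 32.0) -> tuple[float, float]:
    """One online Elo update after a head-to-head vote.
    K controls how far a single vote moves the ratings (32 is illustrative)."""
    # Expected score of the winner under the Elo model.
    expected = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400.0))
    delta = k * (1.0 - expected)  # winner scored 1, was expected to score `expected`
    return r_winner + delta, r_loser - delta

# An upset (lower-rated model wins) moves ratings more than an expected result:
print(elo_update(1000.0, 1200.0))  # roughly (1024.3, 1175.7)
```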