Six terms to go from "I need an AI" to "here is the cheapest model that meets my spec."

MMLU
The baseline knowledge benchmark everyone cites.
MMLU is a knowledge benchmark tracked by BenchGecko across every frontier and open-weight model.
“On MMLU, all frontier models score 90%+ · the differentiator is fading.”
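To make the scoring concrete, here is a minimal sketch of how an MMLU-style multiple-choice item is graded: the model picks one of four lettered options, and accuracy is the fraction of exact matches. The `ask_model` function and the sample question are hypothetical stand-ins, not BenchGecko's harness.

```python
# Minimal sketch of MMLU-style scoring: four options, exact-match on the letter.
# `ask_model` is a hypothetical stand-in for a real model API call.

def ask_model(question: str, options: dict[str, str]) -> str:
    """Stand-in model: returns one of 'A'-'D'. Replace with a real API call."""
    return "B"

def score(items: list[dict]) -> float:
    correct = 0
    for item in items:
        prediction = ask_model(item["question"], item["options"])
        if prediction == item["answer"]:
            correct += 1
    return correct / len(items)

items = [
    {
        "question": "Which layer of the OSI model handles routing?",
        "options": {"A": "Transport", "B": "Network", "C": "Session", "D": "Physical"},
        "answer": "B",
    },
]
print(f"accuracy: {score(items):.0%}")
```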
SWE-bench
If your workload is code, this is the one to care about.
A benchmark where models attempt real GitHub issues · judged by whether their patch passes the project's test suite.
“SWE-bench is the single most-watched AI benchmark of 2026. Every coding agent release ships a SWE-bench number first.”
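A hedged sketch of that pass/fail judgment: apply the model's patch to a clean checkout, run the project's own test suite, and count the attempt as resolved only if the tests pass. The repo path, patch file, and test command are hypothetical placeholders, and the real harness checks specific designated tests rather than a blanket run.

```python
# Sketch of SWE-bench-style judging: apply a candidate patch, run the tests,
# and count the attempt as resolved only if the suite exits cleanly.
# `repo_dir` and `model_patch.diff` are hypothetical placeholders.
import subprocess

def judge(repo_dir: str, patch_path: str, test_cmd: list[str]) -> bool:
    # Apply the model-generated patch to a clean checkout.
    apply = subprocess.run(["git", "-C", repo_dir, "apply", patch_path])
    if apply.returncode != 0:
        return False  # the patch does not even apply
    # Run the project's test suite; exit code 0 means the fix passes.
    tests = subprocess.run(test_cmd, cwd=repo_dir)
    return tests.returncode == 0

resolved = judge("repo_dir", "model_patch.diff", ["pytest", "-q", "tests/"])
print("resolved" if resolved else "not resolved")
```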
Context window
How much input the model can hold at once.
The max number of tokens · input + output · a model can handle in a single request. Ranges from 32K to 2M in 2026.
“Context windows hit diminishing returns past 200K for most workloads. 1M+ is for agents and codebase-scale retrieval, not chat.”
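To check whether a prompt actually fits, count tokens before sending. A sketch using the open-source tiktoken tokenizer; the 200K budget is just an example figure, and other model families use different tokenizers.

```python
# Sketch: count tokens with tiktoken (pip install tiktoken) to see whether a
# prompt fits a given context window. The 200_000 budget is an example figure.
import tiktoken

def fits_context(prompt: str, max_output_tokens: int, window: int = 200_000) -> bool:
    enc = tiktoken.get_encoding("cl100k_base")  # OpenAI tokenizer; others differ
    input_tokens = len(enc.encode(prompt))
    # The window covers input + output, so reserve room for the reply.
    return input_tokens + max_output_tokens <= window

print(fits_context("Summarize this codebase...", max_output_tokens=4_096))
```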
By the end you can evaluate a model by benchmark match, price, context window, and speed · and pick the winner for your specific workload.
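As a worked example of that rubric, a sketch that filters a candidate list by benchmark score, context window, and speed, then returns the cheapest survivor. Every model name, score, and price below is made up for illustration.

```python
# Sketch of the selection rubric: filter by spec, then pick the cheapest model.
# All model names, scores, and prices below are illustrative, not real quotes.
from dataclasses import dataclass

@dataclass
class Model:
    name: str
    swe_bench: float        # % of issues resolved
    context_window: int     # tokens, input + output
    tokens_per_sec: float   # output speed
    usd_per_mtok: float     # blended price per million tokens

CANDIDATES = [
    Model("model-a", 62.0, 200_000, 90.0, 6.00),
    Model("model-b", 55.0, 1_000_000, 120.0, 2.50),
    Model("model-c", 48.0, 128_000, 200.0, 0.40),
]

def pick(min_swe: float, min_ctx: int, min_speed: float) -> Model | None:
    meets_spec = [
        m for m in CANDIDATES
        if m.swe_bench >= min_swe
        and m.context_window >= min_ctx
        and m.tokens_per_sec >= min_speed
    ]
    # Among models that meet the spec, the cheapest wins.
    return min(meets_spec, key=lambda m: m.usd_per_mtok, default=None)

winner = pick(min_swe=50.0, min_ctx=128_000, min_speed=80.0)
print(winner.name if winner else "no model meets the spec")
```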