Six terms to go from "I need an AI" to "here is the cheapest model that meets my spec."

MMLU
The baseline knowledge benchmark everyone cites.
MMLU is a knowledge benchmark tracked by BenchGecko across every frontier and open-weight model.
“On MMLU, all frontier models score 90%+ · the differentiator is fading.”
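To make the scoring concrete, here is a minimal sketch of how an MMLU-style multiple-choice item is graded: the model picks one of four lettered options, and accuracy is the fraction of exact matches. The `ask_model` function and the sample question are hypothetical stand-ins, not BenchGecko's harness.

```python
# Minimal sketch of MMLU-style scoring: four options, exact-match on the letter.
# `ask_model` is a hypothetical stand-in for a real model API call.

def ask_model(question: str, options: dict[str, str]) -> str:
    """Stand-in model: returns one of 'A'-'D'. Replace with a real API call."""
    return "B"

def score(items: list[dict]) -> float:
    correct = 0
    for item in items:
        prediction = ask_model(item["question"], item["options"])
        if prediction == item["answer"]:
            correct += 1
    return correct / len(items)

items = [
    {
        "question": "Which layer of the OSI model handles routing?",
        "options": {"A": "Transport", "B": "Network", "C": "Session", "D": "Physical"},
        "answer": "B",
    },
]
print(f"accuracy: {score(items):.0%}")
```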
SWE-bench
If your workload is code, this is the one to care about.
A benchmark where models attempt real GitHub issues · judged by whether their patch passes the project's test suite.
“SWE-bench is the single most-watched AI benchmark of 2026. Every coding agent release ships a SWE-bench number first.”
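A hedged sketch of that pass/fail judgment: apply the model's patch to a clean checkout, run the project's own test suite, and count the attempt as resolved only if the tests pass. The repo path, patch file, and test command are hypothetical placeholders, and the real harness checks specific designated tests rather than a blanket run.

```python
# Sketch of SWE-bench-style judging: apply a candidate patch, run the tests,
# and count the attempt as resolved only if the suite exits cleanly.
# `repo_dir` and `model_patch.diff` are hypothetical placeholders.
import subprocess

def judge(repo_dir: str, patch_path: str, test_cmd: list[str]) -> bool:
    # Apply the model-generated patch to a clean checkout.
    apply = subprocess.run(["git", "-C", repo_dir, "apply", patch_path])
    if apply.returncode != 0:
        return False  # the patch does not even apply
    # Run the project's test suite; exit code 0 means the fix passes.
    tests = subprocess.run(test_cmd, cwd=repo_dir)
    return tests.returncode == 0

resolved = judge("repo_dir", "model_patch.diff", ["pytest", "-q", "tests/"])
print("resolved" if resolved else "not resolved")
```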
Context window
How much input the model can hold at once.
The max number of tokens · input + output · a model can handle in a single request. Ranges from 32K to 2M in 2026.
“Context windows hit diminishing returns past 200K for most workloads. 1M+ is for agents and codebase-scale retrieval, not chat.”
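To check whether a prompt actually fits, count tokens before sending. A sketch using the open-source tiktoken tokenizer; the 200K budget is just an example figure, and other model families use different tokenizers.

```python
# Sketch: count tokens with tiktoken (pip install tiktoken) to see whether a
# prompt fits a given context window. The 200_000 budget is an example figure.
import tiktoken

def fits_context(prompt: str, max_output_tokens: int, window: int = 200_000) -> bool:
    enc = tiktoken.get_encoding("cl100k_base")  # OpenAI tokenizer; others differ
    input_tokens = len(enc.encode(prompt))
    # The window covers input + output, so reserve room for the reply.
    return input_tokens + max_output_tokens <= window

print(fits_context("Summarize this codebase...", max_output_tokens=4_096))
```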
By the end you can evaluate a model by benchmark match, price, context window, and speed · and pick the winner for your specific workload.
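As a worked example of that rubric, a sketch that filters a candidate list by benchmark score, context window, and speed, then returns the cheapest survivor. Every model name, score, and price below is made up for illustration.

```python
# Sketch of the selection rubric: filter by spec, then pick the cheapest model.
# All model names, scores, and prices below are illustrative, not real quotes.
from dataclasses import dataclass

@dataclass
class Model:
    name: str
    swe_bench: float        # % of issues resolved
    context_window: int     # tokens, input + output
    tokens_per_sec: float   # output speed
    usd_per_mtok: float     # blended price per million tokens

CANDIDATES = [
    Model("model-a", 62.0, 200_000, 90.0, 6.00),
    Model("model-b", 55.0, 1_000_000, 120.0, 2.50),
    Model("model-c", 48.0, 128_000, 200.0, 0.40),
]

def pick(min_swe: float, min_ctx: int, min_speed: float) -> Model | None:
    meets_spec = [
        m for m in CANDIDATES
        if m.swe_bench >= min_swe
        and m.context_window >= min_ctx
        and m.tokens_per_sec >= min_speed
    ]
    # Among models that meet the spec, the cheapest wins.
    return min(meets_spec, key=lambda m: m.usd_per_mtok, default=None)

winner = pick(min_swe=50.0, min_ctx=128_000, min_speed=80.0)
print(winner.name if winner else "no model meets the spec")
```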