Which benchmarks actually predict agent quality?

Multi-step agentic benchmarks are the ones to watch: tau-bench (customer service tool use), WebArena (browser tasks), GAIA (general assistants), and SWE-bench Verified (repo-level code agents). Single-turn benchmarks like MMLU do not predict agent performance well.

Should I use a cheap model inside my agent loop?

Yes, for most steps. Route the hard planning step to a frontier model (Claude Opus, GPT-5), then use a cheap model for routine tool calls, retries, and tool-result summarization. See the agent stack for the full pattern.
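
The routing idea above can be sketched as a small dispatch function. This is only an illustration: the model names, the `Step` type, and the step kinds are hypothetical placeholders, not identifiers from any provider or from the agent stack page.

```python
from dataclasses import dataclass

# Hypothetical model identifiers; substitute whatever your provider exposes.
FRONTIER_MODEL = "frontier-model"
CHEAP_MODEL = "cheap-model"

# Step kinds treated as routine, following the answer above (names are assumed).
ROUTINE_STEPS = {"tool_call", "retry", "summarize_tool_result"}


@dataclass(frozen=True)
class Step:
    kind: str    # e.g. "plan", "tool_call", "retry", "summarize_tool_result"
    prompt: str


def pick_model(step: Step) -> str:
    """Return the model name to use for one agent step."""
    if step.kind in ROUTINE_STEPS:
        return CHEAP_MODEL
    # Planning and any unrecognized step kind default to the stronger model.
    return FRONTIER_MODEL
```

Defaulting unknown step kinds to the frontier model is a design choice made here for safety; a cost-first setup might default the other way.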

Which providers support reliable tool calling?

Anthropic, OpenAI, Google, Mistral, Together, and Fireworks all have stable JSON-mode tool calling. DeepSeek and Qwen work for most cases but occasionally hallucinate tool names. Test each provider against your own tool schema before committing.
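
One way to test against your own schema is to validate every emitted tool call before executing it. The sketch below assumes the call arrives as a JSON object with `name` and `arguments` keys; real providers use different wire formats, so treat that shape and the `check_tool_call` helper as assumptions to adapt.

```python
import json


def check_tool_call(raw: str, tools: dict[str, set[str]]) -> list[str]:
    """Return the problems found in one JSON tool call (empty list if none).

    `tools` maps each known tool name to its set of required argument names.
    """
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(call, dict):
        return ["not a JSON object"]

    name = call.get("name")
    # A hallucinated tool name shows up here as a name missing from the schema.
    if not isinstance(name, str) or name not in tools:
        return [f"unknown tool name: {name!r}"]

    args = call.get("arguments", {})
    if not isinstance(args, dict):
        return ["arguments is not an object"]

    missing = sorted(tools[name] - set(args))
    return [f"missing argument: {m}" for m in missing]
```

Running this over a few hundred recorded calls per model gives a rough hallucinated-name rate for your own tools.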

How do I keep an agent under a budget cap?

Set a hard token cap per run, enable prompt caching (Anthropic, OpenAI, DeepInfra all support it), and log every call to detect runaway loops.
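
A minimal sketch of the token cap and call logging, assuming you can read input and output token counts from each response. The `RunBudget` class, the `signature` string used to spot repeated calls, and the default repeat limit are all hypothetical; prompt caching is configured on the provider side and is not shown.

```python
import logging
from collections import Counter

log = logging.getLogger("agent.budget")


class BudgetExceeded(RuntimeError):
    """Raised when a run passes its token cap or repeats one call too often."""


class RunBudget:
    def __init__(self, max_tokens: int, max_repeats: int = 3) -> None:
        self.max_tokens = max_tokens
        self.max_repeats = max_repeats
        self.used = 0
        self._seen = Counter()

    def record(self, model: str, tokens_in: int, tokens_out: int, signature: str) -> None:
        """Log one LLM call, then stop the run if a limit is crossed."""
        self.used += tokens_in + tokens_out
        self._seen[signature] += 1
        log.info("model=%s in=%d out=%d total=%d/%d",
                 model, tokens_in, tokens_out, self.used, self.max_tokens)
        if self.used > self.max_tokens:
            raise BudgetExceeded(f"token cap exceeded: {self.used} > {self.max_tokens}")
        # The same call signature seen many times is treated as a runaway loop.
        if self._seen[signature] > self.max_repeats:
            raise BudgetExceeded(f"possible runaway loop: {signature!r} repeated")
```

Call `record` after every model response and let `BudgetExceeded` end the run; the signature could be, for example, the tool name plus its serialized arguments.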


Cheapest agent LLMs

The cheapest LLMs that score well on agentic benchmarks. Ranked by price per 1M input tokens.

Models: 30

Cheapest: $0.00

Scope: SWE · tau · WebArena


What this page is

This page ranks every LLM with meaningful agentic scores (SWE-bench, tau-bench, WebArena, GAIA, AgentBench) by input price. Agent workloads spike token consumption because of looping, tool schemas, and retries. Choosing a cheaper model for routine tool calls while reserving frontier models for planning can cut agent bills by 50 to 90 percent.
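
To see where a figure in that range can come from, here is a rough calculation of input-spend savings when a share of tokens moves to a cheaper model. The function name and the example frontier price are assumptions for illustration; the result depends entirely on your own traffic mix and prices.

```python
def blended_savings(cheap_share: float, cheap_price: float, frontier_price: float) -> float:
    """Fraction of input spend saved versus sending every token to the frontier model.

    cheap_share is the fraction of input tokens routed to the cheap model;
    prices are in dollars per 1M input tokens.
    """
    if not 0.0 <= cheap_share <= 1.0:
        raise ValueError("cheap_share must be between 0 and 1")
    if frontier_price <= 0:
        raise ValueError("frontier_price must be positive")
    blended = cheap_share * cheap_price + (1.0 - cheap_share) * frontier_price
    return 1.0 - blended / frontier_price
```

For example, routing 80 percent of input tokens to a $0.10 model (the Gemini 2.5 Flash Lite price in the table) against a hypothetical $3.00 frontier price gives `blended_savings(0.8, 0.10, 3.00)`, roughly 0.77, or about a 77 percent cut in input spend.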

Ranked by input price

Models with agent benchmark scores, cheapest first.

#	Model	Provider	In $/1M	Out $/1M	Context	avg score	Type
1	Gemma 4 26B A4B (free)	Google DeepMind	$0.00	$0.00	262K	0.0	OSS
2	Gemma 4 31B (free)	Google DeepMind	$0.00	$0.00	262K	0.0	OSS
3	gpt-oss-120b (free)	OpenAI	$0.00	$0.00	131K	68.7	OSS
4	gpt-oss-20b (free)	OpenAI	$0.00	$0.00	131K	66.4	OSS
5	LFM2.5-1.2B-Instruct (free)	liquid	$0.00	$0.00	33K	0.0	OSS
6	LFM2.5-1.2B-Thinking (free)	liquid	$0.00	$0.00	33K	0.0	OSS
7	Qwen3 Coder 480B A35B (free)	Alibaba Qwen	$0.00	$0.00	262K	0.0	OSS
8	Qwen3 Next 80B A3B Instruct (free)	Alibaba Qwen	$0.00	$0.00	262K	0.0	OSS
9	Granite 4.0 Micro	ibm-granite	$0.02	$0.11	131K	0.0	OSS
10	LFM2-24B-A2B	liquid	$0.03	$0.12	33K	0.0	OSS
11	gpt-oss-120b	OpenAI	$0.04	$0.18	131K	46.9	OSS
12	GPT-5 Nano	OpenAI	$0.05	$0.40	400K	45.3	Closed
13	Qwen3 235B A22B Instruct 2507	Alibaba Qwen	$0.07	$0.10	262K	48.5	OSS
14	Llama 4 Scout	Meta	$0.08	$0.30	328K	18.9	OSS
15	MiMo-V2-Flash	xiaomi	$0.09	$0.29	262K	73.3	OSS
16	Nemotron 3 Super	NVIDIA	$0.09	$0.45	262K	32.5	OSS
17	Qwen3 Next 80B A3B Instruct	Alibaba Qwen	$0.09	$1.10	262K	54.4	OSS
18	Qwen3 Next 80B A3B Thinking	Alibaba Qwen	$0.10	$0.78	131K	61.6	OSS
19	Gemini 2.0 Flash	Google DeepMind	$0.10	$0.40	1.0M	48.0	Closed
20	Gemini 2.5 Flash Lite	Google DeepMind	$0.10	$0.40	1.0M	59.1	Closed
21	Qwen3.5-9B	Alibaba Qwen	$0.10	$0.15	262K	0.0	OSS
22	Step 3.5 Flash	stepfun	$0.10	$0.30	262K	76.9	OSS
23	Qwen3 Coder Next	Alibaba Qwen	$0.12	$0.80	262K	0.0	OSS
24	Gemma 4 31B	Google DeepMind	$0.13	$0.38	262K	61.6	OSS
25	Qwen3 235B A22B Thinking 2507	Alibaba Qwen	$0.15	$1.50	131K	55.9	OSS
26	Llama 4 Maverick	Meta	$0.15	$0.60	1.0M	28.0	OSS
27	MiniMax M2.5	minimax	$0.15	$1.15	197K	55.1	OSS
28	Mistral Small 4	Mistral AI	$0.15	$0.60	262K	0.0	OSS
29	Qwen3.5-35B-A3B	Alibaba Qwen	$0.15	$1.00	262K	0.0	OSS
30	Solar Pro 3	upstage	$0.15	$0.60	128K	0.0	Closed
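
Many rows above show an avg score of 0.0, which may simply mean no score is recorded (an assumption; the page does not say). A short sketch of filtering the table by a minimum score before ranking by input price, using a handful of rows copied from the table; the `cheapest_scored` helper is hypothetical.

```python
import csv
import io

# A few rows copied from the table above, tab-separated like the original.
SAMPLE_TSV = (
    "Model\tIn $/1M\tavg score\n"
    "gpt-oss-120b (free)\t$0.00\t68.7\n"
    "Granite 4.0 Micro\t$0.02\t0.0\n"
    "gpt-oss-120b\t$0.04\t46.9\n"
    "GPT-5 Nano\t$0.05\t45.3\n"
    "Step 3.5 Flash\t$0.10\t76.9\n"
)


def cheapest_scored(tsv: str, min_score: float) -> list[str]:
    """Return model names with avg score >= min_score, cheapest input price first."""
    rows = csv.DictReader(io.StringIO(tsv), delimiter="\t")
    kept = [
        (float(row["In $/1M"].lstrip("$")), row["Model"])
        for row in rows
        if float(row["avg score"]) >= min_score
    ]
    # Sort on price only; the stable sort keeps table order for equal prices.
    return [model for _, model in sorted(kept, key=lambda item: item[0])]
```

With a threshold of 50, `cheapest_scored(SAMPLE_TSV, 50)` keeps only gpt-oss-120b (free) and Step 3.5 Flash from this sample.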

Top 3 cheapest agent LLMs

Cheapest agent LLM

Gemma 4 26B A4B (free)

Gemma 4 26B A4B (free) has agentic scores (SWE-bench, tau-bench, WebArena) and lands at $0.00 per 1M input tokens. Useful when agent loops spike token spend.

Gemma 4 31B (free)

Gemma 4 31B (free) has agentic scores (SWE-bench, tau-bench, WebArena) and lands at $0.00 per 1M input tokens. Useful when agent loops spike token spend.

gpt-oss-120b (free)

gpt-oss-120b (free) has agentic scores (SWE-bench, tau-bench, WebArena) and lands at $0.00 per 1M input tokens. Useful when agent loops spike token spend.

The price gap · cheapest vs most expensive

Cheapest: Gemma 4 26B A4B (free), $0.00 per 1M input tokens

Why the gap

With premium agent models you pay for more reliable tool use, fewer hallucinated tool names, and longer-horizon planning. For simple tool loops, the cheap end works fine.

Most expensive: Llama 4 Maverick, $0.15 per 1M input tokens

Frequently asked questions

Why are agent workloads expensive?

Agents loop. A single task may produce 10 to 100 LLM calls with heavy context re-use. That amplifies input costs (because the system prompt and tool schema repeat on every call) and multiplies output costs for every thinking step.
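
The amplification can be put in numbers with a back-of-the-envelope estimate. This sketch assumes every call resends a fixed prefix (system prompt plus tool schema) and ignores conversation history growth and prompt-cache discounts; the function and parameter names are made up for illustration.

```python
def run_cost_usd(
    calls: int,
    prefix_tokens: int,        # system prompt + tool schema, resent on every call
    step_input_tokens: int,    # fresh input per call (tool results, instructions)
    step_output_tokens: int,   # output per call, including any thinking
    in_price_per_m: float,     # dollars per 1M input tokens
    out_price_per_m: float,    # dollars per 1M output tokens
) -> float:
    """Estimate the dollar cost of one agent run under the assumptions above."""
    input_tokens = calls * (prefix_tokens + step_input_tokens)
    output_tokens = calls * step_output_tokens
    return (input_tokens * in_price_per_m + output_tokens * out_price_per_m) / 1_000_000
```

With the GPT-5 Nano prices from the table ($0.05 in, $0.40 out) and hypothetical token counts of 50 calls, a 3,000-token prefix, 500 fresh input tokens, and 300 output tokens per call, `run_cost_usd(50, 3000, 500, 300, 0.05, 0.40)` comes to roughly $0.015 per run, and about 86 percent of the input tokens are the repeated prefix.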
