Fastest LLM inference
Every major LLM ranked by tokens per second. Specialized inference chips (Groq, Cerebras, SambaNova) dominate the leaderboard.
Models tracked: 20
Fastest: 2100 tok/s
Scope: Groq · Cerebras · SambaNova · Major labs
What this page is
This page ranks models by inference speed (tokens per second) across providers. Specialized chips (Groq LPU, Cerebras WSE-3, SambaNova RDU) dominate the top. Major labs (OpenAI, Anthropic, Google) cluster in the mid-tier because their focus is quality, not raw throughput. Use this page when latency is a product feature: real-time chat, voice, and interactive agents.
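If you want to verify a provider's tokens-per-second figure yourself, the measurement is simple: count tokens as they stream in and divide by elapsed wall-clock time. A minimal sketch (the `stream` iterable is a hypothetical stand-in for any provider's streaming response):

```python
import time

def tokens_per_second(stream):
    """Measure decode throughput of a token stream.

    `stream` is any iterable yielding tokens; in practice this would be
    a provider's streaming API response (hypothetical stand-in here).
    """
    count = 0
    start = time.perf_counter()
    for _tok in stream:
        count += 1
    elapsed = time.perf_counter() - start
    # Guard against a zero-length timing window on trivially fast streams.
    return count / elapsed if elapsed > 0 else float("inf")
```

Note that published leaderboard numbers usually exclude time-to-first-token; a naive measurement like this one includes it, so short responses will read slower than the provider's steady-state decode rate.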
Ranked by tokens per second
| # | Model · Provider | tok/s |
|---|---|---|
| 1 | | 2100 |
| 2 | | 1800 |
| 3 | | 1400 |
| 4 | | 950 |
| 5 | | 780 |
| 6 | | 570 |
| 7 | | 520 |
| 8 | | 480 |
| 9 | | 400 |
| 10 | | 320 |
| 11 | | 260 |
| 12 | | 210 |
| 13 | | 180 |
| 14 | | 95 |
| 15 | | 85 |
| 16 | | 80 |
| 17 | | 55 |
| 18 | | 45 |
| 19 | | 40 |
| 20 | | 35 |
Speed tiers at a glance
Ultra-fast (1000+ tok/s)
Groq · Cerebras · SambaNova
Specialized inference chips supporting a curated set of open-source models. Use for real-time voice, live code completion, and latency-sensitive agents.
Fast (100 to 500 tok/s)
Together · Fireworks · DeepInfra
GPU-based but tuned for throughput. Broad model selection. Solid choice for production chat and RAG.
Standard (30 to 100 tok/s)
OpenAI · Anthropic · Google
Frontier labs. Speed is not the design goal; quality and reasoning depth are. Use when the best answer matters more than the fastest one.
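The three tiers above can be expressed as a simple threshold function. The tier boundaries leave a gap between 500 and 1000 tok/s; this sketch assumes anything at 1000+ is ultra-fast and anything from 100 up is fast, which is one reasonable reading:

```python
def speed_tier(tok_s: float) -> str:
    """Map a measured decode rate (tokens/second) to the page's speed tiers.

    Boundary handling between 500 and 1000 tok/s is an assumption,
    since the tier definitions leave that range unnamed.
    """
    if tok_s >= 1000:
        return "ultra-fast"   # Groq, Cerebras, SambaNova territory
    if tok_s >= 100:
        return "fast"         # throughput-tuned GPU providers
    if tok_s >= 30:
        return "standard"     # frontier labs; still feels instant in chat
    return "below interactive threshold"
```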
Frequently asked questions
How many tokens per second do I need?
Human reading speed is about 5 tok/s. Most chat UIs feel instant above 30 tok/s. Real-time voice and interactive agents benefit from 100+ tok/s. Groq and Cerebras push 1000+ tok/s for small and mid-size models.
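To make these thresholds concrete, the arithmetic for how long a response takes to stream is just token count divided by decode rate. A quick sketch comparing the bottom and top of the leaderboard for a typical 500-token answer:

```python
def stream_seconds(n_tokens: int, tok_per_s: float) -> float:
    """Wall-clock seconds to stream n_tokens at a given decode rate
    (ignores time-to-first-token)."""
    return n_tokens / tok_per_s

# A 500-token answer at the leaderboard's extremes:
slow = stream_seconds(500, 35)    # ~14.3 s at 35 tok/s
fast = stream_seconds(500, 2100)  # ~0.24 s at 2100 tok/s
```

The same answer drops from roughly fourteen seconds to well under a second, which is why latency-sensitive products gravitate to the top of the table.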