Fastest LLM inference
Every major LLM ranked by tokens per second. Specialized inference chips (Groq, Cerebras, SambaNova) dominate the leaderboard.
Models tracked: 20
Fastest: 2100 tok/s
Scope: Groq · Cerebras · SambaNova · Major labs
What this page is
This page ranks models by inference speed (tokens per second) across providers. Specialized chips (Groq LPU, Cerebras WSE-3, SambaNova RDU) dominate the top. Major labs (OpenAI, Anthropic, Google) cluster in the mid-tier because their focus is quality, not raw throughput. Use this page when latency is a product feature: real-time chat, voice, and interactive agents.
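If you want to verify a provider's tokens-per-second figure yourself, the measurement is simple: count tokens as they stream in and divide by elapsed wall-clock time. A minimal sketch (the `stream` iterable is a hypothetical stand-in for any provider's streaming response):

```python
import time

def tokens_per_second(stream):
    """Measure decode throughput of a token stream.

    `stream` is any iterable yielding tokens; in practice this would be
    a provider's streaming API response (hypothetical stand-in here).
    """
    count = 0
    start = time.perf_counter()
    for _tok in stream:
        count += 1
    elapsed = time.perf_counter() - start
    # Guard against a zero-length timing window on trivially fast streams.
    return count / elapsed if elapsed > 0 else float("inf")
```

Note that published leaderboard numbers usually exclude time-to-first-token; a naive measurement like this one includes it, so short responses will read slower than the provider's steady-state decode rate.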
Ranked by tokens per second
| # | Model · Provider | tok/s |
|---|---|---|
| 1 | | 2100 |
| 2 | | 1800 |
| 3 | | 1400 |
| 4 | | 950 |
| 5 | | 780 |
| 6 | | 570 |
| 7 | | 520 |
| 8 | | 480 |
| 9 | | 400 |
| 10 | | 320 |
| 11 | | 260 |
| 12 | | 210 |
| 13 | | 180 |
| 14 | | 95 |
| 15 | | 85 |
| 16 | | 80 |
| 17 | | 55 |
| 18 | | 45 |
| 19 | | 40 |
| 20 | | 35 |
Speed tiers at a glance
Ultra-fast (1000+ tok/s)
Groq · Cerebras · SambaNova
Specialized inference chips supporting a curated set of open-source models. Use for real-time voice, live code completion, and latency-sensitive agents.
Fast (100 to 500 tok/s)
Together · Fireworks · DeepInfra
GPU-based but tuned for throughput. Broad model selection. Solid choice for production chat and RAG.
Standard (30 to 100 tok/s)
OpenAI · Anthropic · Google
Frontier labs. Speed is not the design goal; quality and reasoning depth are. Use when the best answer matters more than the fastest one.
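The three tiers above can be expressed as a simple threshold function. The tier boundaries leave a gap between 500 and 1000 tok/s; this sketch assumes anything at 1000+ is ultra-fast and anything from 100 up is fast, which is one reasonable reading:

```python
def speed_tier(tok_s: float) -> str:
    """Map a measured decode rate (tokens/second) to the page's speed tiers.

    Boundary handling between 500 and 1000 tok/s is an assumption,
    since the tier definitions leave that range unnamed.
    """
    if tok_s >= 1000:
        return "ultra-fast"   # Groq, Cerebras, SambaNova territory
    if tok_s >= 100:
        return "fast"         # throughput-tuned GPU providers
    if tok_s >= 30:
        return "standard"     # frontier labs; still feels instant in chat
    return "below interactive threshold"
```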
Frequently asked questions
How many tokens per second do I need?
Human reading speed is about 5 tok/s. Most chat UIs feel instant above 30 tok/s. Real-time voice and interactive agents benefit from 100+ tok/s. Groq and Cerebras push 1000+ tok/s for small and mid-size models.
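To make these thresholds concrete, the arithmetic for how long a response takes to stream is just token count divided by decode rate. A quick sketch comparing the bottom and top of the leaderboard for a typical 500-token answer:

```python
def stream_seconds(n_tokens: int, tok_per_s: float) -> float:
    """Wall-clock seconds to stream n_tokens at a given decode rate
    (ignores time-to-first-token)."""
    return n_tokens / tok_per_s

# A 500-token answer at the leaderboard's extremes:
slow = stream_seconds(500, 35)    # ~14.3 s at 35 tok/s
fast = stream_seconds(500, 2100)  # ~0.24 s at 2100 tok/s
```

The same answer drops from roughly fourteen seconds to well under a second, which is why latency-sensitive products gravitate to the top of the table.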