
DeepSeek V3.2 Speciale vs Kimi K2.5 vs GLM 5

Side by side · benchmarks, pricing, and signals you can act on.

Winner summary

Kimi K2.5 wins 13 of the 21 shared benchmarks, leading in math, knowledge, and language; a sketch of the win-count logic follows the category leads below.

Category leads
math · Kimi K2.5
knowledge · Kimi K2.5
language · Kimi K2.5
coding · GLM 5
speed · Kimi K2.5
reasoning · Kimi K2.5
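The win count above is presumably derived by taking, for each benchmark reported by at least two of the picked models, the model with the top score. A minimal Python sketch of that logic, using a small illustrative slice of the scores (not the full dataset) and tie handling that is an assumption rather than the site's documented rule:

```python
# Minimal sketch of how a "wins X of Y shared benchmarks" summary could be
# derived. The scores dict is a small illustrative slice, not the full data,
# and the tie handling (first max wins) is an assumption, not the site's rule.

scores = {
    # benchmark: {model: score}; a model absent from the dict was not evaluated
    "OpenCompass AIME2025": {"DeepSeek V3.2 Speciale": 96.0, "Kimi K2.5": 91.9, "GLM 5": 95.8},
    "OpenCompass GPQA-Diamond": {"DeepSeek V3.2 Speciale": 86.7, "Kimi K2.5": 88.1, "GLM 5": 85.3},
    "SWE-bench Verified": {"Kimi K2.5": 73.8, "GLM 5": 72.1},
    "Terminal Bench": {"Kimi K2.5": 43.2, "GLM 5": 52.4},
}

wins = {}
shared = 0
for bench, by_model in scores.items():
    if len(by_model) < 2:  # a benchmark counts as "shared" only if 2+ models report it
        continue
    shared += 1
    top = max(by_model, key=by_model.get)
    wins[top] = wins.get(top, 0) + 1

for model, n in sorted(wins.items(), key=lambda kv: -kv[1]):
    print(f"{model}: wins {n} of {shared} shared benchmarks")
```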
Hype vs Reality
DeepSeek V3.2 Speciale · #6 by perf · #5 by attention · DESERVED
Kimi K2.5 · #87 by perf · no signal · QUIET
GLM 5 · #55 by perf · #27 by attention · UNDERRATED
Best value
DeepSeek V3.2 Speciale offers 2.1x better value than the runner-up, GLM 5.
DeepSeek V3.2 Speciale · 97.8 pts/$ · $0.80/M
Kimi K2.5 · 42.6 pts/$ · $1.22/M
GLM 5 · 45.7 pts/$ · $1.26/M
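The pts/$ figures read as a quality score divided by a blended per-million-token price: each model's listed $/M matches a simple 50/50 average of its input and output prices ($0.40 and $1.20 average to $0.80, and so on). The page does not state which points it uses, so the quality numbers in this sketch are placeholders chosen only so the output reproduces the listed pts/$ figures:

```python
# Sketch of the value math, assuming "$/M" is a 50/50 blend of input and
# output prices (the listed figures are consistent with that). The quality
# points below are placeholders back-fitted to reproduce the page's pts/$;
# the page does not state its actual points basis.

models = {
    # name: (input $/M, output $/M, quality points -- placeholder values)
    "DeepSeek V3.2 Speciale": (0.40, 1.20, 78.2),
    "Kimi K2.5": (0.44, 2.00, 52.0),
    "GLM 5": (0.60, 1.92, 57.6),
}

for name, (inp, out, pts) in models.items():
    blended = (inp + out) / 2   # $0.80, $1.22, $1.26 -- matches the listed $/M
    value = pts / blended       # points per dollar of blended price
    print(f"{name}: ${blended:.2f}/M blended, {value:.1f} pts/$")
```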
Vendor risk
One or more vendors flagged
DeepSeek · $3.4B · Tier 1 · Higher risk
moonshotai · private · undisclosed · Unknown
z-ai · private · undisclosed · Unknown
Head to head
DeepSeek V3.2 Speciale · Kimi K2.5 · GLM 5
OpenCompass · AIME2025
DeepSeek V3.2 Speciale leads by +0.2
DeepSeek V3.2 Speciale
96.0
Kimi K2.5
91.9
GLM 5
95.8
OpenCompass · GPQA-Diamond
Kimi K2.5 leads by +1.4
DeepSeek V3.2 Speciale
86.7
Kimi K2.5
88.1
GLM 5
85.3
OpenCompass · HLE
DeepSeek V3.2 Speciale and Kimi K2.5 tie at 28.6
DeepSeek V3.2 Speciale
28.6
Kimi K2.5
28.6
GLM 5
28.1
OpenCompass · IFEval
Kimi K2.5 leads by +0.7
DeepSeek V3.2 Speciale
91.7
Kimi K2.5
93.9
GLM 5
93.2
OpenCompass · LiveCodeBenchV6
GLM 5 leads by +5.3
DeepSeek V3.2 Speciale
80.9
Kimi K2.5
80.6
GLM 5
86.2
OpenCompass · MMLU-Pro
Kimi K2.5 leads by +0.7
DeepSeek V3.2 Speciale
85.5
Kimi K2.5
86.2
GLM 5
85.2
Artificial Analysis · Agentic Index
Kimi K2.5 leads by +58.9
Artificial Analysis Agentic Index · a composite score measuring how well a model performs in agentic workflows · multi-step tool use, planning, error recovery, and autonomous task completion. Aggregates results from multiple agentic benchmarks including SWE-bench, tool-use tests, and planning evaluations. The canonical single-number metric for "how good is this model as an agent?"
DeepSeek V3.2 Speciale
0.0
Kimi K2.5
58.9
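Artificial Analysis does not publish the exact weighting behind this index on this page, so the following is only a generic illustration of how a composite index can be built: an equally weighted mean of sub-benchmark scores rescaled to 0-100, with placeholder sub-benchmark names and numbers.

```python
# Illustrative only: NOT the Artificial Analysis methodology. A generic
# composite -- an equally weighted mean of sub-benchmark scores, each
# rescaled to 0-100. Sub-benchmark names and numbers are placeholders.

def composite_index(sub_scores: dict[str, float], max_scores: dict[str, float]) -> float:
    """Equal-weight mean of sub-benchmark scores, each rescaled to 0-100."""
    normalized = [100 * sub_scores[k] / max_scores[k] for k in sub_scores]
    return sum(normalized) / len(normalized)

example = {"swe_bench_verified": 73.8, "tool_use": 61.0, "planning": 48.5}
scales  = {"swe_bench_verified": 100.0, "tool_use": 100.0, "planning": 100.0}
print(round(composite_index(example, scales), 1))
```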
Artificial Analysis · Coding Index
Kimi K2.5 leads by +1.7
Artificial Analysis Coding Index · a composite score that aggregates performance across multiple coding benchmarks into a single index. Tracks code generation quality, debugging ability, multi-language competence, and real-world software engineering tasks. Used by Artificial Analysis to rank model coding capability in a normalized, comparable format. Useful for developers choosing between models for coding-heavy workloads.
DeepSeek V3.2 Speciale
37.9
Kimi K2.5
39.5
Artificial Analysis · Quality Index
Kimi K2.5 leads by +17.4
DeepSeek V3.2 Speciale
29.4
Kimi K2.5
46.8
ARC-AGI
Kimi K2.5 leads by +20.7
ARC-AGI · the original Abstraction and Reasoning Corpus, testing whether AI can solve novel visual pattern recognition tasks without memorization.
Kimi K2.5
65.3
GLM 5
44.7
ARC-AGI-2
Kimi K2.5 leads by +7.0
ARC-AGI-2 · the second iteration of the Abstraction and Reasoning Corpus, testing novel pattern recognition and abstract reasoning without prior training data.
Kimi K2.5
11.8
GLM 5
4.9
Chess Puzzles
Kimi K2.5 leads by +2.0
Chess Puzzles · tests strategic and tactical reasoning by having models solve chess puzzle positions, evaluating lookahead and pattern recognition abilities.
Kimi K2.5
12.0
GLM 5
10.0
FrontierMath-2025-02-28-Private
Kimi K2.5 leads by +11.5
FrontierMath (Feb 2025) · original research-level math problems created by mathematicians, testing capabilities at the boundary of current AI mathematical reasoning.
Kimi K2.5
27.9
GLM 5
16.4
FrontierMath-Tier-4-2025-07-01-Private
Kimi K2.5 leads by +2.1
FrontierMath Tier 4 (Jul 2025) · the most challenging tier of frontier mathematics, containing problems that push the absolute limits of AI mathematical reasoning.
Kimi K2.5
4.2
GLM 5
2.1
GPQA diamond
GLM 5 leads by +0.3
Graduate-Level Google-Proof QA (Diamond set) · expert-crafted questions in physics, biology, and chemistry that are difficult even for domain PhDs.
Kimi K2.5
83.5
GLM 5
83.8
OTIS Mock AIME 2024-2025
Kimi K2.5 leads by +12.2
OTIS Mock AIME 2024-2025 · simulated American Invitational Mathematics Examination problems testing advanced problem-solving skills.
Kimi K2.5
92.2
GLM 5
80.0
PostTrainBench
GLM 5 leads by +3.6
Kimi K2.5
10.3
GLM 5
13.9
SimpleBench
GLM 5 leads by +7.7
SimpleBench · tests fundamental reasoning capabilities with straightforward problems designed to expose gaps in basic logical and spatial thinking.
Kimi K2.5
36.2
GLM 5
43.8
SWE-bench Verified
Kimi K2.5 leads by +1.7
SWE-bench Verified · 500 human-validated tasks from 12 real Python repositories (Django, Flask, scikit-learn, sympy, and others). Each task requires the model to produce a git patch that resolves a real GitHub issue and passes the test suite. The verified subset eliminates ambiguous tasks from the original SWE-bench. Claude Mythos Preview leads at 93.9%, crossing 90% for the first time in 2026. Opus 4.6 scores 80.8%. The benchmark remains the most-cited evaluation for code-generation capability.
Kimi K2.5
73.8
GLM 5
72.1
Terminal Bench
GLM 5 leads by +9.2
Terminal-Bench 2.0 · evaluates AI agents on real terminal-based coding tasks · writing scripts, debugging, running tests, and managing projects entirely through command-line interaction. Tests both code quality and terminal fluency. Claude Opus 4.7 scores 69.4%, demonstrating significant agentic terminal competence.
Kimi K2.5
43.2
GLM 5
52.4
WeirdML
GLM 5 leads by +2.6
WeirdML · tests models on unusual and adversarial machine learning tasks that require creative problem-solving beyond standard patterns.
Kimi K2.5
45.6
GLM 5
48.2
Full benchmark table
Benchmark: DeepSeek V3.2 Speciale / Kimi K2.5 / GLM 5
OpenCompass · AIME2025: 96.0 / 91.9 / 95.8
OpenCompass · GPQA-Diamond: 86.7 / 88.1 / 85.3
OpenCompass · HLE: 28.6 / 28.6 / 28.1
OpenCompass · IFEval: 91.7 / 93.9 / 93.2
OpenCompass · LiveCodeBenchV6: 80.9 / 80.6 / 86.2
OpenCompass · MMLU-Pro: 85.5 / 86.2 / 85.2
Artificial Analysis · Agentic Index: 0.0 / 58.9 / n/a
Artificial Analysis · Coding Index: 37.9 / 39.5 / n/a
Artificial Analysis · Quality Index: 29.4 / 46.8 / n/a
ARC-AGI: n/a / 65.3 / 44.7
ARC-AGI-2: n/a / 11.8 / 4.9
Chess Puzzles: n/a / 12.0 / 10.0
FrontierMath-2025-02-28-Private: n/a / 27.9 / 16.4
FrontierMath-Tier-4-2025-07-01-Private: n/a / 4.2 / 2.1
GPQA diamond: n/a / 83.5 / 83.8
OTIS Mock AIME 2024-2025: n/a / 92.2 / 80.0
PostTrainBench: n/a / 10.3 / 13.9
SimpleBench: n/a / 36.2 / 43.8
SWE-bench Verified: n/a / 73.8 / 72.1
Terminal Bench: n/a / 43.2 / 52.4
WeirdML: n/a / 45.6 / 48.2
Pricing · per 1M tokens · projected $/mo at 10M tokens
Model · Input · Output · Context · Projected $/mo
DeepSeek V3.2 Speciale · $0.40 · $1.20 · 164K tokens (~82 books) · $6.00
Kimi K2.5 · $0.44 · $2.00 · 262K tokens (~131 books) · $8.30
GLM 5 · $0.60 · $1.92 · 203K tokens (~101 books) · $9.30
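The Projected $/mo column is consistent with 10M tokens per month split 75% input / 25% output: 7.5M input tokens plus 2.5M output tokens reproduces $6.00, $8.30, and $9.30 exactly. That split is inferred from the numbers, not stated on the page. A small sketch under that assumption:

```python
# Sketch of the projected-monthly-cost column. A 75% input / 25% output split
# at 10M tokens/month reproduces the listed figures exactly, but that split is
# inferred from the numbers, not stated on the page.

PRICES = {
    # model: (input $/M tokens, output $/M tokens)
    "DeepSeek V3.2 Speciale": (0.40, 1.20),
    "Kimi K2.5": (0.44, 2.00),
    "GLM 5": (0.60, 1.92),
}

def projected_monthly_cost(input_price: float, output_price: float,
                           tokens_per_month: float = 10_000_000,
                           input_share: float = 0.75) -> float:
    input_tokens = tokens_per_month * input_share
    output_tokens = tokens_per_month * (1 - input_share)
    return (input_tokens / 1e6) * input_price + (output_tokens / 1e6) * output_price

for model, (inp, out) in PRICES.items():
    print(f"{model}: ${projected_monthly_cost(inp, out):.2f}/mo")
```

Shifting the assumed split toward output tokens widens the gap, since Kimi K2.5 and GLM 5 charge noticeably more per output token than DeepSeek V3.2 Speciale.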