
Kimi K2.5 vs GLM 5 vs Step 3.5 Flash

Side by side · benchmarks, pricing, and signals you can act on.

Winner summary

Kimi K2.5 wins 14 of 22 shared benchmarks. Leads in math · knowledge · language.

Category leads
math · Kimi K2.5
knowledge · Kimi K2.5
language · Kimi K2.5
coding · GLM 5
speed · Kimi K2.5
reasoning · Kimi K2.5
arena · GLM 5
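How the 14-of-22 tally is derived, as a minimal sketch: the model with the highest score takes each benchmark row. That "highest score wins the row" rule is the natural reading of the tables below, not something the page spells out; the two rows shown are just the first entries of the full table.

```python
# Tally benchmark wins: highest score takes the row. Scores are the ones on
# this page; rows where a model has no score simply omit it.
scores = {
    "OpenCompass · AIME2025":     {"Kimi K2.5": 91.9, "GLM 5": 95.8, "Step 3.5 Flash": 95.7},
    "OpenCompass · GPQA-Diamond": {"Kimi K2.5": 88.1, "GLM 5": 85.3, "Step 3.5 Flash": 83.7},
    # ... remaining 20 rows elided; see the full benchmark table below.
}

wins: dict[str, int] = {}
for bench, row in scores.items():
    winner = max(row, key=row.get)  # highest score wins the row
    wins[winner] = wins.get(winner, 0) + 1
print(wins)  # with all 22 rows filled in: Kimi K2.5 14, GLM 5 8
```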
Hype vs Reality
Kimi K2.5 · #87 by perf · no signal · QUIET
GLM 5 · #55 by perf · #27 by attention · UNDERRATED
Step 3.5 Flash · #9 by perf · #11 by attention · DESERVED
Best value
Step 3.5 Flash · 8.4x better value than GLM 5 (the runner-up on pts/$)
Kimi K2.5 · 42.6 pts/$ · $1.22/M
GLM 5 · 45.7 pts/$ · $1.26/M
Step 3.5 Flash · 384.5 pts/$ · $0.20/M
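Where those numbers come from, as a minimal sketch: the $/M figures are a 1:1 input/output blend of the list prices in the pricing table below, and the 8.4x headline is the pts/$ ratio between the best and second-best rows. The quality score the site divides by price to get pts/$ is not disclosed on this page, so only the blend and the ratio are reproduced here.

```python
# Reproduces the blended $/M and the 8.4x headline from this card.
# The raw quality score behind "pts/$" is not disclosed, so those three
# figures are taken as given from the page.

prices = {  # $ per 1M tokens (input, output), from the pricing table below
    "Kimi K2.5":      (0.44, 2.00),
    "GLM 5":          (0.60, 1.92),
    "Step 3.5 Flash": (0.10, 0.30),
}
pts_per_dollar = {"Kimi K2.5": 42.6, "GLM 5": 45.7, "Step 3.5 Flash": 384.5}

for model, (inp, out) in prices.items():
    blended = (inp + out) / 2  # a 1:1 blend matches the $/M shown above
    print(f"{model}: ${blended:.2f}/M blended")  # 1.22, 1.26, 0.20

# Best value vs the runner-up on pts/$
ratio = pts_per_dollar["Step 3.5 Flash"] / pts_per_dollar["GLM 5"]
print(f"Step 3.5 Flash: {ratio:.1f}x better value than GLM 5")  # -> 8.4x
```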
Vendor risk
One or more vendors flagged.
moonshotai · private · undisclosed · Unknown
z-ai · private · undisclosed · Unknown
StepFun · $5.0B · Tier 1 · Higher risk
Head to head
Kimi K2.5 · GLM 5 · Step 3.5 Flash
OpenCompass · AIME2025
GLM 5 leads by +0.1
Kimi K2.5 91.9 · GLM 5 95.8 · Step 3.5 Flash 95.7
OpenCompass · GPQA-Diamond
Kimi K2.5 leads by +2.8
Kimi K2.5 88.1 · GLM 5 85.3 · Step 3.5 Flash 83.7
OpenCompass · HLE
Kimi K2.5 leads by +0.5
Kimi K2.5 28.6 · GLM 5 28.1 · Step 3.5 Flash 21.6
OpenCompass · IFEval
Kimi K2.5 leads by +0.7
Kimi K2.5 93.9 · GLM 5 93.2 · Step 3.5 Flash 93.2
OpenCompass · LiveCodeBenchV6
GLM 5 leads by +2.3
Kimi K2.5 80.6 · GLM 5 86.2 · Step 3.5 Flash 83.9
OpenCompass · MMLU-Pro
Kimi K2.5 leads by +1.0
Kimi K2.5 86.2 · GLM 5 85.2 · Step 3.5 Flash 83.5
Artificial Analysis · Agentic Index
Kimi K2.5 leads by +6.9
Artificial Analysis Agentic Index · a composite score measuring how well a model performs in agentic workflows · multi-step tool use, planning, error recovery, and autonomous task completion. Aggregates results from multiple agentic benchmarks including SWE-bench, tool-use tests, and planning evaluations. The canonical single-number metric for "how good is this model as an agent?"
Kimi K2.5 58.9 · Step 3.5 Flash 52.0
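The index described above folds several benchmarks into one number. Below is a sketch of one common recipe for such composites (min-max normalize each benchmark, then average); the benchmark names, bounds, and equal weights are illustrative assumptions, since Artificial Analysis's exact methodology isn't given on this page.

```python
# Illustrative composite index: normalize each benchmark to [0, 100], then
# average. The inputs below are placeholders, not Artificial Analysis data
# (only the SWE-bench value appears on this page; the rest are made up).

def normalize(score: float, lo: float, hi: float) -> float:
    """Min-max normalize a raw score into [0, 100]."""
    return 100.0 * (score - lo) / (hi - lo)

results = {             # hypothetical per-benchmark scores for one model
    "swe_bench": 73.8,  # Kimi K2.5's SWE-Bench Verified score from this page
    "tool_use": 61.0,   # made up for illustration
    "planning": 55.0,   # made up for illustration
}
bounds = {name: (0.0, 100.0) for name in results}  # assumed score ranges

index = sum(normalize(s, *bounds[n]) for n, s in results.items()) / len(results)
print(f"composite index: {index:.1f}")  # simple unweighted mean
```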
Artificial Analysis · Coding Index
Kimi K2.5 leads by +7.9
Artificial Analysis Coding Index · a composite score that aggregates performance across multiple coding benchmarks into a single index. Tracks code generation quality, debugging ability, multi-language competence, and real-world software engineering tasks. Used by Artificial Analysis to rank model coding capability in a normalized, comparable format. Useful for developers choosing between models for coding-heavy workloads.
Kimi K2.5 39.5 · Step 3.5 Flash 31.6
Artificial Analysis · Quality Index
Kimi K2.5 leads by +9.0
Kimi K2.5 46.8 · Step 3.5 Flash 37.8
ARC-AGI
Kimi K2.5 leads by +20.6
ARC-AGI · the original Abstraction and Reasoning Corpus, testing whether AI can solve novel visual pattern recognition tasks without memorization.
Kimi K2.5 65.3 · GLM 5 44.7
ARC-AGI-2
Kimi K2.5 leads by +6.9
ARC-AGI-2 · the second iteration of the Abstraction and Reasoning Corpus, testing novel pattern recognition and abstract reasoning without prior training data.
Kimi K2.5 11.8 · GLM 5 4.9
Chatbot Arena Elo · Overall
GLM 5 leads by +64.2
GLM 5 1455.6 · Step 3.5 Flash 1391.4
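An Elo gap maps to an expected head-to-head win rate via the standard Elo formula. That conversion is a general property of Elo ratings, not something this page states, so treat it as back-of-envelope context for the +64.2 gap.

```python
# Convert the Arena Elo gap into an expected win rate using the standard
# Elo expectation formula: E = 1 / (1 + 10^(-gap/400)).

def elo_win_prob(gap: float) -> float:
    """Expected win probability for the higher-rated model, given the Elo gap."""
    return 1.0 / (1.0 + 10.0 ** (-gap / 400.0))

gap = 1455.6 - 1391.4  # GLM 5 over Step 3.5 Flash: +64.2
print(f"GLM 5 expected win rate vs Step 3.5 Flash: {elo_win_prob(gap):.0%}")  # ~59%
```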
Chess Puzzles
Kimi K2.5 leads by +2.0
Chess Puzzles · tests strategic and tactical reasoning by having models solve chess puzzle positions, evaluating lookahead and pattern recognition abilities.
Kimi K2.5 12.0 · GLM 5 10.0
FrontierMath-2025-02-28-Private
Kimi K2.5 leads by +11.5
FrontierMath (Feb 2025) · original research-level math problems created by mathematicians, testing capabilities at the boundary of current AI mathematical reasoning.
Kimi K2.5 27.9 · GLM 5 16.4
FrontierMath-Tier-4-2025-07-01-Private
Kimi K2.5 leads by +2.1
FrontierMath Tier 4 (Jul 2025) · the most challenging tier of frontier mathematics, containing problems that push the absolute limits of AI mathematical reasoning.
Kimi K2.5 4.2 · GLM 5 2.1
GPQA diamond
GLM 5 leads by +0.3
Graduate-Level Google-Proof QA (Diamond set) · expert-crafted questions in physics, biology, and chemistry that are difficult even for domain PhDs.
Kimi K2.5 83.5 · GLM 5 83.8
OTIS Mock AIME 2024-2025
Kimi K2.5 leads by +12.2
OTIS Mock AIME 2024-2025 · simulated American Invitational Mathematics Examination problems testing advanced problem-solving skills.
Kimi K2.5 92.2 · GLM 5 80.0
PostTrainBench
GLM 5 leads by +3.6
Kimi K2.5 10.3 · GLM 5 13.9
SimpleBench
GLM 5 leads by +7.6
SimpleBench · tests fundamental reasoning capabilities with straightforward problems designed to expose gaps in basic logical and spatial thinking.
Kimi K2.5 36.2 · GLM 5 43.8
SWE-Bench verified
Kimi K2.5 leads by +1.7
SWE-bench Verified · 500 human-validated tasks from 12 real Python repositories (Django, Flask, scikit-learn, sympy, and others). Each task requires the model to produce a git patch that resolves a real GitHub issue and passes the test suite. The verified subset eliminates ambiguous tasks from the original SWE-bench. Claude Mythos Preview leads at 93.9%, crossing 90% for the first time in 2026. Opus 4.6 scores 80.8%. The benchmark remains the most-cited evaluation for code-generation capability.
Kimi K2.5 73.8 · GLM 5 72.1
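The description above boils down to a pass/fail loop per task: apply the model's git patch, then run the repo's test suite. A minimal sketch of that loop, assuming a checked-out task repo; this is illustrative, not the official SWE-bench harness.

```python
# Illustrative SWE-bench-style check: a task counts as resolved only if the
# model's patch applies cleanly AND the repository's tests pass afterwards.
import subprocess

def apply_patch(repo_dir: str, patch: str) -> bool:
    """Apply a model-generated git patch from stdin; True if it applies cleanly."""
    proc = subprocess.run(["git", "apply", "-"], cwd=repo_dir,
                          input=patch.encode(), capture_output=True)
    return proc.returncode == 0

def run_tests(repo_dir: str) -> bool:
    """Run the project's test suite (pytest assumed here); True if it passes."""
    return subprocess.run(["python", "-m", "pytest", "-q"], cwd=repo_dir).returncode == 0

def resolved(repo_dir: str, patch: str) -> bool:
    # Both conditions must hold for the task to count as resolved.
    return apply_patch(repo_dir, patch) and run_tests(repo_dir)
```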
Terminal Bench
GLM 5 leads by +9.2
Terminal-Bench 2.0 · evaluates AI agents on real terminal-based coding tasks · writing scripts, debugging, running tests, and managing projects entirely through command-line interaction. Tests both code quality and terminal fluency. Claude Opus 4.7 scores 69.4%, demonstrating significant agentic terminal competence.
Kimi K2.5 43.2 · GLM 5 52.4
WeirdML
GLM 5 leads by +2.6
WeirdML · tests models on unusual and adversarial machine learning tasks that require creative problem-solving beyond standard patterns.
Kimi K2.5 45.6 · GLM 5 48.2
Full benchmark table
Descriptions for each benchmark appear in the head-to-head section above.

Benchmark | Kimi K2.5 | GLM 5 | Step 3.5 Flash
OpenCompass · AIME2025 | 91.9 | 95.8 | 95.7
OpenCompass · GPQA-Diamond | 88.1 | 85.3 | 83.7
OpenCompass · HLE | 28.6 | 28.1 | 21.6
OpenCompass · IFEval | 93.9 | 93.2 | 93.2
OpenCompass · LiveCodeBenchV6 | 80.6 | 86.2 | 83.9
OpenCompass · MMLU-Pro | 86.2 | 85.2 | 83.5
Artificial Analysis · Agentic Index | 58.9 | — | 52.0
Artificial Analysis · Coding Index | 39.5 | — | 31.6
Artificial Analysis · Quality Index | 46.8 | — | 37.8
ARC-AGI | 65.3 | 44.7 | —
ARC-AGI-2 | 11.8 | 4.9 | —
Chatbot Arena Elo · Overall | — | 1455.6 | 1391.4
Chess Puzzles | 12.0 | 10.0 | —
FrontierMath-2025-02-28-Private | 27.9 | 16.4 | —
FrontierMath-Tier-4-2025-07-01-Private | 4.2 | 2.1 | —
GPQA diamond | 83.5 | 83.8 | —
OTIS Mock AIME 2024-2025 | 92.2 | 80.0 | —
PostTrainBench | 10.3 | 13.9 | —
SimpleBench | 36.2 | 43.8 | —
SWE-Bench verified | 73.8 | 72.1 | —
Terminal Bench | 43.2 | 52.4 | —
WeirdML | 45.6 | 48.2 | —
Pricing · per 1M tokens · projected $/mo at 10M tokens (3:1 input:output split)

Model | Input | Output | Context | Projected $/mo
Kimi K2.5 | $0.44 | $2.00 | 262K tokens (~131 books) | $8.30
GLM 5 | $0.60 | $1.92 | 203K tokens (~101 books) | $9.30
Step 3.5 Flash | $0.10 | $0.30 | 262K tokens (~131 books) | $1.50
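The Projected $/mo column is reproducible from the list prices above: 10M tokens per month at a 3:1 input:output split matches all three rows to the cent. The 3:1 split is inferred from the numbers, not stated by the page.

```python
# Reproduce the "Projected $/mo" column: 10M tokens/month, 3:1 input:output.
# The 3:1 split is an inference (it reproduces all three rows exactly),
# not a documented assumption of the page.

MONTHLY_TOKENS_M = 10   # millions of tokens per month
INPUT_SHARE = 0.75      # 3:1 input:output

prices = {  # $ per 1M tokens (input, output)
    "Kimi K2.5":      (0.44, 2.00),
    "GLM 5":          (0.60, 1.92),
    "Step 3.5 Flash": (0.10, 0.30),
}

for model, (inp, out) in prices.items():
    monthly = MONTHLY_TOKENS_M * (INPUT_SHARE * inp + (1 - INPUT_SHARE) * out)
    print(f"{model}: ${monthly:.2f}/mo")  # 8.30, 9.30, 1.50
```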