GLM 5 vs Step 3.5 Flash vs GLM 4.7

Side by side · benchmarks, pricing, and signals you can act on.

Winner summary

GLM 5 wins on 22/24 benchmarks

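GLM 5 leads or ties on 22 of the 24 benchmarks listed, counting its IFEval tie with Step 3.5 Flash. GLM 4.7 leads the other two (OpenCompass GPQA-Diamond and OTIS Mock AIME 2024-2025). Only 7 of the 24 include a Step 3.5 Flash score. GLM 5 leads in arena · math · knowledge.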

Category leads

arena · GLM 5
math · GLM 5
knowledge · GLM 5
language · GLM 5
coding · GLM 5
reasoning · GLM 5

Hype vs Reality

Attention vs performance

GLM 5

#55 by perf · #27 by attention

UNDERRATED

Step 3.5 Flash

#9 by perf · #11 by attention

DESERVED

GLM 4.7

#93 by perf · no signal

QUIET

Best value

Step 3.5 Flash

8.1x better value than GLM 4.7

GLM 5

45.7 pts/$

$1.26/M

Step 3.5 Flash

384.5 pts/$

$0.20/M

GLM 4.7

47.6 pts/$

$1.06/M
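The cards above can be checked with a little arithmetic. The sketch below is a minimal illustration, not the site's actual method: it assumes the blended $/M figure is the unweighted mean of the input and output prices from the pricing table on this page (an inference that happens to reproduce all three listed values), and it takes the pts/$ figures as given. The names `CARDS`, `blended_price`, and `value_ratio` are hypothetical.

```python
# Figures copied from this page: input and output price per 1M tokens,
# the listed blended price per 1M tokens, and the listed points per dollar.
CARDS = {
    "GLM 5": {"input": 0.60, "output": 1.92, "blended": 1.26, "pts_per_dollar": 45.7},
    "Step 3.5 Flash": {"input": 0.10, "output": 0.30, "blended": 0.20, "pts_per_dollar": 384.5},
    "GLM 4.7": {"input": 0.38, "output": 1.74, "blended": 1.06, "pts_per_dollar": 47.6},
}


def blended_price(input_price, output_price):
    """ASSUMPTION: blended price is the unweighted mean of input and output price."""
    return (input_price + output_price) / 2


def value_ratio(model_a, model_b):
    """How many times more points per dollar model_a delivers than model_b."""
    return CARDS[model_a]["pts_per_dollar"] / CARDS[model_b]["pts_per_dollar"]


for name, card in CARDS.items():
    inferred = round(blended_price(card["input"], card["output"]), 2)
    print(f"{name}: inferred ${inferred:.2f}/M, listed ${card['blended']:.2f}/M")

# Ratio behind the "better value than GLM 4.7" headline.
print(f"{value_ratio('Step 3.5 Flash', 'GLM 4.7'):.1f}x")
```

Under that assumption the inferred and listed blended prices agree for all three models, and 384.5 / 47.6 rounds to the 8.1x shown on the card.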

Vendor risk

Mixed exposure

One or more vendors flagged

z-ai

private · undisclosed

Unknown

StepFun

$5.0B · Tier 1

Higher risk

z-ai

private · undisclosed

Unknown

Head to head

24 benchmarks · 3 models

GLM 5 · Step 3.5 Flash · GLM 4.7

Chatbot Arena Elo · Overall

GLM 5 leads by +12.9

GLM 5

1455.6

Step 3.5 Flash

1391.4

GLM 4.7

1442.7

OpenCompass · AIME2025

GLM 5 leads by +0.1

GLM 5

95.8

Step 3.5 Flash

95.7

GLM 4.7

95.4

OpenCompass · GPQA-Diamond

GLM 4.7 leads by +1.6

GLM 5

85.3

Step 3.5 Flash

83.7

GLM 4.7

86.9

OpenCompass · HLE

GLM 5 leads by +2.7

GLM 5

28.1

Step 3.5 Flash

21.6

GLM 4.7

25.4

OpenCompass · IFEval

GLM 5 and Step 3.5 Flash tie at 93.2

GLM 5

93.2

Step 3.5 Flash

93.2

GLM 4.7

90.2

OpenCompass · LiveCodeBenchV6

GLM 5 leads by +2.3

GLM 5

86.2

Step 3.5 Flash

83.9

GLM 4.7

83.8

OpenCompass · MMLU-Pro

GLM 5 leads by +1.2

GLM 5

85.2

Step 3.5 Flash

83.5

GLM 4.7

84.0

Chatbot Arena Elo · Coding

GLM 5 leads by +1.8

GLM 5

1441.0

GLM 4.7

1439.2

Chess Puzzles

GLM 5 leads by +4.0

Chess Puzzles · tests strategic and tactical reasoning by having models solve chess puzzle positions, evaluating lookahead and pattern recognition abilities.

GLM 5

10.0

GLM 4.7

6.0

FrontierMath-2025-02-28-Private

GLM 5 leads by +14.0

FrontierMath (Feb 2025) · original research-level math problems created by mathematicians, testing capabilities at the boundary of current AI mathematical reasoning.

GLM 5

16.4

GLM 4.7

2.4

FrontierMath-Tier-4-2025-07-01-Private

GLM 5 leads by +2.0

FrontierMath Tier 4 (Jul 2025) · the most challenging tier of frontier mathematics, containing problems that push the absolute limits of AI mathematical reasoning.

GLM 5

2.1

GLM 4.7

0.1

GPQA diamond

GLM 5 leads by +6.0

Graduate-Level Google-Proof QA (Diamond set) · expert-crafted questions in physics, biology, and chemistry that are difficult even for domain PhDs.

GLM 5

83.8

GLM 4.7

77.8

LiveBench · Agentic Coding

GLM 5 leads by +13.3

GLM 5

55.0

GLM 4.7

41.7

LiveBench · Coding

GLM 5 leads by +0.5

GLM 5

73.6

GLM 4.7

73.1

LiveBench · Data Analysis

GLM 5 leads by +12.7

GLM 5

67.9

GLM 4.7

55.2

LiveBench · Instruction Following (IF)

GLM 5 leads by +19.7

GLM 5

55.3

GLM 4.7

35.7

LiveBench · Language

GLM 5 leads by +12.3

GLM 5

77.5

GLM 4.7

65.2

LiveBench · Mathematics

GLM 5 leads by +7.4

GLM 5

83.5

GLM 4.7

76.0

LiveBench · Overall

GLM 5 leads by +10.8

GLM 5

68.8

GLM 4.7

58.1

LiveBench · Reasoning

GLM 5 leads by +9.4

GLM 5

69.1

GLM 4.7

59.7

OTIS Mock AIME 2024-2025

GLM 4.7 leads by +3.3

OTIS Mock AIME 2024-2025 · simulated American Invitational Mathematics Examination problems testing advanced problem-solving skills.

GLM 5

80.0

GLM 4.7

83.3

PostTrainBench

GLM 5 leads by +6.4

GLM 5

13.9

GLM 4.7

7.5

SimpleBench

GLM 5 leads by +6.6

SimpleBench · tests fundamental reasoning capabilities with straightforward problems designed to expose gaps in basic logical and spatial thinking.

GLM 5

43.8

GLM 4.7

37.2

Terminal Bench

GLM 5 leads by +19.0

Terminal-Bench 2.0 · evaluates AI agents on real terminal-based coding tasks: writing scripts, debugging, running tests, and managing projects entirely through command-line interaction. Tests both code quality and terminal fluency. Claude Opus 4.7 scores 69.4%, demonstrating significant agentic terminal competence.

GLM 5

52.4

GLM 4.7

33.4

Full benchmark table

Benchmark	GLM 5	Step 3.5 Flash	GLM 4.7
Chatbot Arena Elo · Overall	1455.6	1391.4	1442.7
OpenCompass · AIME2025	95.8	95.7	95.4
OpenCompass · GPQA-Diamond	85.3	83.7	86.9
OpenCompass · HLE	28.1	21.6	25.4
OpenCompass · IFEval	93.2	93.2	90.2
OpenCompass · LiveCodeBenchV6	86.2	83.9	83.8
OpenCompass · MMLU-Pro	85.2	83.5	84.0
Chatbot Arena Elo · Coding	1441.0	—	1439.2
Chess Puzzles	10.0	—	6.0
FrontierMath-2025-02-28-Private	16.4	—	2.4
FrontierMath-Tier-4-2025-07-01-Private	2.1	—	0.1
GPQA diamond	83.8	—	77.8
LiveBench · Agentic Coding	55.0	—	41.7
LiveBench · Coding	73.6	—	73.1
LiveBench · Data Analysis	67.9	—	55.2
LiveBench · Instruction Following (IF)	55.3	—	35.7
LiveBench · Language	77.5	—	65.2
LiveBench · Mathematics	83.5	—	76.0
LiveBench · Overall	68.8	—	58.1
LiveBench · Reasoning	69.1	—	59.7
OTIS Mock AIME 2024-2025	80.0	—	83.3
PostTrainBench	13.9	—	7.5
SimpleBench	43.8	—	37.2
Terminal Bench	52.4	—	33.4
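
The win counts and "leads by" margins can be recomputed from the table. The following is a rough sketch under stated assumptions: a missing score (—) is treated as "not evaluated" and ignored, the lead margin is taken as the gap between the top score and the runner-up, and an exact tie is reported separately rather than credited to one model. `ROWS`, `leaders`, and `margin` are hypothetical names, not part of any published tooling.

```python
from collections import Counter

MODELS = ("GLM 5", "Step 3.5 Flash", "GLM 4.7")

# Scores copied from the table above; None marks a missing (—) entry.
ROWS = {
    "Chatbot Arena Elo · Overall": (1455.6, 1391.4, 1442.7),
    "OpenCompass · AIME2025": (95.8, 95.7, 95.4),
    "OpenCompass · GPQA-Diamond": (85.3, 83.7, 86.9),
    "OpenCompass · HLE": (28.1, 21.6, 25.4),
    "OpenCompass · IFEval": (93.2, 93.2, 90.2),
    "OpenCompass · LiveCodeBenchV6": (86.2, 83.9, 83.8),
    "OpenCompass · MMLU-Pro": (85.2, 83.5, 84.0),
    "Chatbot Arena Elo · Coding": (1441.0, None, 1439.2),
    "Chess Puzzles": (10.0, None, 6.0),
    "FrontierMath-2025-02-28-Private": (16.4, None, 2.4),
    "FrontierMath-Tier-4-2025-07-01-Private": (2.1, None, 0.1),
    "GPQA diamond": (83.8, None, 77.8),
    "LiveBench · Agentic Coding": (55.0, None, 41.7),
    "LiveBench · Coding": (73.6, None, 73.1),
    "LiveBench · Data Analysis": (67.9, None, 55.2),
    "LiveBench · If": (55.3, None, 35.7),
    "LiveBench · Language": (77.5, None, 65.2),
    "LiveBench · Mathematics": (83.5, None, 76.0),
    "LiveBench · Overall": (68.8, None, 58.1),
    "LiveBench · Reasoning": (69.1, None, 59.7),
    "OTIS Mock AIME 2024-2025": (80.0, None, 83.3),
    "PostTrainBench": (13.9, None, 7.5),
    "SimpleBench": (43.8, None, 37.2),
    "Terminal Bench": (52.4, None, 33.4),
}


def leaders(scores):
    """Return every model sharing the top score (more than one means a tie)."""
    present = {m: s for m, s in zip(MODELS, scores) if s is not None}
    best = max(present.values())
    return [m for m, s in present.items() if s == best]


def margin(scores):
    """Gap between the best and second-best available score, to one decimal."""
    ranked = sorted((s for s in scores if s is not None), reverse=True)
    return round(ranked[0] - ranked[1], 1)


outright = Counter()
ties = []
for name, scores in ROWS.items():
    top = leaders(scores)
    if len(top) == 1:
        outright[top[0]] += 1
    else:
        ties.append(name)

print(dict(outright), ties)
print(margin(ROWS["Chatbot Arena Elo · Overall"]))
```

On these figures GLM 5 leads 21 benchmarks outright and ties one, which matches the 22/24 headline only if the IFEval tie is counted as a win. Margins recomputed from the one-decimal scores can differ by 0.1 from the listed figure (for example, LiveBench · If gives 19.6 against a listed +19.7), presumably because the page computes margins before rounding.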

Pricing · per 1M tokens · projected $/mo at 10M tokens

Model	Input ($/1M tokens)	Output ($/1M tokens)	Context window	Projected $/mo (10M tokens)
GLM 5	$0.60	$1.92	203K tokens (~101 books)	$9.30
Step 3.5 Flash	$0.10	$0.30	262K tokens (~131 books)	$1.50
GLM 4.7	$0.38	$1.74	203K tokens (~101 books)	$7.20
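
The page does not say how the 10M tokens are split between input and output. As a hedged sketch, the projected figures are reproduced exactly if 75% of the tokens are billed as input and 25% as output; that split is an inference from the numbers, not a documented rule, and `projected_monthly_cost` is a hypothetical helper.

```python
# Per-1M-token prices and listed projections copied from the table above.
PRICES = {
    "GLM 5": {"input": 0.60, "output": 1.92, "listed": 9.30},
    "Step 3.5 Flash": {"input": 0.10, "output": 0.30, "listed": 1.50},
    "GLM 4.7": {"input": 0.38, "output": 1.74, "listed": 7.20},
}


def projected_monthly_cost(input_price, output_price,
                           total_millions=10.0, input_share=0.75):
    """Monthly cost in dollars for `total_millions` million tokens.

    ASSUMPTION: input_share=0.75 (a 3:1 input-to-output mix) is inferred
    from the listed projections; the page itself does not state the split.
    """
    input_millions = total_millions * input_share
    output_millions = total_millions * (1 - input_share)
    return input_millions * input_price + output_millions * output_price


for name, p in PRICES.items():
    cost = projected_monthly_cost(p["input"], p["output"])
    print(f"{name}: computed ${cost:.2f}, listed ${p['listed']:.2f}")
```

A workload with a different mix, such as long generations from short prompts, would shift these totals toward the output price, so the `input_share` parameter is worth adjusting before relying on the projection.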
