GLM 4.7 vs GPT-5.1

Side by side · benchmarks, pricing, and signals you can act on.

Winner summary

GPT-5.1 wins 9 of 11 shared benchmarks, with leads in agentic, knowledge, and math.
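
The 9-of-11 tally can be reproduced from the scores listed on this page. The sketch below is a minimal illustration that assumes a higher score is better on every benchmark; `SCORES` and `tally_wins` are hypothetical names chosen for this example, not part of the site.

```python
# Scores copied from this page's benchmark table: (GLM 4.7, GPT-5.1).
SCORES = {
    "APEX-Agents": (3.1, 17.5),
    "Chatbot Arena Elo · Coding": (1439.2, 1338.8),
    "Chatbot Arena Elo · Overall": (1442.7, 1438.5),
    "Chess Puzzles": (6.0, 32.0),
    "FrontierMath-2025-02-28-Private": (2.4, 31.0),
    "FrontierMath-Tier-4-2025-07-01-Private": (0.1, 12.5),
    "GPQA diamond": (77.8, 83.5),
    "OTIS Mock AIME 2024-2025": (83.3, 88.6),
    "SimpleBench": (37.2, 43.8),
    "SimpleQA Verified": (31.5, 48.9),
    "Terminal Bench": (33.4, 47.6),
}


def tally_wins(scores):
    """Count the benchmarks on which each model has the strictly higher score."""
    wins = {"GLM 4.7": 0, "GPT-5.1": 0}
    for glm, gpt in scores.values():
        if glm > gpt:
            wins["GLM 4.7"] += 1
        elif gpt > glm:
            wins["GPT-5.1"] += 1
    return wins


wins = tally_wins(SCORES)
print(wins)  # {'GLM 4.7': 2, 'GPT-5.1': 9}
```

The two GLM 4.7 wins are both Chatbot Arena Elo entries, which is consistent with the arena category lead.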

Category leads

agentic · GPT-5.1
arena · GLM 4.7
knowledge · GPT-5.1
math · GPT-5.1
reasoning · GPT-5.1
coding · GPT-5.1

Hype vs Reality

Attention vs performance

GLM 4.7

#93 by perf·no signal

QUIET

GPT-5.1

#97 by perf·no signal

QUIET

Best value

GLM 4.7

5.4x better value than GPT-5.1

GLM 4.7

47.6 pts/$

$1.06/M

GPT-5.1

8.8 pts/$

$5.63/M
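
The per-million prices and the 5.4x figure can be checked with a little arithmetic. The page does not say how the $/M figure is derived; the sketch below assumes it is the unweighted mean of the input and output prices from this page's pricing table, an assumption that happens to reproduce both published values. `blended_price` is a hypothetical helper name.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def blended_price(input_per_m: str, output_per_m: str) -> Decimal:
    """Unweighted mean of input and output price per 1M tokens (assumed formula)."""
    mean = (Decimal(input_per_m) + Decimal(output_per_m)) / 2
    return mean.quantize(CENT, rounding=ROUND_HALF_UP)


glm_blended = blended_price("0.38", "1.74")
gpt_blended = blended_price("1.25", "10.00")

# Value ratio taken directly from the published pts/$ figures.
value_ratio = (Decimal("47.6") / Decimal("8.8")).quantize(
    Decimal("0.1"), rounding=ROUND_HALF_UP
)

print(glm_blended, gpt_blended, value_ratio)  # 1.06 5.63 5.4
```

`Decimal` with `ROUND_HALF_UP` is used so that 5.625 rounds to 5.63 as shown on the page; binary floats with default formatting would print 5.62.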

Vendor risk

Who is behind the model

z-ai

private · undisclosed

Unknown

OpenAI

$840.0B·Tier 1

Medium risk

Head to head

11 benchmarks · 2 models

GLM 4.7 · GPT-5.1

APEX-Agents

GPT-5.1 leads by +14.4

APEX-Agents · evaluates AI agents on complex, multi-step tasks requiring planning, tool use, and autonomous decision-making in realistic environments.

GLM 4.7

3.1

GPT-5.1

17.5

Chatbot Arena Elo · Coding

GLM 4.7 leads by +100.4

GLM 4.7

1439.2

GPT-5.1

1338.8

Chatbot Arena Elo · Overall

GLM 4.7 leads by +4.2

GLM 4.7

1442.7

GPT-5.1

1438.5
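
To put the two Arena gaps in perspective, an Elo difference can be converted to an expected head-to-head win rate. The sketch below uses the standard Elo expected-score formula (base 10, scale 400); whether the leaderboard behind these numbers uses exactly this scale is an assumption, and `elo_expected_score` is a hypothetical helper name.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# GLM 4.7 vs GPT-5.1, ratings as listed on this page.
coding = elo_expected_score(1439.2, 1338.8)
overall = elo_expected_score(1442.7, 1438.5)

print(f"coding: {coding:.3f}, overall: {overall:.3f}")
```

Under that assumption the +100.4 coding gap corresponds to roughly a 64% expected preference rate for GLM 4.7, while the +4.2 overall gap is close to a coin flip at about 51%.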

Chess Puzzles

GPT-5.1 leads by +26.0

Chess Puzzles · tests strategic and tactical reasoning by having models solve chess puzzle positions, evaluating lookahead and pattern recognition abilities.

GLM 4.7

6.0

GPT-5.1

32.0

FrontierMath-2025-02-28-Private

GPT-5.1 leads by +28.6

FrontierMath (Feb 2025) · original research-level math problems created by mathematicians, testing capabilities at the boundary of current AI mathematical reasoning.

GLM 4.7

2.4

GPT-5.1

31.0

FrontierMath-Tier-4-2025-07-01-Private

GPT-5.1 leads by +12.4

FrontierMath Tier 4 (Jul 2025) · the most challenging tier of frontier mathematics, containing problems that push the absolute limits of AI mathematical reasoning.

GLM 4.7

0.1

GPT-5.1

12.5

GPQA diamond

GPT-5.1 leads by +5.7

Graduate-Level Google-Proof QA (Diamond set) · expert-crafted questions in physics, biology, and chemistry that are difficult even for domain PhDs.

GLM 4.7

77.8

GPT-5.1

83.5

OTIS Mock AIME 2024-2025

GPT-5.1 leads by +5.3

OTIS Mock AIME 2024-2025 · simulated American Invitational Mathematics Examination problems testing advanced problem-solving skills.

GLM 4.7

83.3

GPT-5.1

88.6

SimpleBench

GPT-5.1 leads by +6.6

SimpleBench · tests fundamental reasoning capabilities with straightforward problems designed to expose gaps in basic logical and spatial thinking.

GLM 4.7

37.2

GPT-5.1

43.8

SimpleQA Verified

GPT-5.1 leads by +17.4

SimpleQA Verified · short factual questions with verified answers, measuring factual accuracy and the tendency to hallucinate or provide incorrect information.

GLM 4.7

31.5

GPT-5.1

48.9

Terminal Bench

GPT-5.1 leads by +14.2

Terminal-Bench 2.0 · evaluates AI agents on real terminal-based coding tasks: writing scripts, debugging, running tests, and managing projects entirely through command-line interaction. Tests both code quality and terminal fluency. For reference, Claude Opus 4.7 scores 69.4% on this benchmark.

GLM 4.7

33.4

GPT-5.1

47.6

Full benchmark table

Benchmark	GLM 4.7	GPT-5.1
APEX-Agents	3.1	17.5
Chatbot Arena Elo · Coding	1439.2	1338.8
Chatbot Arena Elo · Overall	1442.7	1438.5
Chess Puzzles	6.0	32.0
FrontierMath-2025-02-28-Private	2.4	31.0
FrontierMath-Tier-4-2025-07-01-Private	0.1	12.5
GPQA diamond	77.8	83.5
OTIS Mock AIME 2024-2025	83.3	88.6
SimpleBench	37.2	43.8
SimpleQA Verified	31.5	48.9
Terminal Bench	33.4	47.6

Pricing · per 1M tokens · projected $/mo at 10M tokens

Model	Input ($/1M)	Output ($/1M)	Context	Projected $/mo
GLM 4.7	$0.38	$1.74	203K tokens (~101 books)	$7.20
GPT-5.1	$1.25	$10.00	400K tokens (~200 books)	$34.38
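
The projected monthly figures can be reproduced from the per-token prices. The page does not state the input/output mix behind the 10M-token projection; the sketch below assumes a 75% input / 25% output split, which was inferred from the numbers and matches both rows. `projected_monthly_cost` and its parameters are hypothetical names.

```python
from decimal import Decimal, ROUND_HALF_UP


def projected_monthly_cost(
    input_per_m: str,
    output_per_m: str,
    total_m_tokens: str = "10",   # monthly volume in millions of tokens
    input_share: str = "0.75",    # assumed share of tokens that are input
) -> Decimal:
    """Monthly cost in dollars for a given volume and input/output mix."""
    total = Decimal(total_m_tokens)
    share = Decimal(input_share)
    cost = (
        Decimal(input_per_m) * total * share
        + Decimal(output_per_m) * total * (1 - share)
    )
    return cost.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


glm_cost = projected_monthly_cost("0.38", "1.74")
gpt_cost = projected_monthly_cost("1.25", "10.00")

print(glm_cost, gpt_cost)  # 7.20 34.38
```

Because output tokens cost several times more than input tokens for both models, changing `input_share` to match a real workload can shift these projections considerably.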