GPT-5.1 vs GLM 4.7

Side by side · benchmarks, pricing, and signals you can act on.

Winner summary

GPT-5.1 wins on 9/11 benchmarks

GPT-5.1 wins 9 of 11 shared benchmarks and leads in the agentic, knowledge, and math categories.

Category leads

agentic · GPT-5.1
arena · GLM 4.7
knowledge · GPT-5.1
math · GPT-5.1
reasoning · GPT-5.1
coding · GPT-5.1

Hype vs Reality

Attention vs performance

GPT-5.1

#97 by perf · no signal

QUIET

GLM 4.7

#93 by perf · no signal

QUIET

Best value

GLM 4.7

5.4x better value than GPT-5.1

GPT-5.1

8.8 pts/$

$5.63/M

GLM 4.7

47.6 pts/$

$1.06/M
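
The value figures above can be roughly reproduced from the listed prices. A minimal sketch, assuming the blended $/M figure is the unweighted mean of the input and output prices in the pricing table; that assumption matches both listed values but is not stated on the page. The names `blended_price` and `value_ratio` are hypothetical helpers for illustration.

```python
from decimal import Decimal, ROUND_HALF_UP

# Per-1M-token prices in USD, copied from the pricing table.
PRICES = {
    "GPT-5.1": {"input": Decimal("1.25"), "output": Decimal("10.00")},
    "GLM 4.7": {"input": Decimal("0.38"), "output": Decimal("1.74")},
}

# Points-per-dollar figures as shown in the Best value block.
PTS_PER_DOLLAR = {"GPT-5.1": Decimal("8.8"), "GLM 4.7": Decimal("47.6")}


def blended_price(model: str) -> Decimal:
    """Assumed blend: unweighted mean of input and output price, to the cent."""
    p = PRICES[model]
    mean = (p["input"] + p["output"]) / 2
    return mean.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


def value_ratio(better: str, worse: str) -> Decimal:
    """Ratio of the two points-per-dollar figures, to one decimal place."""
    ratio = PTS_PER_DOLLAR[better] / PTS_PER_DOLLAR[worse]
    return ratio.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


for model in PRICES:
    print(f"{model}: ${blended_price(model)}/M")
print(f"GLM 4.7 vs GPT-5.1: {value_ratio('GLM 4.7', 'GPT-5.1')}x")
```

The half-up rounding is deliberate: the GPT-5.1 mean is exactly 5.625, which only rounds to the listed $5.63 under that mode.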

Vendor risk

Who is behind the model

OpenAI

$840.0B · Tier 1

Medium risk

z-ai

private · undisclosed

Unknown

Head to head

11 benchmarks · 2 models

GPT-5.1 · GLM 4.7

APEX-Agents

GPT-5.1 leads by +14.4

APEX-Agents · evaluates AI agents on complex, multi-step tasks requiring planning, tool use, and autonomous decision-making in realistic environments.

GPT-5.1

17.5

GLM 4.7

3.1

Chatbot Arena Elo · Coding

GLM 4.7 leads by +100.4

GPT-5.1

1338.8

GLM 4.7

1439.2

Chatbot Arena Elo · Overall

GLM 4.7 leads by +4.2

GPT-5.1

1438.5

GLM 4.7

1442.7

Chess Puzzles

GPT-5.1 leads by +26.0

Chess Puzzles · tests strategic and tactical reasoning by having models solve chess puzzle positions, evaluating lookahead and pattern recognition abilities.

GPT-5.1

32.0

GLM 4.7

6.0

FrontierMath-2025-02-28-Private

GPT-5.1 leads by +28.6

FrontierMath (Feb 2025) · original research-level math problems created by mathematicians, testing capabilities at the boundary of current AI mathematical reasoning.

GPT-5.1

31.0

GLM 4.7

2.4

FrontierMath-Tier-4-2025-07-01-Private

GPT-5.1 leads by +12.4

FrontierMath Tier 4 (Jul 2025) · the most challenging tier of frontier mathematics, containing problems that push the absolute limits of AI mathematical reasoning.

GPT-5.1

12.5

GLM 4.7

0.1

GPQA diamond

GPT-5.1 leads by +5.7

Graduate-Level Google-Proof QA (Diamond set) · expert-crafted questions in physics, biology, and chemistry that are difficult even for domain PhDs.

GPT-5.1

83.5

GLM 4.7

77.8

OTIS Mock AIME 2024-2025

GPT-5.1 leads by +5.3

OTIS Mock AIME 2024-2025 · simulated American Invitational Mathematics Examination problems testing advanced problem-solving skills.

GPT-5.1

88.6

GLM 4.7

83.3

SimpleBench

GPT-5.1 leads by +6.6

SimpleBench · tests fundamental reasoning capabilities with straightforward problems designed to expose gaps in basic logical and spatial thinking.

GPT-5.1

43.8

GLM 4.7

37.2

SimpleQA Verified

GPT-5.1 leads by +17.4

SimpleQA Verified · short factual questions with verified answers, measuring factual accuracy and the tendency to hallucinate or provide incorrect information.

GPT-5.1

48.9

GLM 4.7

31.5

Terminal Bench

GPT-5.1 leads by +14.2

Terminal-Bench 2.0 · evaluates AI agents on real terminal-based coding tasks: writing scripts, debugging, running tests, and managing projects entirely through command-line interaction. Tests both code quality and terminal fluency. Claude Opus 4.7 scores 69.4%, demonstrating significant agentic terminal competence.

GPT-5.1

47.6

GLM 4.7

33.4
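
The two Chatbot Arena gaps can be read as expected head-to-head preference rates. A minimal sketch, assuming the standard logistic Elo formula on a 400-point scale; the page does not say how its Arena ratings were fitted, so this is a rough interpretation rather than a reported result. `expected_score` is a hypothetical helper name.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of A against B (400-point logistic scale)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Arena ratings copied from the head-to-head entries.
ARENA = {
    "Coding": {"GPT-5.1": 1338.8, "GLM 4.7": 1439.2},
    "Overall": {"GPT-5.1": 1438.5, "GLM 4.7": 1442.7},
}

for board, ratings in ARENA.items():
    p = expected_score(ratings["GLM 4.7"], ratings["GPT-5.1"])
    print(f"{board}: GLM 4.7 expected to be preferred in about {p:.0%} of matchups")
```

Under that assumption, the 100.4-point Coding gap corresponds to roughly a 64% expected preference rate for GLM 4.7, while the 4.2-point Overall gap is close to a coin flip.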

Full benchmark table

Benchmark	GPT-5.1	GLM 4.7
APEX-Agents	17.5	3.1
Chatbot Arena Elo · Coding	1338.8	1439.2
Chatbot Arena Elo · Overall	1438.5	1442.7
Chess Puzzles	32.0	6.0
FrontierMath-2025-02-28-Private	31.0	2.4
FrontierMath-Tier-4-2025-07-01-Private	12.5	0.1
GPQA diamond	83.5	77.8
OTIS Mock AIME 2024-2025	88.6	83.3
SimpleBench	43.8	37.2
SimpleQA Verified	48.9	31.5
Terminal Bench	47.6	33.4
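
The 9/11 win count and the per-benchmark leads follow directly from the table. A short sketch that recomputes them, assuming higher is better on every benchmark and that there are no ties (true for the scores listed here); `compute_leads` is a hypothetical helper name.

```python
from collections import Counter

# (GPT-5.1 score, GLM 4.7 score), copied from the full benchmark table.
SCORES = {
    "APEX-Agents": (17.5, 3.1),
    "Chatbot Arena Elo · Coding": (1338.8, 1439.2),
    "Chatbot Arena Elo · Overall": (1438.5, 1442.7),
    "Chess Puzzles": (32.0, 6.0),
    "FrontierMath-2025-02-28-Private": (31.0, 2.4),
    "FrontierMath-Tier-4-2025-07-01-Private": (12.5, 0.1),
    "GPQA diamond": (83.5, 77.8),
    "OTIS Mock AIME 2024-2025": (88.6, 83.3),
    "SimpleBench": (43.8, 37.2),
    "SimpleQA Verified": (48.9, 31.5),
    "Terminal Bench": (47.6, 33.4),
}


def compute_leads(scores):
    """Map each benchmark to (leader, margin), assuming higher is better."""
    leads = {}
    for name, (gpt, glm) in scores.items():
        leader = "GPT-5.1" if gpt > glm else "GLM 4.7"
        leads[name] = (leader, round(abs(gpt - glm), 1))
    return leads


leads = compute_leads(SCORES)
wins = Counter(leader for leader, _ in leads.values())

for name, (leader, margin) in leads.items():
    print(f"{name}: {leader} leads by +{margin}")
print(f"GPT-5.1 wins {wins['GPT-5.1']} of {len(SCORES)}")
```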

Pricing · per 1M tokens · projected $/mo at 10M tokens

Model	Input ($/1M)	Output ($/1M)	Context	Projected $/mo (10M tokens)
GPT-5.1	$1.25	$10.00	400K tokens (~200 books)	$34.38
GLM 4.7	$0.38	$1.74	203K tokens (~101 books)	$7.20
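
The projected monthly figures are consistent with 10M tokens split 75% input and 25% output. That split is inferred from the listed totals and is not stated on the page, so the sketch below exposes it as a parameter; `projected_monthly_cost` is a hypothetical helper name.

```python
from decimal import Decimal, ROUND_HALF_UP

# Per-1M-token prices in USD, copied from the pricing table.
PRICES_PER_M = {
    "GPT-5.1": {"input": Decimal("1.25"), "output": Decimal("10.00")},
    "GLM 4.7": {"input": Decimal("0.38"), "output": Decimal("1.74")},
}


def projected_monthly_cost(
    model: str,
    tokens_m: Decimal = Decimal("10"),        # monthly volume, in millions of tokens
    input_share: Decimal = Decimal("0.75"),   # ASSUMPTION: inferred, not stated on the page
) -> Decimal:
    """Monthly cost in USD for a given token volume and input/output mix."""
    p = PRICES_PER_M[model]
    input_cost = tokens_m * input_share * p["input"]
    output_cost = tokens_m * (1 - input_share) * p["output"]
    return (input_cost + output_cost).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


for model in PRICES_PER_M:
    print(f"{model}: ${projected_monthly_cost(model)}/mo")
```

Changing `input_share` shows how sensitive the comparison is to workload shape: output-heavy usage widens the absolute gap because GPT-5.1's output price is the largest single rate in the table.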