Grok 4

by xAI · Released Jul 2025

Multimodal

Average score: 62.2
Rank: #59 (better than 75% of all models)

Context: 256K tokens (~128 books)
Input $/1M: $3.00
Output $/1M: $15.00
Type: multimodal
License: Proprietary
Benchmarks: 24 tested

Data updated today

About

Grok 4 is xAI's latest reasoning model with a 256k context window. It supports parallel tool calling, structured outputs, and both image and text inputs. Note that reasoning is not...

Tested on 24 benchmarks with 54.8% average. Top scores: HELM — IFEval (94.9%), Fiction.LiveBench (94.4%), HELM — MMLU-Pro (85.1%).

Looking for similar performance at lower cost?
MiniMax M2.5 scores 61.7 (99% as good) at $0.15/1M input · 95% cheaper
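
The "99% as good" and "95% cheaper" figures can be reproduced from the listed scores and input prices. The sketch below assumes both are plain ratios rounded to the nearest whole percent; the page does not state its formula, and the function names are illustrative only. Note that it compares input prices only, since no output price is listed for MiniMax M2.5.

```python
def relative_score(candidate: float, reference: float) -> float:
    """Candidate score expressed as a percentage of the reference score."""
    return 100.0 * candidate / reference


def cost_saving(candidate_price: float, reference_price: float) -> float:
    """Percentage by which the candidate's price undercuts the reference's."""
    return 100.0 * (reference_price - candidate_price) / reference_price


# Figures taken from the listing (average score, $ per 1M input tokens).
grok4_score, grok4_input_price = 62.2, 3.00
minimax_score, minimax_input_price = 61.7, 0.15

print(f"{relative_score(minimax_score, grok4_score):.0f}% as good")
print(f"{cost_saving(minimax_input_price, grok4_input_price):.0f}% cheaper")
```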

Capabilities

coding: 48.9 (#71 globally)
reasoning: 53.7 (#43 globally)
math: 41.5 (#103 globally)
knowledge: 62.8 (#35 globally)
agentic: 15.2 (#25 globally)
language: 94.9 (#2 globally)

Benchmark Scores

Tested on 24 benchmarks · Ranked across 6 categories

Score Distribution (all 233 models): chart on a 0 to 100 axis, with Grok 4's position marked.

coding

Multi-language code editing from Aider. Tests editing ability across Python, JavaScript, TypeScript, Java, C++, Go, Rust, and more.

Score: 79.6

Unusual and adversarial machine learning challenges. Tests robustness of reasoning about edge cases in ML systems.

Score: 45.7

Capture-the-flag cybersecurity challenges. Tests vulnerability analysis, reverse engineering, cryptography, and exploitation skills.

Score: 43.0

reasoning

HELM — WildBench

Stanford HELM WildBench evaluation. Tests reasoning on challenging real-world tasks.

Score: 79.7

Abstraction and Reasoning Corpus. Tests fluid intelligence through novel visual pattern recognition puzzles. Core measure of general intelligence.

Score: 66.7

Deceptively simple questions that humans find easy but AI models often get wrong. Tests common sense and reasoning gaps.

Score: 52.6

math

OTIS Mock AIME 2024-2025

Mock AIME (American Invitational Mathematics Exam) problems from OTIS. Tests mathematical competition performance.

Score: 84.0

HELM — Omni-MATH

Stanford HELM evaluation of mathematical reasoning across diverse problem types.

Score: 60.3

FrontierMath-2025-02-28-Private

Original research-level math problems created by professional mathematicians. Problems are unpublished and cannot be memorized.

Score: 19.7

Score bands: Excellent (85+), Good (70-85), Average (50-70), Below (<50)
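
As a rough illustration of how this legend maps a score to a band, here is a minimal sketch. The legend's ranges overlap at 70 and 85, so the code assumes each boundary value belongs to the higher band; that choice, and the function name `score_band`, are assumptions rather than anything the page specifies.

```python
def score_band(score: float) -> str:
    """Map a 0-100 score to a legend band (boundaries go to the higher band)."""
    if score >= 85:
        return "Excellent"
    if score >= 70:
        return "Good"
    if score >= 50:
        return "Average"
    return "Below"


# Capability scores as listed on this page.
capabilities = {
    "coding": 48.9,
    "reasoning": 53.7,
    "math": 41.5,
    "knowledge": 62.8,
    "agentic": 15.2,
    "language": 94.9,
}

for name, score in capabilities.items():
    print(f"{name}: {score} ({score_band(score)})")
```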

Model Family · xAI Grok 4

Grok 4 · $3.00/M input · 256K context · 24 benchmarks

Grok 4 Fast · Sep 2025 · $0.20/M input (-2.80) · 2.0M context (+1.7M) · 6 benchmarks

Recently Happened

Grok 4 posted 89.4% on GPQA Diamond

Mar 8, 2026

Similar Models

GPT-3.5 Turbo (older v0613)

BenchGecko API

grok-4

Specifications

Type: multimodal
Context: 256K tokens (~128 books)
Released: Jul 2025
License: Proprietary
Status: Active
Cost / Message: ~$0.021
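
The page does not say how the ~$0.021 per-message figure is derived. One combination that reproduces it at the listed rates is 2,000 input tokens plus 1,000 output tokens; that token mix is an assumption chosen for illustration, not a documented methodology, and `message_cost` is a hypothetical helper.

```python
INPUT_PRICE_PER_M = 3.00    # $ per 1M input tokens (from the listing)
OUTPUT_PRICE_PER_M = 15.00  # $ per 1M output tokens (from the listing)


def message_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request at the listed per-million-token rates."""
    total = input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    return total / 1_000_000


# Assumed token counts; picked only because they match the listed figure.
print(f"${message_cost(2_000, 1_000):.3f}")
```

Other token mixes give the same total (for example, 7,000 input tokens and no output), so the figure alone does not pin down the page's assumption.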

Available On

xAI: $3.00 (input, per 1M tokens)

Categories

coding, reasoning, math, knowledge, agentic, language

Frequently Asked Questions

Grok 4 is a proprietary multimodal AI model by xAI, released in July 2025. It has an average benchmark score of 62.2. Context window: 256K tokens.

Related Models

Qwen2-72B (Alibaba Qwen), GPT-3.5 Turbo (older v0613) (OpenAI), MiniMax M2.5 (minimax), WizardLM-2 8x22B (Microsoft), GPT-5 Mini (OpenAI)
