How much does Grok 3 cost?

Grok 3 pricing information is not yet available.

What benchmarks has Grok 3 been tested on?

Grok 3 has been evaluated on 19 benchmarks. Top scores: MATH level 5: 88.8, HELM — IFEval: 88.4, HELM — WildBench: 84.9.

Is Grok 3 open source?

No, Grok 3 is a proprietary model by xAI.

How does Grok 3 compare to Qwen3 235B A22B Instruct 2507?

Grok 3 and Qwen3 235B A22B Instruct 2507 both show an average score of 44.8 at the displayed precision. Qwen3 235B A22B Instruct 2507 is listed as slightly outperforming Grok 3 overall, so the gap is smaller than one decimal place.

Grok 3

Name: Grok 3
Author: xAI

by xAI · Released Jan 2024

Avg score: 44.8
Rank: #150 (better than 45% of all models)
Context: N/A
Input $/1M: TBD
Output $/1M: TBD
Type: text
License: Proprietary
Benchmarks: 19 tested

About

Tested on 19 benchmarks with 45.5% average. Top scores: MATH level 5 (88.8%), HELM — IFEval (88.4%), HELM — WildBench (84.9%).

Capabilities

coding: 45.3 (#101 globally)
reasoning: 28.4 (#109 globally)
math: 38.9 (#129 globally)
knowledge: 62.6 (#46 globally)
agentic: 2.1 (#50 globally)
language: 88.4 (#20 globally)

Benchmark Scores

Tested on 19 benchmarks · Ranked across 6 categories

[Chart: score distribution across all 274 models, on a 0 to 100 axis, with a marker at Grok 3's position]

coding

Aider polyglot

Multi-language code editing from Aider. Tests editing ability across Python, JavaScript, TypeScript, Java, C++, Go, Rust, and more.

53.3

WeirdML

Unusual and adversarial machine learning challenges. Tests robustness of reasoning about edge cases in ML systems.

37.2

reasoning

HELM — WildBench

Stanford HELM WildBench evaluation. Tests reasoning on challenging real-world tasks.

84.9

SimpleBench

Deceptively simple questions that humans find easy but AI models often get wrong. Tests common sense and reasoning gaps.

23.3

ARC-AGI

Abstraction and Reasoning Corpus. Tests fluid intelligence through novel visual pattern recognition puzzles. Core measure of general intelligence.

5.5

math

MATH level 5

Competition-level math from AMC, AIME, and olympiad problems. Level 5 is the hardest tier, requiring creative problem-solving.

88.8

OTIS Mock AIME 2024-2025

Mock AIME (American Invitational Mathematics Exam) problems from OTIS. Tests mathematical competition performance.

55.5

HELM — Omni-MATH

Stanford HELM evaluation of mathematical reasoning across diverse problem types.

46.4

Score legend: Excellent (85+), Good (70-85), Average (50-70), Below (<50)
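
The score bands in the legend can be applied to any benchmark score with a small lookup. The sketch below is illustrative only: the function name `score_band` is hypothetical, and the handling of exact boundary values (85, 70, and 50 fall into the higher band) is an assumption, since the legend does not say which side a boundary belongs to.

```python
def score_band(score: float) -> str:
    """Map a 0-100 benchmark score to a legend band name.

    Assumption: boundary values (85, 70, 50) belong to the higher band.
    """
    if score >= 85:
        return "Excellent"
    if score >= 70:
        return "Good"
    if score >= 50:
        return "Average"
    return "Below"


# A few scores taken from the benchmark list on this page.
scores = {
    "MATH level 5": 88.8,
    "OTIS Mock AIME 2024-2025": 55.5,
    "Aider polyglot": 53.3,
    "WeirdML": 37.2,
    "ARC-AGI": 5.5,
}

bands = {name: score_band(value) for name, value in scores.items()}
for name, band in bands.items():
    print(f"{name}: {scores[name]} ({band})")
```

Under these assumptions, only MATH level 5 lands in the Excellent band among the scores sampled here.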

Model Family · xAI Grok 3

Grok 3 (Jun 2025): score 38.4, $3.00/M input, 131K context, 13 benchmarks
Grok 3 Beta (Apr 2025): score 69.5 (+31.1), $3.00/M input, 131K context, 6 benchmarks
Grok 3 Mini (Jun 2025): score 46.6 (-22.9), $0.30/M input (-2.70), 131K context, 11 benchmarks
Grok 3 Mini Beta (Apr 2025): score 64.8 (+18.2), $0.30/M input, 131K context, 7 benchmarks
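
The signed score changes shown next to each family entry appear to be differences from the entry listed immediately before it (for example, 69.5 - 38.4 = 31.1). The page does not state this rule, so the sketch below is an assumption that happens to reproduce the displayed values; the variable names are hypothetical.

```python
# Family entries in the order they are listed, with their scores.
family = [
    ("Grok 3", 38.4),
    ("Grok 3 Beta", 69.5),
    ("Grok 3 Mini", 46.6),
    ("Grok 3 Mini Beta", 64.8),
]

# Assumed rule: each delta is the score minus the previous entry's score,
# rounded to one decimal place to match the displayed precision.
deltas = [
    round(current - previous, 1)
    for (_, previous), (_, current) in zip(family, family[1:])
]

for (name, score), delta in zip(family[1:], deltas):
    print(f"{name}: {score} ({delta:+.1f})")
```

The same rule matches the price change shown for Grok 3 Mini: 0.30 - 3.00 = -2.70.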

Similar Models

Qwen3 235B A22B Instruct 2507

BenchGecko API: grok-3

Specifications

Type: text
Context: N/A
Released: Jan 2024
License: Proprietary
Status: benchmark-only

Available On

xAI: TBD

Frequently Asked Questions

Grok 3 is a proprietary text AI model by xAI, released in January 2024. It has an average benchmark score of 44.8.

Benchmarks

MATH level 5 HELM — IFEval HELM — WildBench HELM — MMLU-Pro Lech Mazur Writing
