Grok 4

by xAI · Released 2025-07-09

Average score: 54.8
Input price: $3.00/1M tokens
Output price: $15.00/1M tokens
Context window: 256K tokens
Type: multimodal
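The per-token prices above can be turned into a per-request cost estimate with simple arithmetic. The sketch below is illustrative only: the names `estimate_cost`, `INPUT_PRICE_PER_M` and `OUTPUT_PRICE_PER_M` are hypothetical, and it assumes flat per-token pricing exactly as listed. The provider's actual billing may differ (for example, discounts for cached input or different rates for very long requests), and this sketch ignores any such rules.

```python
# Hypothetical cost estimator based only on the listed per-million-token prices.
# Assumes flat pricing: no tiers, caching discounts, or long-context surcharges.
INPUT_PRICE_PER_M = 3.00    # USD per 1M input tokens (from the listing)
OUTPUT_PRICE_PER_M = 15.00  # USD per 1M output tokens (from the listing)


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost in USD of a single request."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    # Scale each token count by its per-million price, then convert to USD.
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000


if __name__ == "__main__":
    # A 10,000-token prompt with a 2,000-token reply.
    print(f"${estimate_cost(10_000, 2_000):.4f}")  # prints $0.0600
```

Because output tokens cost five times as much as input tokens here, a short prompt with a long reply can cost more than a long prompt with a short reply.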

Tested on 24 benchmarks, with an average score of 54.8%. Top scores: HELM — IFEval (94.9%), Fiction.LiveBench (94.4%) and HELM — MMLU-Pro (85.1%).

Benchmark results

Benchmark	Category	Score (%)
HELM — IFEval	language	94.9
Fiction.LiveBench	knowledge	94.4
HELM — MMLU-Pro	knowledge	85.1
OTIS Mock AIME 2024-2025	math	84.0
GPQA diamond	knowledge	82.7
Lech Mazur Writing	knowledge	80.7
HELM — WildBench	reasoning	79.7
Aider polyglot	coding	79.6
HELM — GPQA	knowledge	72.6
ARC-AGI	reasoning	66.7
HELM — Omni-MATH	math	60.3
SimpleBench	reasoning	52.6
DeepResearch Bench	knowledge	47.9
SimpleQA Verified	knowledge	47.9
WeirdML	coding	45.7
GeoBench	knowledge	45.0
Balrog	knowledge	43.6
Cybench	coding	43.0
Chess Puzzles	knowledge	28.0
Terminal Bench	coding	27.2
FrontierMath-2025-02-28-Private	math	19.7
ARC-AGI-2	reasoning	16.0
APEX-Agents	agentic	15.2
FrontierMath-Tier-4-2025-07-01-Private	math	2.1
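The 54.8% average quoted in the summary is consistent with an unweighted arithmetic mean of the 24 scores in the table (they sum to 1314.6, and 1314.6 / 24 = 54.775). The sketch below reproduces that figure and also groups scores by the listed category. It assumes every benchmark counts equally, since the listing does not state a weighting; the function names are hypothetical. Scores are parsed with `Decimal` so the mean is exact and the displayed rounding is not affected by binary floating-point error.

```python
from collections import defaultdict
from decimal import Decimal

# (benchmark, category, score in %) copied from the table above.
# Scores are strings parsed with Decimal so sums and means are exact.
RESULTS = [
    ("HELM — IFEval", "language", "94.9"),
    ("Fiction.LiveBench", "knowledge", "94.4"),
    ("HELM — MMLU-Pro", "knowledge", "85.1"),
    ("OTIS Mock AIME 2024-2025", "math", "84.0"),
    ("GPQA diamond", "knowledge", "82.7"),
    ("Lech Mazur Writing", "knowledge", "80.7"),
    ("HELM — WildBench", "reasoning", "79.7"),
    ("Aider polyglot", "coding", "79.6"),
    ("HELM — GPQA", "knowledge", "72.6"),
    ("ARC-AGI", "reasoning", "66.7"),
    ("HELM — Omni-MATH", "math", "60.3"),
    ("SimpleBench", "reasoning", "52.6"),
    ("DeepResearch Bench", "knowledge", "47.9"),
    ("SimpleQA Verified", "knowledge", "47.9"),
    ("WeirdML", "coding", "45.7"),
    ("GeoBench", "knowledge", "45.0"),
    ("Balrog", "knowledge", "43.6"),
    ("Cybench", "coding", "43.0"),
    ("Chess Puzzles", "knowledge", "28.0"),
    ("Terminal Bench", "coding", "27.2"),
    ("FrontierMath-2025-02-28-Private", "math", "19.7"),
    ("ARC-AGI-2", "reasoning", "16.0"),
    ("APEX-Agents", "agentic", "15.2"),
    ("FrontierMath-Tier-4-2025-07-01-Private", "math", "2.1"),
]


def average(scores):
    """Unweighted arithmetic mean: every benchmark counts equally."""
    scores = list(scores)
    return sum(scores, Decimal(0)) / len(scores)


def overall_average(rows):
    """Mean over all benchmarks, regardless of category."""
    return average(Decimal(score) for _, _, score in rows)


def category_averages(rows):
    """Mean score for each category label used in the table."""
    by_category = defaultdict(list)
    for _, category, score in rows:
        by_category[category].append(Decimal(score))
    return {cat: average(vals) for cat, vals in by_category.items()}


if __name__ == "__main__":
    print(f"{len(RESULTS)} benchmarks, average {overall_average(RESULTS):.1f}%")
    for category, avg in sorted(category_averages(RESULTS).items()):
        print(f"  {category:<10} {avg:.1f}%")
```

An unweighted mean lets categories with many entries (ten of the 24 rows are labelled "knowledge") dominate the headline number; averaging the per-category means instead would give each category equal weight.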