HELM · IFEval

Updated 2026-01-21
Models tested: 34
Top score: 95.1 (Grok 3 Mini Beta)
Median: 84.0 (min 73.2)
Top-5 spread: σ 0.9 (settled)

Best score over time · one chart, every benchmark

[Chart: HELM · IFEval frontier (running max), 30 models. Y-axis: score, 0 to 100, higher is better. X-axis: release date, Jul 24 to Jan 26. Source: benchgecko.ai/benchmark/helm-ifeval]
Frontier on HELM · IFEval rose from 78.2 to 95.1 in 9 months · +16.9 points · latest leader Grok 3 Mini Beta from xAI.
Pink dots = frontier records · 5 total
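The "frontier running max" in the chart can be read as: sort results by release date and keep each score that beats everything released before it. A minimal sketch of that idea follows. The function name, dates, and model names are hypothetical placeholders; only the endpoint scores 78.2 and 95.1 come from this page.

```python
from datetime import date

def frontier_records(results):
    """Return the (release_date, model, score) entries that set a new running maximum."""
    records = []
    best = float("-inf")
    # Walk results in release order; a record is any score above all earlier ones.
    for release, model, score in sorted(results, key=lambda r: r[0]):
        if score > best:
            best = score
            records.append((release, model, score))
    return records

# Hypothetical placeholder rows: dates and model names are invented for illustration.
sample = [
    (date(2024, 7, 1), "model-a", 78.2),
    (date(2024, 9, 1), "model-b", 75.0),   # below the frontier, so not a record
    (date(2024, 12, 1), "model-c", 88.0),
    (date(2025, 4, 1), "model-d", 95.1),
]

records = frontier_records(sample)
# Frontier gain is last record minus first record.
gain = round(records[-1][2] - records[0][2], 1)
print([model for _, model, _ in records], gain)
```

With the page's endpoints, the gain works out to 95.1 - 78.2 = 16.9 points, matching the caption.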

Where models cluster

[Chart: score distribution, number of models per score bucket (0 to 100). Median · 84.0. Source: benchgecko.ai]
70–80: 5 models
80–90: 22 models
90–100: 7 models
All buckets below 70 are empty.
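The bucketing behind the distribution can be sketched as follows. This assumes 10-point buckets that include their lower bound and exclude their upper bound, with a score of exactly 100 placed in the top bucket; the page does not state its boundary rule, and `bucket_label` is a hypothetical helper.

```python
def bucket_label(score, width=10, top=100):
    """Map a score to its bucket label, e.g. 84.0 -> '80–90' (assumed half-open buckets)."""
    if not 0 <= score <= top:
        raise ValueError("score must be between 0 and the top of the scale")
    # Clamp so that a perfect score falls in the last bucket instead of a new one.
    low = min(int(score // width) * width, top - width)
    return f"{low}–{low + width}"

# Values reported on this page: minimum, median, and top score.
print(bucket_label(73.2), bucket_label(84.0), bucket_label(95.1))

# The three non-empty buckets should account for every tested model.
counts = {"70–80": 5, "80–90": 22, "90–100": 7}
print(sum(counts.values()))
```

The bucket counts sum to 34, which agrees with the number of models tested.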

Pearson r · original research

34 models tested · sorted by score

Pulled from the HELM · IFEval dataset · updated daily

What does HELM · IFEval measure?

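HELM · IFEval is listed as a knowledge benchmark in the BenchGecko catalog. IFEval itself tests instruction following: whether a model obeys verifiable instructions such as length or formatting constraints. 34 AI models have been tested on it, with scores ranging from 73.2 to 95.1 out of 100.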

Which model leads on HELM · IFEval?

Grok 3 Mini Beta from xAI leads HELM · IFEval with a score of 95.1. The median score across 34 tested models is 84.0.

Is HELM · IFEval saturated?

Close to it · the top model on HELM · IFEval has reached 95.1 out of 100, leaving 4.9 points of headroom below the maximum score. The benchmark is approaching saturation and may be replaced by a harder successor.
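The saturation check above is simple arithmetic on the top score. A minimal sketch, assuming a 100-point ceiling and a 5% threshold taken from the wording on this page; the function names are hypothetical and BenchGecko's actual rule is not documented here.

```python
def headroom(top_score, ceiling=100.0):
    """Points remaining between the top score and the maximum possible score."""
    return round(ceiling - top_score, 1)

def is_near_saturation(top_score, ceiling=100.0, threshold_pct=5.0):
    """True when the top score is within threshold_pct percent of the ceiling (assumed rule)."""
    return (ceiling - top_score) / ceiling * 100 <= threshold_pct

print(headroom(95.1))            # top score reported on this page
print(is_near_saturation(95.1))  # within 5% of the ceiling
print(is_near_saturation(84.0))  # the median model is not
```

Under this rule the leader counts as near saturation while the median model, at 84.0, still has 16 points of headroom.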

Does HELM · IFEval predict performance on other benchmarks?

Yes · HELM · IFEval scores correlate strongly with Cybench scores (Pearson r = 0.93) across 5 shared models. Models that do well on HELM · IFEval tend to do well on Cybench, though with only 5 shared models the estimate is uncertain.
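To make the correlation figure concrete, here is a minimal sketch of how a Pearson r could be computed from paired scores, plus a rough confidence interval using the Fisher z-transformation. The interval assumes roughly bivariate-normal scores, which this page does not establish, and both function names are hypothetical. The per-model score pairs are not published on this page, so only the reported r = 0.93 and n = 5 are used.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length samples (undefined if either is constant)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def fisher_interval(r, n, z_crit=1.96):
    """Approximate 95% interval for r via the Fisher z-transformation (needs n >= 4)."""
    if n < 4:
        raise ValueError("Fisher interval needs at least 4 paired observations")
    z = math.atanh(r)
    half_width = z_crit / math.sqrt(n - 3)
    return math.tanh(z - half_width), math.tanh(z + half_width)

# Reported on this page: r = 0.93 across 5 shared models.
low, high = fisher_interval(0.93, 5)
print(round(low, 2), round(high, 2))
```

With so few shared models the interval is wide, so the 0.93 figure is better read as suggestive than as a firm estimate.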

How often is HELM · IFEval data refreshed?

BenchGecko pulls updates daily. New model scores on HELM · IFEval appear as soon as they are published by Epoch AI or the model provider.
