HELM · IFEval
The Frontier
Best score over time · one chart, every benchmark
Distribution
Where models cluster
Correlated benchmarks
Pearson r · original research
Benchmarks that track with HELM · IFEval
Pearson correlation across models scored on both benchmarks. Values closer to 1 indicate a stronger positive linear relationship between the two sets of scores.
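As a rough illustration of how such a coefficient could be computed, the sketch below restricts two score tables to the models they share and computes Pearson r over those pairs. The model names and scores are invented for the example, and `pearson_r` and `shared_model_correlation` are hypothetical helpers, not BenchGecko's actual pipeline.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / math.sqrt(var_x * var_y)


def shared_model_correlation(scores_a, scores_b):
    """Correlate two benchmarks over the models scored on both.

    Each argument maps model name -> score.
    Returns (r, number of shared models).
    """
    shared = sorted(set(scores_a) & set(scores_b))
    r = pearson_r([scores_a[m] for m in shared], [scores_b[m] for m in shared])
    return r, len(shared)


# Hypothetical scores, invented for illustration only.
benchmark_a = {"model-a": 95.0, "model-b": 90.0, "model-c": 85.0, "model-d": 80.0}
benchmark_b = {"model-a": 60.0, "model-b": 50.0, "model-c": 45.0, "model-e": 30.0}

r, n = shared_model_correlation(benchmark_a, benchmark_b)
print(f"r = {r:.2f} across {n} shared models")
```

Only models present in both tables contribute, so the number of shared models is worth reporting alongside r.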
Full rankings
34 models tested · sorted by score
| # | Model | Score |
|---|---|---|
| 1 | Grok 3 Mini Beta | 95.1 |
| 2 | | 94.9 |
| 3 | | 93.5 |
| 4 | | 93.2 |
| 5 | | 92.9 |
| 6 | | 92.7 |
| 7 | | 90.4 |
| 8 | | 89.8 |
| 9 | | 88.4 |
| 10 | | 87.6 |
| 11 | | 87.6 |
| 12 | | 87.5 |
| 13 | | 86.9 |
| 14 | | 85.6 |
| 15 | | 85.0 |
| 16 | | 84.3 |
| 17 | | 84.1 |
| 18 | | 84.0 |
| 19 | | 83.8 |
| 20 | | 83.7 |
| 21 | | 83.6 |
| 22 | | 83.4 |
| 23 | | 83.2 |
| 24 | | 83.1 |
| 25 | | 82.4 |
| 26 | | 82.3 |
| 27 | | 81.7 |
| 28 | | 81.0 |
| 29 | | 81.0 |
| 30 | | 79.2 |
| 31 | | 78.4 |
| 32 | | 78.2 |
| 33 | | 75.0 |
| 34 | | 73.2 |
Frequently asked
Pulled from the HELM · IFEval dataset · updated daily
What does HELM · IFEval measure?
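HELM · IFEval is listed as a knowledge benchmark in the BenchGecko catalog. IFEval (Instruction-Following Eval) tests whether a model follows verifiable instructions in a prompt, such as length or formatting constraints. 34 AI models have been tested on it. Scores range from 73.2 to 95.1 out of 100.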
Which model leads on HELM · IFEval?
Grok 3 Mini Beta from xAI leads HELM · IFEval with a score of 95.1. The median score across 34 tested models is 84.0.
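The headline figures can be re-derived from the rankings table. The sketch below uses the 34 scores exactly as listed there and computes the range and two common median conventions with Python's standard `statistics` module. Which convention BenchGecko uses is not stated on the page, so both are shown.

```python
import statistics

# The 34 scores from the rankings table, in rank order.
scores = [
    95.1, 94.9, 93.5, 93.2, 92.9, 92.7, 90.4, 89.8, 88.4, 87.6,
    87.6, 87.5, 86.9, 85.6, 85.0, 84.3, 84.1, 84.0, 83.8, 83.7,
    83.6, 83.4, 83.2, 83.1, 82.4, 82.3, 81.7, 81.0, 81.0, 79.2,
    78.4, 78.2, 75.0, 73.2,
]

count = len(scores)
top = max(scores)
bottom = min(scores)
median_mid = statistics.median(scores)      # mean of the two middle values
median_low = statistics.median_low(scores)  # lower of the two middle values

print(count, top, bottom, median_mid, median_low)
```

With an even count of 34, the conventional median is the midpoint of ranks 17 and 18 (84.1 and 84.0), about 84.05. The 84.0 quoted above matches the lower of those two values.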
Is HELM · IFEval saturated?
Nearly · the top model on HELM · IFEval has reached 95.1 out of 100, leaving 4.9 points of headroom below the maximum score. This benchmark is approaching saturation and may be replaced by a harder successor.
Does HELM · IFEval predict performance on other benchmarks?
Yes · HELM · IFEval scores correlate at r = 0.93 with Cybench scores across 5 shared models. Models that do well on HELM · IFEval tend to do well on Cybench, though with only 5 models in common the estimate is tentative.
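To gauge how much weight a correlation over 5 models can bear, one common approach is an approximate confidence interval from the Fisher z-transformation. The sketch below is an illustration under the usual assumption of roughly bivariate-normal scores. It is not something the page reports, and `pearson_confidence_interval` is a hypothetical helper.

```python
import math


def pearson_confidence_interval(r, n, z_crit=1.96):
    """Approximate confidence interval for Pearson r via Fisher's z.

    z_crit=1.96 gives a two-sided 95% interval.
    Requires n > 3 and -1 < r < 1.
    """
    if n <= 3:
        raise ValueError("Fisher interval needs more than 3 paired observations")
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    z = math.atanh(r)              # transform r to an approximately normal scale
    se = 1.0 / math.sqrt(n - 3)    # standard error on the z scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


low, high = pearson_confidence_interval(0.93, 5)
print(f"95% interval for r = 0.93 with n = 5: [{low:.2f}, {high:.2f}]")
```

Under these assumptions the interval runs from roughly 0.27 to just under 1, so the point estimate is compatible with anything from a weak to a near-perfect relationship.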
How often is HELM · IFEval data refreshed?
BenchGecko pulls updates daily. New model scores on HELM · IFEval appear in the next daily update after they are published by Epoch AI or the model provider.
More knowledge benchmarks
Same category · related evaluations