IFEval
The Frontier
Best score over time · one chart, every benchmark
Distribution
Where models cluster
Correlated benchmarks
Pearson r · original research
Benchmarks that track with IFEval
Pearson correlation across models scored on both benchmarks. Values closer to 1 indicate a stronger positive linear relationship between the two sets of scores.
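As a rough illustration of how a figure like this could be computed, the sketch below restricts two score tables to the models they share and then applies the standard Pearson formula. The function names, model names, and scores are hypothetical placeholders, not BenchGecko's actual pipeline or data.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / sqrt(var_x * var_y)


def correlate_benchmarks(scores_a, scores_b):
    """Correlate two {model: score} tables over the models present in both."""
    shared = sorted(set(scores_a) & set(scores_b))
    xs = [scores_a[m] for m in shared]
    ys = [scores_b[m] for m in shared]
    return pearson_r(xs, ys), len(shared)


# Hypothetical scores for illustration only (not real leaderboard data).
ifeval = {"model-a": 10.0, "model-b": 20.0, "model-c": 30.0, "model-d": 55.0}
gsm8k = {"model-a": 20.0, "model-b": 40.0, "model-c": 60.0, "model-e": 70.0}

r, n = correlate_benchmarks(ifeval, gsm8k)
print(f"r = {r:.2f} across {n} shared models")
```

Only models scored on both benchmarks enter the calculation, so the number of shared models is worth reporting alongside r: a coefficient from a small overlap is a much noisier estimate than one from a large overlap.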
Full rankings
73 models tested · sorted by score
Frequently asked
Pulled from the IFEval dataset · updated daily
What does IFEval measure?
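IFEval (Instruction-Following Eval) measures how well a model follows explicit, verifiable instructions in a prompt, such as length or formatting constraints. BenchGecko lists it in the knowledge category. 73 AI models have been tested on it, with scores ranging from 6.0 to 90.0 out of 100.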
Which model leads on IFEval?
Llama 3.3 70B Instruct from Meta leads IFEval with a score of 90.0. The median score across 73 tested models is 39.8.
Is IFEval saturated?
No · the top score is 90.0 out of 100, which leaves 10 points of headroom on IFEval.
Does IFEval predict performance on other benchmarks?
Yes, to a degree · IFEval scores have a Pearson correlation of 0.85 with GSM8K across the 13 models scored on both. Models that do well on IFEval tend to do well on GSM8K, though 13 models is a small sample.
How often is IFEval data refreshed?
BenchGecko pulls updates daily. New model scores on IFEval appear in the next daily update after they are published by Epoch AI or the model provider.