IFEval

Updated 2025-07-24
Models tested: 73
Top score: 90.0 (Llama 3.3 70B Instruct)
Median: 39.8 (min 6.0)
Top-5 spread: σ 2.1 (competitive)
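
The top-5 spread can be checked against the leaderboard rows on this page. This is a minimal sketch, assuming "spread" means the population standard deviation of the five highest scores; the page does not state the formula, so the name `spread` and the choice of `pstdev` are assumptions.

```python
from statistics import pstdev

# Top five scores as listed in the leaderboard on this page.
top5 = [90.0, 86.7, 86.4, 85.9, 83.5]

# Assumption: "spread" is the population standard deviation (divide by n).
spread = pstdev(top5)
print(round(spread, 1))
```

Under that assumption the result rounds to 2.1, matching the figure shown. The sample standard deviation (divide by n - 1) would round to 2.3 instead.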

Best score over time · one chart, every benchmark

[Chart: IFEval frontier running max · 48 models · y-axis: score (0 to 100) · x-axis: release date (Apr 24 to Jul 25) · benchgecko.ai/benchmark/hf-ifeval]
Frontier on IFEval rose from 74.1 to 90.0 in 8 months · +15.9 points · latest leader Llama 3.3 70B Instruct from Meta.
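
A "frontier running max" keeps only the models that beat every earlier-released model. The sketch below shows one way to compute it. The function name `frontier_records` is hypothetical, the scores are taken from the leaderboard on this page, and the release months are illustrative assumptions that do not appear on the page.

```python
def frontier_records(points):
    """Return the (date, model, score) entries that set a new running maximum.

    `points` must already be sorted by release date, oldest first.
    """
    records = []
    best = float("-inf")
    for date, model, score in points:
        if score > best:  # strictly better than every earlier release
            best = score
            records.append((date, model, score))
    return records


# Scores come from this page; release months are ASSUMED for illustration.
points = [
    ("2024-04", "Meta Llama 3 8B Instruct", 74.1),
    ("2024-07", "Llama 3.1 8B Instruct", 50.6),
    ("2024-07", "Llama 3.1 70B Instruct", 86.7),
    ("2024-09", "Qwen2.5 72B Instruct", 86.4),
    ("2024-12", "Llama 3.3 70B Instruct", 90.0),
]

records = frontier_records(points)
gain = round(records[-1][2] - records[0][2], 1)
print(len(records), gain)
```

On this five-row subset the function finds three records and a gain of 15.9 points. The page's own chart covers 48 dated models and reports 5 records, so the subset is not a reproduction of it.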
Pink dots mark frontier records (5 total).

Where models cluster

Score distribution · models per score bucket (0 to 100) · median 39.8 · benchgecko.ai

Bucket   Models
0–10     2
10–20    10
20–30    15
30–40    11
40–50    7
50–60    10
60–70    4
70–80    8
80–90    6
90–100   (no count shown)
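
The bucket counts and the median can be recomputed from the 73 scores in the leaderboard on this page. This is a sketch under one assumption: each bucket includes its upper edge, so a score of exactly 90.0 counts toward 80–90. That choice is inferred because it reproduces the chart; the page does not document its binning rule, and the helper name `bucket_index` is hypothetical.

```python
import math
from statistics import median

# All 73 scores from the leaderboard on this page, in rank order.
scores = [
    90.0, 86.7, 86.4, 85.9, 83.5, 81.4, 79.8, 76.6, 75.8, 74.4,
    74.1, 73.9, 73.8, 72.7, 69.1, 68.8, 64.8, 61.0, 58.1, 57.8,
    56.8, 56.7, 56.3, 55.0, 54.8, 53.6, 52.7, 50.6, 46.0, 44.9,
    44.8, 43.8, 43.4, 41.9, 40.4, 39.9, 39.8, 38.5, 38.2, 37.9,
    37.8, 34.6, 33.7, 32.6, 31.5, 30.5, 28.8, 27.8, 27.4, 26.6,
    25.3, 25.2, 24.0, 23.9, 23.5, 22.5, 22.1, 21.5, 20.5, 20.3,
    20.2, 19.1, 18.7, 18.2, 18.2, 18.1, 17.9, 17.6, 16.0, 14.3,
    13.4, 6.1, 6.0,
]


def bucket_index(score, width=10, n_buckets=10):
    """Map a score to a bucket whose upper edge is inclusive (ASSUMED rule)."""
    idx = math.ceil(score / width) - 1
    return min(max(idx, 0), n_buckets - 1)


counts = [0] * 10
for s in scores:
    counts[bucket_index(s)] += 1

for i, c in enumerate(counts):
    print(f"{i * 10}-{(i + 1) * 10}: {c}")
print("median:", median(scores))
```

With half-open buckets that exclude the upper edge, the 90.0 score would move into 90–100 and the 80–90 count would drop to 5, which does not match the chart.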

Pearson r · original research

Correlation analysis

Benchmarks that track with IFEval

Pearson correlation across models scored on both benchmarks. Values closer to 1 indicate a stronger positive linear relationship between the two sets of scores.
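
A Pearson correlation over "models scored on both benchmarks" can be sketched as follows. The function name `pearson_on_shared` and the input shape (one `{model: score}` dict per benchmark) are assumptions for illustration, and the sample scores are placeholders for hypothetical models, not real results.

```python
from math import sqrt


def pearson_on_shared(a, b):
    """Pearson r between two {model: score} dicts, using only shared models."""
    shared = sorted(set(a) & set(b))
    if len(shared) < 2:
        raise ValueError("need at least two shared models")
    xs = [a[m] for m in shared]
    ys = [b[m] for m in shared]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("scores are constant on one benchmark")
    return cov / sqrt(var_x * var_y), len(shared)


# Placeholder scores for hypothetical models; NOT real benchmark results.
bench_a = {"model-a": 80.0, "model-b": 60.0, "model-c": 40.0, "model-d": 20.0}
bench_b = {"model-a": 90.0, "model-b": 70.0, "model-c": 65.0, "model-e": 10.0}

r, n = pearson_on_shared(bench_a, bench_b)
print(f"r = {r:.2f} across {n} shared models")
```

Models that appear on only one benchmark (`model-d` and `model-e` here) are dropped before the calculation, so the number of shared models should be reported alongside r.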

73 models tested · sorted by score

#   Model                             Provider         Score
1   Llama 3.3 70B Instruct            Meta             90.0
2   Llama 3.1 70B Instruct            Meta             86.7
3   Qwen2.5 72B Instruct              Alibaba Qwen     86.4
4   Qwen2.5 72B Instruct Abliterated  -                85.9
5   Qwen2.5 32B Instruct              Alibaba          83.5
6   Qwen2.5 14B Instruct              Alibaba          81.4
7   Gemma 2 27B                       Google DeepMind  79.8
8   Hermes 3 70B Instruct             nousresearch     76.6
9   Qwen2.5 7B Instruct               Alibaba Qwen     75.8
10  Gemma 2 9B                        Google DeepMind  74.4
11  Meta Llama 3 8B Instruct          Meta             74.1
12  Llama 3.2 3B Instruct             Meta             73.9
13  Phi 4 Mini Instruct               Microsoft        73.8
14  Qwen2.5 Coder 32B Instruct        Alibaba Qwen     72.7
15  Qwen2.5 Coder 14B Instruct        Alibaba          69.1
16  Phi 4                             Microsoft        68.8
17  Qwen2.5 3B Instruct               Alibaba          64.8
18  Qwen2.5 Coder 7B Instruct         Alibaba Qwen     61.0
19  Llama 3.2 1B Instruct             Meta             58.1
20  Phi 3.5 Mini Instruct             Microsoft        57.8
21  Qwen2 7B Instruct                 Alibaba          56.8
22  Gemma 2 2b It                     Google DeepMind  56.7
23  Magnum v4 72B                     anthracite-org   56.3
24  Mistral 7B Instruct V0.2          Mistral AI       55.0
25  Phi 3 Mini 4k Instruct            Microsoft        54.8
26  Hermes 2 Pro - Llama-3 8B         nousresearch     53.6
27  WizardLM-2 8x22B                  Microsoft        52.7
28  Llama 3.1 8B Instruct             Meta             50.6
29  Qwen2 VL 7B Instruct              Alibaba          46.0
30  Mistral 7B Instruct v0.1          Mistral AI       44.9
31  Qwen2.5 1.5B Instruct             Alibaba          44.8
32  DeepSeek R1 Distill Qwen 14B      DeepSeek         43.8
33  R1 Distill Llama 70B              DeepSeek         43.4
34  R1 Distill Qwen 32B               DeepSeek         41.9
35  DeepSeek R1 Distill Qwen 7B       DeepSeek         40.4
36  Llama 2 7b Chat Hf                Meta             39.9
37  QwQ 32B                           Alibaba Qwen     39.8
38  Dolphin 2.9.1 Yi 1.5 34b          -                38.5
39  Qwen2-72B                         Alibaba Qwen     38.2
40  Stable Beluga 2                   -                37.9
41  DeepSeek R1 Distill Llama 8B      DeepSeek         37.8
42  DeepSeek R1 Distill Qwen 1.5B     DeepSeek         34.6
43  Qwen2 1.5B Instruct               Alibaba          33.7
44  Falcon-180B                       TII              32.6
45  Qwen2.5 0.5B Instruct             Alibaba          31.5
46  Yi 6B                             -                30.5
47  SmolLM2 135M Instruct             -                28.8
48  StarCoder 2 15B                   -                27.8
49  Phi 2                             Microsoft        27.4
50  Gemma 2B                          Google DeepMind  26.6
51  LLaMA-13B                         Meta             25.3
52  Llama 2 7b Hf                     Meta             25.2
53  Llama 3 8B Instruct               Meta             24.0
54  Mistral 7B V0.1                   Mistral AI       23.9
55  Vicuna 7b V1.5                    -                23.5
56  Qwen2 0.5B Instruct               Alibaba          22.5
57  Gpt2 Medium                       OpenAI           22.1
58  MPT-30B                           -                21.5
59  Gpt2 Large                        OpenAI           20.5
60  Phi-1.5                           Microsoft        20.3
61  Gemma 2 2b                        Google DeepMind  20.2
62  Gpt Neo 125m                      eleutherai       19.1
63  Qwen2 0.5B                        Alibaba          18.7
64  SmolLM2 135M                      -                18.2
65  Pythia 160m                       eleutherai       18.2
66  Llama 3.1 405B                    Meta             18.1
67  Gpt2                              OpenAI           17.9
68  INTELLECT-1                       -                17.6
69  Meta Llama 3 8B                   Meta             16.0
70  GLM 4 32B                         z-ai             14.3
71  Llama 3.2 3B Instruct (free)      Meta             13.4
72  Distilgpt2                        -                 6.1
73  TinyLlama 1.1B Chat V1.0          -                 6.0

Pulled from the IFEval dataset · updated daily

What does IFEval measure?

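IFEval (Instruction-Following Evaluation) measures how reliably a model follows verifiable instructions in a prompt, such as a minimum word count or a required keyword. BenchGecko lists it under the Knowledge category. 73 AI models have been tested on it, with scores ranging from 6.0 to 90.0 out of 100.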

Which model leads on IFEval?

Llama 3.3 70B Instruct from Meta leads IFEval with a score of 90.0. The median score across 73 tested models is 39.8.

Is IFEval saturated?

No · the top score is 90.0 out of 100 (90%). There is still meaningful room for improvement on IFEval.

Does IFEval predict performance on other benchmarks?

Yes, to a degree · IFEval scores correlate 0.85 (Pearson r) with GSM8K across 13 shared models. Models that do well on IFEval tend to do well on GSM8K, though 13 models is a small sample.

How often is IFEval data refreshed?

BenchGecko pulls updates daily. New model scores on IFEval appear as soon as they are published by Epoch AI or the model provider.
