Benchmark · Knowledge · Settled

BBH (HuggingFace)

Updated: 2025-07-24
Models tested: 73
Top score: 61.9 (Qwen2.5 72B Instruct)
Median: 22.9 (min 1.0)
Top-5 spread: σ 2.4 (Competitive)
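The summary figures above can be recomputed from the 73 scores listed on this page. The sketch below is illustrative only: the `summarize` helper and the `SCORES` name are hypothetical, and reading "σ" as the population standard deviation of the top five scores is an assumption (it happens to reproduce the displayed 2.4, but the page does not state its formula).

```python
from statistics import median, pstdev

# The 73 scores from the leaderboard on this page, in rank order.
SCORES = [
    61.9, 60.5, 56.6, 56.5, 55.9, 55.3, 53.8, 52.3, 51.9, 49.3,
    48.7, 48.6, 48.6, 44.2, 44.2, 42.1, 41.3, 40.7, 38.7, 37.8,
    36.8, 36.6, 35.9, 35.8, 35.8, 35.5, 35.5, 34.9, 30.7, 29.2,
    28.9, 28.2, 28.0, 25.8, 25.3, 24.1, 22.9, 22.0, 21.9, 21.1,
    20.4, 19.8, 18.4, 18.0, 17.1, 15.2, 14.2, 13.7, 12.5, 10.3,
    8.3, 8.2, 7.9, 7.9, 7.8, 7.7, 7.5, 6.5, 5.9, 5.3,
    4.7, 4.7, 4.5, 4.0, 3.7, 3.4, 3.3, 2.9, 2.8, 2.7,
    2.7, 2.2, 1.0,
]


def summarize(scores):
    """Hypothetical helper: recompute the summary card from raw scores."""
    ranked = sorted(scores, reverse=True)
    return {
        "models": len(ranked),
        "top": ranked[0],
        "median": median(ranked),
        "min": ranked[-1],
        # Assumption: sigma is the population std dev of the five best scores.
        "top5_sigma": round(pstdev(ranked[:5]), 1),
    }


summary = summarize(SCORES)
print(summary)
```

With an odd number of models, the median is simply the score at rank 37, which is Mistral 7B Instruct V0.2 at 22.9.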

Best score over time · one chart, every benchmark

[Chart: BBH (HuggingFace) · 44 models · frontier running max. Y-axis: score, 0 to 100 (higher is better). X-axis: release date, May 24 to Jul 25. Source: benchgecko.ai/benchmark/hf-bbh]
Frontier on BBH (HuggingFace) rose from 49.3 to 61.9 in 2 months · +12.6 points · latest leader Qwen2.5 72B Instruct from Alibaba Qwen.
Pink dots = frontier records · 4 total
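A "frontier running max" can be read as the sequence of models that each set a new best score at the time of their release. The sketch below shows one way to compute it. The `frontier` function is hypothetical, and the sample records are placeholder values chosen for illustration, not data from this leaderboard (the page lists scores but not release dates).

```python
from datetime import date


def frontier(releases):
    """Return the records that set a new best score, in release order.

    `releases` is an iterable of (release_date, model, score) tuples.
    A tie with the current best does not count as a new record.
    """
    best = float("-inf")
    records = []
    for released, model, score in sorted(releases, key=lambda r: r[0]):
        if score > best:
            best = score
            records.append((released, model, score))
    return records


# Placeholder data for illustration only; not taken from the leaderboard.
sample = [
    (date(2024, 8, 15), "model-e", 52.0),
    (date(2024, 1, 10), "model-a", 40.0),
    (date(2024, 3, 5), "model-b", 35.0),
    (date(2024, 6, 1), "model-d", 47.5),
    (date(2024, 4, 20), "model-c", 47.5),
]

records = frontier(sample)
gain = records[-1][2] - records[0][2]
for released, model, score in records:
    print(released.isoformat(), model, score)
print("frontier gain:", gain)
```

Each entry in `records` corresponds to one dot on a chart of this kind, and the gain is the difference between the last and first record.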

73 models tested · sorted by score

#   Organization     Model                             Score
1   Alibaba Qwen     Qwen2.5 72B Instruct              61.9
2   -                Qwen2.5 72B Instruct Abliterated  60.5
3   Meta             Llama 3.3 70B Instruct            56.6
4   Alibaba          Qwen2.5 32B Instruct              56.5
5   Meta             Llama 3.1 70B Instruct            55.9
6   Microsoft        Phi 4                             55.3
7   nousresearch     Hermes 3 70B Instruct             53.8
8   Alibaba Qwen     Qwen2.5 Coder 32B Instruct        52.3
9   Alibaba Qwen     Qwen2-72B                         51.9
10  Google DeepMind  Gemma 2 27B                       49.3
11  Meta             Meta Llama 3 8B                   48.7
12  Microsoft        WizardLM-2 8x22B                  48.6
13  Alibaba          Qwen2.5 14B Instruct              48.6
14  Alibaba          Qwen2.5 Coder 14B Instruct        44.2
15  -                Dolphin 2.9.1 Yi 1.5 34b          44.2
16  Google DeepMind  Gemma 2 9B                        42.1
17  -                Stable Beluga 2                   41.3
18  DeepSeek         DeepSeek R1 Distill Qwen 14B      40.7
19  Microsoft        Phi 4 Mini Instruct               38.7
20  Alibaba          Qwen2 7B Instruct                 37.8
21  Microsoft        Phi 3.5 Mini Instruct             36.8
22  Microsoft        Phi 3 Mini 4k Instruct            36.6
23  Alibaba          Qwen2 VL 7B Instruct              35.9
24  DeepSeek         R1 Distill Llama 70B              35.8
25  z-ai             GLM 4 32B                         35.8
26  anthracite-org   Magnum v4 72B                     35.5
27  -                Yi 6B                             35.5
28  Alibaba Qwen     Qwen2.5 7B Instruct               34.9
29  nousresearch     Hermes 2 Pro - Llama-3 8B         30.7
30  Meta             Llama 3.1 8B Instruct             29.2
31  Alibaba Qwen     Qwen2.5 Coder 7B Instruct         28.9
32  Meta             Meta Llama 3 8B Instruct          28.2
33  Microsoft        Phi 2                             28.0
34  Alibaba          Qwen2.5 3B Instruct               25.8
35  Meta             LLaMA-13B                         25.3
36  Meta             Llama 3.2 3B Instruct             24.1
37  Mistral AI       Mistral 7B Instruct V0.2          22.9
38  Mistral AI       Mistral 7B V0.1                   22.0
39  TII              Falcon-180B                       21.9
40  Google DeepMind  Gemma 2B                          21.1
41  -                StarCoder 2 15B                   20.4
42  Alibaba          Qwen2.5 1.5B Instruct             19.8
43  Meta             Llama 3 8B Instruct               18.4
44  Google DeepMind  Gemma 2 2b It                     18.0
45  DeepSeek         R1 Distill Qwen 32B               17.1
46  -                Vicuna 7b V1.5                    15.2
47  Meta             Llama 3.2 3B Instruct (free)      14.2
48  Alibaba          Qwen2 1.5B Instruct               13.7
49  Google DeepMind  Gemma 2 2b                        12.5
50  Meta             Llama 2 7b Hf                     10.3
51  Meta             Llama 3.2 1B Instruct             8.3
52  Alibaba          Qwen2.5 0.5B Instruct             8.2
53  Alibaba          Qwen2 0.5B                        7.9
54  DeepSeek         DeepSeek R1 Distill Qwen 7B       7.9
55  Meta             Llama 3.1 405B                    7.8
56  Mistral AI       Mistral 7B Instruct v0.1          7.7
57  Microsoft        Phi-1.5                           7.5
58  -                MPT-30B                           6.5
59  Alibaba          Qwen2 0.5B Instruct               5.9
60  DeepSeek         DeepSeek R1 Distill Llama 8B      5.3
61  DeepSeek         DeepSeek R1 Distill Qwen 1.5B     4.7
62  -                SmolLM2 135M Instruct             4.7
63  Meta             Llama 2 7b Chat Hf                4.5
64  -                TinyLlama 1.1B Chat V1.0          4.0
65  -                SmolLM2 135M                      3.7
66  eleutherai       Gpt Neo 125m                      3.4
67  OpenAI           Gpt2 Large                        3.3
68  Alibaba Qwen     QwQ 32B                           2.9
69  -                Distilgpt2                        2.8
70  OpenAI           Gpt2 Medium                       2.7
71  OpenAI           Gpt2                              2.7
72  eleutherai       Pythia 160m                       2.2
73  -                INTELLECT-1                       1.0

Same category · related evaluations