
BBH (HuggingFace)

Updated 2025-07-24
Models tested · 73
Top score · 61.9 (Qwen2.5 72B Instruct)
Median · 22.9 (min 1.0)
Top-5 spread · σ 2.4 (competitive)
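
The page does not say how the top-5 spread is computed. A minimal sketch, assuming it is the population standard deviation of the five highest scores on the leaderboard (an assumption that happens to reproduce the reported σ 2.4; the variable names are illustrative):

```python
import statistics

# Five highest scores from the leaderboard on this page.
top5_scores = [61.9, 60.5, 56.6, 56.5, 55.9]

# Assumption: "spread" means population standard deviation (divide by n).
# The sample standard deviation (divide by n - 1) gives a larger value.
top5_spread = statistics.pstdev(top5_scores)

print(f"top-5 spread: {top5_spread:.1f}")
```

A small spread means the leading models are closely matched, which is presumably what the "competitive" label refers to.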

Best score over time · one chart, every benchmark

[Chart: BBH (HuggingFace) · 48 models · frontier running max. Y axis: score (0 to 100). X axis: release date (Apr 24 to Jul 25). Source: benchgecko.ai/benchmark/hf-bbh]
Frontier on BBH (HuggingFace) rose from 48.7 to 61.9 in 5 months · +13.2 points · latest leader Qwen2.5 72B Instruct from Alibaba Qwen.
Pink dots = frontier records · 5 total
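
A "frontier running max" can be read as the sequence of models that each set a new best score at the time of their release. The sketch below shows one way to derive such records. The release dates, model names, and scores are hypothetical placeholders, not BenchGecko data, and `frontier_records` is an illustrative helper rather than anything the site exposes.

```python
from datetime import date

def frontier_records(releases):
    """Return the releases that set a new best score, in release order."""
    records = []
    best = float("-inf")
    # Sorting tuples orders them by release date first.
    for released, name, score in sorted(releases):
        if score > best:
            best = score
            records.append((released, name, score))
    return records

# Hypothetical releases: placeholder names, dates, and scores.
releases = [
    (date(2024, 4, 1), "model-a", 40.0),
    (date(2024, 5, 1), "model-b", 35.0),
    (date(2024, 6, 1), "model-c", 45.5),
    (date(2024, 8, 1), "model-d", 45.0),
    (date(2024, 9, 1), "model-e", 52.0),
]

records = frontier_records(releases)
# Total frontier gain: last record minus first record.
gain = records[-1][2] - records[0][2]
```

Only strictly higher scores count as records here, so a model that ties the current best does not appear as a new frontier point.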

Where models cluster

Score distribution · models per score bucket (0 to 100) · median 22.9
Score bucket · Models
0–10 · 23
10–20 · 9
20–30 · 12
30–40 · 11
40–50 · 9
50–60 · 7
60–70 · 2
70–80 · 0
80–90 · 0
90–100 · 0
Source: benchgecko.ai
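
The distribution and the median can be recomputed from the 73 scores listed on this page. The sketch below assumes each bucket includes its lower bound and excludes its upper bound (so a score of exactly 10.0 would fall in 10–20), which the page does not state.

```python
import statistics

# The 73 scores transcribed from the leaderboard on this page, highest first.
scores = [
    61.9, 60.5, 56.6, 56.5, 55.9, 55.3, 53.8, 52.3, 51.9, 49.3,
    48.7, 48.6, 48.6, 44.2, 44.2, 42.1, 41.3, 40.7, 38.7, 37.8,
    36.8, 36.6, 35.9, 35.8, 35.8, 35.5, 35.5, 34.9, 30.7, 29.2,
    28.9, 28.2, 28.0, 25.8, 25.3, 24.1, 22.9, 22.0, 21.9, 21.1,
    20.4, 19.8, 18.4, 18.0, 17.1, 15.2, 14.2, 13.7, 12.5, 10.3,
    8.3, 8.2, 7.9, 7.9, 7.8, 7.7, 7.5, 6.5, 5.9, 5.3,
    4.7, 4.7, 4.5, 4.0, 3.7, 3.4, 3.3, 2.9, 2.8, 2.7,
    2.7, 2.2, 1.0,
]

# Ten buckets of width 10; a score of exactly 100 is folded into the last one.
bucket_counts = [0] * 10
for score in scores:
    index = min(int(score // 10), 9)
    bucket_counts[index] += 1

median_score = statistics.median(scores)

for i, count in enumerate(bucket_counts):
    print(f"{i * 10}-{(i + 1) * 10}: {count}")
print(f"median: {median_score}")
```

With these inputs, nearly a third of the models (23 of 73) land in the lowest bucket, which is why the median sits well below the midpoint of the observed range.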

Pearson r · original research

Correlation analysis

Benchmarks that track with BBH (HuggingFace)

Pearson correlation across models scored on both benchmarks. Closer to 1 = strongly predictive.
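
For readers who want to reproduce this kind of comparison, here is a minimal sketch of the Pearson correlation between two benchmarks over the models they share. The paired scores are hypothetical placeholders, and `pearson_r` is an illustrative helper, not BenchGecko's implementation.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical scores for the same four models on two benchmarks.
benchmark_a = [10.0, 20.0, 30.0, 40.0]
benchmark_b = [15.0, 22.0, 41.0, 48.0]

r = pearson_r(benchmark_a, benchmark_b)
print(f"Pearson r = {r:.2f}")
```

Only models scored on both benchmarks should enter the two lists, in the same order, since the coefficient is computed over matched pairs.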

73 models tested · sorted by score

# · Model (provider) · Score
1 · Qwen2.5 72B Instruct (Alibaba Qwen) · 61.9
2 · Qwen2.5 72B Instruct Abliterated · 60.5
3 · Llama 3.3 70B Instruct (Meta) · 56.6
4 · Qwen2.5 32B Instruct (Alibaba) · 56.5
5 · Llama 3.1 70B Instruct (Meta) · 55.9
6 · Phi 4 (Microsoft) · 55.3
7 · Hermes 3 70B Instruct (nousresearch) · 53.8
8 · Qwen2.5 Coder 32B Instruct (Alibaba Qwen) · 52.3
9 · Qwen2-72B (Alibaba Qwen) · 51.9
10 · Gemma 2 27B (Google DeepMind) · 49.3
11 · Meta Llama 3 8B (Meta) · 48.7
12 · WizardLM-2 8x22B (Microsoft) · 48.6
13 · Qwen2.5 14B Instruct (Alibaba) · 48.6
14 · Qwen2.5 Coder 14B Instruct (Alibaba) · 44.2
15 · Dolphin 2.9.1 Yi 1.5 34b · 44.2
16 · Gemma 2 9B (Google DeepMind) · 42.1
17 · Stable Beluga 2 · 41.3
18 · DeepSeek R1 Distill Qwen 14B (DeepSeek) · 40.7
19 · Phi 4 Mini Instruct (Microsoft) · 38.7
20 · Qwen2 7B Instruct (Alibaba) · 37.8
21 · Phi 3.5 Mini Instruct (Microsoft) · 36.8
22 · Phi 3 Mini 4k Instruct (Microsoft) · 36.6
23 · Qwen2 VL 7B Instruct (Alibaba) · 35.9
24 · R1 Distill Llama 70B (DeepSeek) · 35.8
25 · GLM 4 32B (z-ai) · 35.8
26 · Magnum v4 72B (anthracite-org) · 35.5
27 · Yi 6B · 35.5
28 · Qwen2.5 7B Instruct (Alibaba Qwen) · 34.9
29 · Hermes 2 Pro - Llama-3 8B (nousresearch) · 30.7
30 · Llama 3.1 8B Instruct (Meta) · 29.2
31 · Qwen2.5 Coder 7B Instruct (Alibaba Qwen) · 28.9
32 · Meta Llama 3 8B Instruct (Meta) · 28.2
33 · Phi 2 (Microsoft) · 28.0
34 · Qwen2.5 3B Instruct (Alibaba) · 25.8
35 · LLaMA-13B (Meta) · 25.3
36 · Llama 3.2 3B Instruct (Meta) · 24.1
37 · Mistral 7B Instruct V0.2 (Mistral AI) · 22.9
38 · Mistral 7B V0.1 (Mistral AI) · 22.0
39 · Falcon-180B (TII) · 21.9
40 · Gemma 2B (Google DeepMind) · 21.1
41 · StarCoder 2 15B · 20.4
42 · Qwen2.5 1.5B Instruct (Alibaba) · 19.8
43 · Llama 3 8B Instruct (Meta) · 18.4
44 · Gemma 2 2b It (Google DeepMind) · 18.0
45 · R1 Distill Qwen 32B (DeepSeek) · 17.1
46 · Vicuna 7b V1.5 · 15.2
47 · Llama 3.2 3B Instruct (free) (Meta) · 14.2
48 · Qwen2 1.5B Instruct (Alibaba) · 13.7
49 · Gemma 2 2b (Google DeepMind) · 12.5
50 · Llama 2 7b Hf (Meta) · 10.3
51 · Llama 3.2 1B Instruct (Meta) · 8.3
52 · Qwen2.5 0.5B Instruct (Alibaba) · 8.2
53 · Qwen2 0.5B (Alibaba) · 7.9
54 · DeepSeek R1 Distill Qwen 7B (DeepSeek) · 7.9
55 · Llama 3.1 405B (Meta) · 7.8
56 · Mistral 7B Instruct v0.1 (Mistral AI) · 7.7
57 · Phi-1.5 (Microsoft) · 7.5
58 · MPT-30B · 6.5
59 · Qwen2 0.5B Instruct (Alibaba) · 5.9
60 · DeepSeek R1 Distill Llama 8B (DeepSeek) · 5.3
61 · DeepSeek R1 Distill Qwen 1.5B (DeepSeek) · 4.7
62 · SmolLM2 135M Instruct · 4.7
63 · Llama 2 7b Chat Hf (Meta) · 4.5
64 · TinyLlama 1.1B Chat V1.0 · 4.0
65 · SmolLM2 135M · 3.7
66 · Gpt Neo 125m (eleutherai) · 3.4
67 · Gpt2 Large (OpenAI) · 3.3
68 · QwQ 32B (Alibaba Qwen) · 2.9
69 · Distilgpt2 · 2.8
70 · Gpt2 Medium (OpenAI) · 2.7
71 · Gpt2 (OpenAI) · 2.7
72 · Pythia 160m (eleutherai) · 2.2
73 · INTELLECT-1 · 1.0

Pulled from the BBH (HuggingFace) dataset · updated daily

What does BBH (HuggingFace) measure?

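BBH (HuggingFace) is a knowledge benchmark in the BenchGecko catalog. BBH stands for BIG-Bench Hard, a subset of particularly challenging tasks from the BIG-Bench suite. 73 AI models have been tested on it, with scores ranging from 1.0 to 61.9 out of 100.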

Which model leads on BBH (HuggingFace)?

Qwen2.5 72B Instruct from Alibaba Qwen leads BBH (HuggingFace) with a score of 61.9. The median score across 73 tested models is 22.9.

Is BBH (HuggingFace) saturated?

No · the top score is 61.9 out of 100 (62%). There is still meaningful room for improvement on BBH (HuggingFace).

Does BBH (HuggingFace) predict performance on other benchmarks?

Yes · BBH (HuggingFace) scores have a Pearson correlation of 0.94 with MMLU-PRO across 73 shared models, so models that do well on BBH (HuggingFace) tend to do well on MMLU-PRO.

How often is BBH (HuggingFace) data refreshed?

BenchGecko pulls updates daily. New model scores on BBH (HuggingFace) appear as soon as they are published by Epoch AI or the model provider.
