#	Model	Score	Price
1	Qwen2.5 72B Instruct· Alibaba Qwen	61.9	$0.12
2	Qwen2.5 72B Instruct Abliterated· HuiHui AI	60.5	—
3	Llama 3.3 70B Instruct· Meta	56.6	$0.10
4	Qwen2.5 32B Instruct· Alibaba	56.5	—
5	Llama 3.1 70B Instruct· Meta	55.9	$0.40
6	Phi 4· Microsoft	55.3	$0.07
7	Hermes 3 70B Instruct· nousresearch	53.8	$0.30
8	Qwen2.5 Coder 32B Instruct· Alibaba Qwen	52.3	$0.66
9	Qwen2-72B· Alibaba Qwen	51.9	—
10	Gemma 2 27B· Google DeepMind	49.3	$0.65
11	Meta Llama 3 8B· Meta	48.7	—
12	WizardLM-2 8x22B· Microsoft	48.6	$0.62
13	Qwen2.5 14B Instruct· Alibaba	48.6	—
14	Qwen2.5 Coder 14B Instruct· Alibaba	44.2	—
15	Dolphin 2.9.1 Yi 1.5 34b· DPHN	44.2	—
16	Gemma 2 9B· Google DeepMind	42.1	$0.03
17	Stable Beluga 2· Unknown	41.3	—
18	DeepSeek R1 Distill Qwen 14B· DeepSeek	40.7	—
19	Phi 4 Mini Instruct· Microsoft	38.7	—
20	Qwen2 7B Instruct· Alibaba	37.8	—
21	Phi 3.5 Mini Instruct· Microsoft	36.8	—
22	Phi 3 Mini 4k Instruct· Microsoft	36.6	—
23	Qwen2 VL 7B Instruct· Alibaba	35.9	—
24	R1 Distill Llama 70B· DeepSeek	35.8	$0.70
25	GLM 4 32B · z-ai	35.8	$0.10
26	Magnum v4 72B· anthracite-org	35.5	$3.00
27	Yi 6B· Unknown	35.5	—
28	Qwen2.5 7B Instruct· Alibaba Qwen	34.9	$0.04
29	Hermes 2 Pro - Llama-3 8B· nousresearch	30.7	$0.14
30	Llama 3.1 8B Instruct· Meta	29.2	$0.02
31	Qwen2.5 Coder 7B Instruct· Alibaba Qwen	28.9	$0.03
32	Meta Llama 3 8B Instruct· Meta	28.2	—
33	Phi 2· Microsoft	28.0	—
34	Qwen2.5 3B Instruct· Alibaba	25.8	—
35	LLaMA-13B· Meta	25.3	—
36	Llama 3.2 3B Instruct· Meta	24.1	$0.05
37	Mistral 7B Instruct V0.2· Mistral AI	22.9	—
38	Mistral 7B V0.1· Mistral AI	22.0	—
39	Falcon-180B· TII	21.9	—
40	Gemma 2B· Google DeepMind	21.1	—
41	StarCoder 2 15B· Unknown	20.4	—
42	Qwen2.5 1.5B Instruct· Alibaba	19.8	—
43	Llama 3 8B Instruct· Meta	18.4	$0.03
44	Gemma 2 2b It· Google DeepMind	18.0	—
45	R1 Distill Qwen 32B· DeepSeek	17.1	$0.29
46	Vicuna 7b V1.5· LMSYS	15.2	—
47	Llama 3.2 3B Instruct (free)· Meta	14.2	$0.00
48	Qwen2 1.5B Instruct· Alibaba	13.7	—
49	Gemma 2 2b· Google DeepMind	12.5	—
50	Llama 2 7b Hf· Meta	10.3	—
51	Llama 3.2 1B Instruct· Meta	8.3	$0.03
52	Qwen2.5 0.5B Instruct· Alibaba	8.2	—
53	Qwen2 0.5B· Alibaba	7.9	—
54	DeepSeek R1 Distill Qwen 7B· DeepSeek	7.9	—
55	Llama 3.1 405B· Meta	7.8	—
56	Mistral 7B Instruct v0.1· Mistral AI	7.7	$0.11
57	Phi-1.5· Microsoft	7.5	—
58	MPT-30B· Unknown	6.5	—
59	Qwen2 0.5B Instruct· Alibaba	5.9	—
60	DeepSeek R1 Distill Llama 8B· DeepSeek	5.3	—
61	DeepSeek R1 Distill Qwen 1.5B· DeepSeek	4.7	—
62	SmolLM2 135M Instruct· Hugging Face TB	4.7	—
63	Llama 2 7b Chat Hf· Meta	4.5	—
64	TinyLlama 1.1B Chat V1.0· TinyLlama	4.0	—
65	SmolLM2 135M· Hugging Face TB	3.7	—
66	Gpt Neo 125m· eleutherai	3.4	—
67	Gpt2 Large· OpenAI	3.3	—
68	QwQ 32B· Alibaba Qwen	2.9	$0.15
69	Distilgpt2· DistilBERT	2.8	—
70	Gpt2 Medium· OpenAI	2.7	—
71	Gpt2· OpenAI	2.7	—
72	Pythia 160m· eleutherai	2.2	—
73	INTELLECT-1· Unknown	1.0	—
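
For readers who want to work with the table programmatically, here is a minimal sketch of a row parser. It assumes the columns are tab-separated, that the model and provider share one cell split by a middle dot, and that a dash in the Price column means no price is listed. The `Row` class and `parse_row` function are hypothetical names introduced for illustration, not part of any BenchGecko tooling.

```python
from dataclasses import dataclass
from typing import Optional

NO_PRICE = "\u2014"  # the dash character the table uses when no price is listed


@dataclass
class Row:
    rank: int
    model: str
    provider: str
    score: float
    price: Optional[float]  # None when the Price cell holds only a dash


def parse_row(line: str) -> Row:
    """Parse one tab-separated leaderboard row (assumed layout, see lead-in)."""
    rank, name, score, price = line.rstrip("\n").split("\t")
    # Model and provider share one cell, separated by a middle dot.
    model, _, provider = name.partition("\u00b7")
    price = price.strip()
    return Row(
        rank=int(rank),
        model=model.strip(),
        provider=provider.strip(),
        score=float(score),
        price=None if price == NO_PRICE else float(price.lstrip("$")),
    )


# A few rows copied from the table above.
sample = [
    "1\tQwen2.5 72B Instruct\u00b7 Alibaba Qwen\t61.9\t$0.12",
    "4\tQwen2.5 32B Instruct\u00b7 Alibaba\t56.5\t\u2014",
    "5\tLlama 3.1 70B Instruct\u00b7 Meta\t55.9\t$0.40",
    "25\tGLM 4 32B \u00b7 z-ai\t35.8\t$0.10",
]
rows = [parse_row(line) for line in sample]
for row in rows:
    print(row)
```

Stripping whitespace around both halves of the name cell keeps the parser tolerant of rows such as GLM 4 32B, where the separator is padded on both sides.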

Frequently asked questions

Pulled from the BBH (HuggingFace) dataset · updated daily

What does BBH (HuggingFace) measure?

BBH (BIG-Bench Hard) is a suite of challenging tasks drawn from BIG-Bench; the BenchGecko catalog lists BBH (HuggingFace) as a knowledge benchmark. It has been run on 73 AI models, with scores ranging from 1.0 to 61.9 out of 100.

Which model leads on BBH (HuggingFace)?

Qwen2.5 72B Instruct from Alibaba Qwen leads BBH (HuggingFace) with a score of 61.9. The median score across 73 tested models is 22.9.
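
The count, range, and median quoted in this FAQ can be checked directly against the Score column of the table. The sketch below copies the 73 scores in rank order and uses only the Python standard library; the variable names are illustrative.

```python
from statistics import median

# Score column of the leaderboard above, in rank order (73 entries).
scores = [
    61.9, 60.5, 56.6, 56.5, 55.9, 55.3, 53.8, 52.3, 51.9, 49.3,
    48.7, 48.6, 48.6, 44.2, 44.2, 42.1, 41.3, 40.7, 38.7, 37.8,
    36.8, 36.6, 35.9, 35.8, 35.8, 35.5, 35.5, 34.9, 30.7, 29.2,
    28.9, 28.2, 28.0, 25.8, 25.3, 24.1, 22.9, 22.0, 21.9, 21.1,
    20.4, 19.8, 18.4, 18.0, 17.1, 15.2, 14.2, 13.7, 12.5, 10.3,
    8.3, 8.2, 7.9, 7.9, 7.8, 7.7, 7.5, 6.5, 5.9, 5.3,
    4.7, 4.7, 4.5, 4.0, 3.7, 3.4, 3.3, 2.9, 2.8, 2.7,
    2.7, 2.2, 1.0,
]

summary = {
    "count": len(scores),
    "min": min(scores),
    "max": max(scores),
    "median": median(scores),  # with 73 values this is the 37th-ranked score
}
print(summary)
# {'count': 73, 'min': 1.0, 'max': 61.9, 'median': 22.9}
```

With an odd number of models the median is a single row of the table: rank 37, Mistral 7B Instruct V0.2, at 22.9.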

Is BBH (HuggingFace) saturated?

No. The top score is 61.9 out of 100 (about 62%), so there is still meaningful room for improvement on BBH (HuggingFace).
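
The saturation check is simple arithmetic on the top score. The sketch below makes it explicit; the 90% cutoff is a hypothetical threshold chosen for illustration, since the page does not say what fraction of the maximum it treats as saturated.

```python
def fraction_of_max(top_score: float, max_score: float = 100.0) -> float:
    """Share of the maximum possible score reached by the best model."""
    return top_score / max_score


def is_saturated(top_score: float, max_score: float = 100.0,
                 threshold: float = 0.90) -> bool:
    """True when the best model reaches the (assumed) saturation threshold."""
    return fraction_of_max(top_score, max_score) >= threshold


top = 61.9  # leading BBH (HuggingFace) score from the table
percent = round(100 * fraction_of_max(top))
print(percent, is_saturated(top))
# 62 False
```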

Does BBH (HuggingFace) predict performance on other benchmarks?

Yes. BBH (HuggingFace) scores have a correlation of 0.94 with MMLU-PRO scores across 73 shared models, so models that do well on BBH (HuggingFace) tend to do well on MMLU-PRO.
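
The page does not state which correlation coefficient it reports. As a sketch under the assumption that it is the Pearson coefficient, the function below shows how such a figure could be computed from two score lists aligned by model. The input values are toy placeholders, not real benchmark scores, because MMLU-PRO scores are not listed on this page.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Toy placeholder values only: a perfectly linear relationship.
toy_a = [1.0, 2.0, 3.0, 4.0]
toy_b = [2.0, 4.0, 6.0, 8.0]
r = pearson(toy_a, toy_b)
print(r)
```

To reproduce a figure like the one quoted, both lists would need to hold the scores of the same 73 models in the same order. A value near 1 describes a strong linear association across those models, not a guarantee for any individual model.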

How often is BBH (HuggingFace) data refreshed?

BenchGecko pulls updates daily, so new model scores on BBH (HuggingFace) appear within a day of being published by Epoch AI or the model provider.