BBH (HuggingFace)
The Frontier
Best score over time · one chart, every benchmark
Distribution
Where models cluster
Correlated benchmarks
Pearson r · original research
Benchmarks that track with BBH (HuggingFace)
Pearson correlation computed across the models scored on both benchmarks. The closer the value is to 1, the more strongly a model's score on one benchmark predicts its score on the other.
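As a rough illustration of how such a figure could be computed, here is a minimal sketch that restricts two score tables to the models present in both and then applies the standard Pearson formula. The function name `pearson_on_shared`, the model names, and all scores below are hypothetical placeholders, not BenchGecko data or code.

```python
from math import sqrt


def pearson_on_shared(scores_a, scores_b):
    """Pearson r over the models that appear in both score tables.

    scores_a, scores_b: dicts mapping model name -> score (hypothetical shape).
    """
    # Only models scored on both benchmarks contribute to the correlation.
    shared = sorted(set(scores_a) & set(scores_b))
    if len(shared) < 2:
        raise ValueError("need at least two shared models")

    xs = [scores_a[m] for m in shared]
    ys = [scores_b[m] for m in shared]
    n = len(shared)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Sums of centered products and centered squares.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one side is constant")

    return cov / sqrt(var_x * var_y)


# Made-up scores for illustration only; "model-d" and "model-e" are dropped
# because each appears in just one of the two tables.
bbh_scores = {"model-a": 10.0, "model-b": 20.0, "model-c": 30.0, "model-d": 40.0}
other_scores = {"model-a": 15.0, "model-b": 25.0, "model-c": 35.0, "model-e": 50.0}

r = pearson_on_shared(bbh_scores, other_scores)
print(f"Pearson r over shared models: {r:.2f}")
```

Restricting to shared models first matters because a model missing from either table has no pair to compare, so the number of models behind each correlation can differ from benchmark to benchmark.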
Full rankings
73 models tested · sorted by score
Frequently asked
Pulled from the BBH (HuggingFace) dataset · updated daily
What does BBH (HuggingFace) measure?
BBH (HuggingFace) is listed as a knowledge benchmark in the BenchGecko catalog. BBH stands for BIG-Bench Hard. 73 AI models have been tested on it, with scores ranging from 1.0 to 61.9 out of 100.
Which model leads on BBH (HuggingFace)?
Qwen2.5 72B Instruct from Alibaba Qwen leads BBH (HuggingFace) with a score of 61.9. The median score across 73 tested models is 22.9.
Is BBH (HuggingFace) saturated?
No · the top score is 61.9 out of 100 (62%). There is still meaningful room for improvement on BBH (HuggingFace).
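The arithmetic behind this answer can be sketched as follows. The 61.9 top score and the 100-point scale come from the figures above; the 0.9 cutoff for calling a benchmark saturated and the function names are assumptions made for illustration, not BenchGecko's published rule.

```python
def headroom(top_score, max_score=100.0):
    """Fraction of the scale still above the best observed score."""
    return (max_score - top_score) / max_score


def is_saturated(top_score, max_score=100.0, threshold=0.9):
    """True when the best score reaches the (assumed) saturation threshold."""
    return top_score / max_score >= threshold


top = 61.9  # top BBH (HuggingFace) score quoted above
print(f"share of maximum reached: {top / 100.0:.0%}")
print(f"headroom remaining: {headroom(top):.1%}")
print(f"saturated under the assumed 0.9 cutoff: {is_saturated(top)}")
```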
Does BBH (HuggingFace) predict performance on other benchmarks?
Yes · BBH (HuggingFace) scores have a Pearson correlation of 0.94 with MMLU-PRO scores across the 73 models tested on both. Models that do well on BBH (HuggingFace) tend to do well on MMLU-PRO.
How often is BBH (HuggingFace) data refreshed?
BenchGecko pulls updates daily. New model scores on BBH (HuggingFace) appear as soon as they are published by Epoch AI or the model provider.