Benchmark · Reasoning · Competitive

BBH

BIG-Bench Hard · a curated subset of 23 challenging tasks from BIG-Bench where language models previously failed to outperform average humans.

Updated: 2024-12-26
Models tested: 24
Top score: 83.3 (DeepSeek V3)
Median: 45.4 (min 10.0)
Top-5 spread: σ 3.5 (Competitive)

Best score over time · one chart, every benchmark

[Chart: BBH · 3 models · frontier running max. Y-axis: score, 0 to 100. X-axis: release date, Jul 24 to Dec 24. Source: benchgecko.ai/benchmark/bbh]
Frontier on BBH rose from 77.2 to 83.3 in 5 months · +6.1 points · latest leader DeepSeek V3 from DeepSeek.
Pink dots = frontier records · 2 total
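
A "frontier running max" can be read as the best score seen so far when models are ordered by release date; a model sets a frontier record only when it beats every earlier score. The sketch below illustrates that reading and is not the site's actual implementation. Only the scores 77.2 and 83.3 and the name DeepSeek V3 come from this page; the other model names, the middle score, and all dates are hypothetical placeholders.

```python
from datetime import date


def frontier_records(results):
    """Return the entries that set a new running maximum, in release order.

    `results` is an iterable of (release_date, model_name, score) tuples.
    """
    records = []
    best = float("-inf")
    for release_date, name, score in sorted(results, key=lambda r: r[0]):
        if score > best:  # strictly better than everything released earlier
            best = score
            records.append((release_date, name, score))
    return records


# Placeholder input: only 77.2, 83.3 and "DeepSeek V3" appear on the page.
# The other names, the 70.0 score, and every date are hypothetical.
results = [
    (date(2024, 7, 1), "model-a (hypothetical)", 77.2),
    (date(2024, 10, 1), "model-b (hypothetical)", 70.0),
    (date(2024, 12, 1), "DeepSeek V3", 83.3),
]

records = frontier_records(results)
# Gain between the first and the latest frontier record, rounded to one decimal.
gain = round(records[-1][2] - records[0][2], 1)
print(len(records), "frontier records, gain of", gain, "points")
```

With three models and one of them scoring below the earlier best, this yields two frontier records, which matches the "3 models" and "2 total" figures in the chart caption.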
Details
Category: Reasoning
Max score: 100
Models: 24
Updated: 2024-12-26

Same category · related evaluations