Benchmark · Knowledge · Settled

HELM · WildBench

Updated 2026-01-21
Models tested: 34
Top score: 86.3 (GPT-5.1)
Median: 81.6 (min 65.1)
Top-5 spread: σ 0.2 · Settled

Best score over time · one chart, every benchmark

[Chart: HELM · WildBench · 30 models · frontier running max. Y-axis: score, 0 to 100, higher is better. X-axis: release date, Jul 24 to Jan 26. Source: benchgecko.ai/benchmark/helm-wildbench]
Frontier on HELM · WildBench rose from 79.1 to 86.3 in 16 months · +7.2 points · latest leader GPT-5.1 from OpenAI.
Pink dots = frontier records · 9 total
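
A "frontier running max" chart of this kind can be sketched as follows: sort results by release date and keep only the entries that beat every earlier score. This is a minimal illustration, not the site's actual method. The function name `frontier_running_max`, the model names, the dates, and the intermediate scores are all hypothetical; only the endpoints 79.1 and 86.3 come from the caption above.

```python
from datetime import date


def frontier_running_max(results):
    """Return the (release_date, name, score) entries that set a new best score.

    `results` is an iterable of (release_date, name, score) tuples.
    """
    records = []
    best = float("-inf")
    # Walk the models in release order and keep each new record holder.
    for release, name, score in sorted(results, key=lambda r: r[0]):
        if score > best:
            best = score
            records.append((release, name, score))
    return records


# Hypothetical data: names, dates, and middle scores are made up for illustration.
results = [
    (date(2024, 9, 1), "model-a", 79.1),
    (date(2024, 12, 1), "model-b", 78.0),  # below the frontier, not a record
    (date(2025, 4, 1), "model-c", 82.5),
    (date(2025, 9, 1), "model-d", 81.0),   # below the frontier, not a record
    (date(2026, 1, 1), "model-e", 86.3),
]

records = frontier_running_max(results)
# Gain between the first and the latest frontier record, rounded to one decimal.
gain = round(records[-1][2] - records[0][2], 1)
print(len(records), gain)
```

With real leaderboard data, `records` would correspond to the pink dots on the chart and `gain` to the point increase quoted in the caption.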

Same category · related evaluations