Benchmark · Knowledge

MUSR

Updated 2025-07-24
Models tested · 73
Top score · 28.7 (DeepSeek R1 Distill Qwen 14B)
Median · 9.7 (min 0.5)
Top-5 spread · σ 3.7 (competitive)

Best score over time · one chart, every benchmark

[Chart: MUSR · 48 models · frontier running max · y-axis: score (0 to 100) · x-axis: release date, Apr 24 to Jul 25 · benchgecko.ai/benchmark/hf-musr]
Frontier on MUSR rose from 16.0 to 28.7 in 9 months · +12.7 points · latest leader DeepSeek R1 Distill Qwen 14B from DeepSeek.
Pink dots = frontier records · 4 total
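
A minimal sketch of how a "frontier running max" series like the one charted above could be derived: sort models by release date and keep each model that beats every earlier score. The function name `frontier_records` and the sample rows are hypothetical placeholders for illustration, not BenchGecko's code or data.

```python
from datetime import date

def frontier_records(releases):
    """Return the (release_date, model, score) rows that set a new best score.

    `releases` is an iterable of (release_date, model, score) tuples.
    A tie with the current best does not count as a new record.
    """
    records = []
    best = float("-inf")
    # Sorting tuples orders rows by release date first.
    for released, model, score in sorted(releases):
        if score > best:
            best = score
            records.append((released, model, score))
    return records

# Hypothetical sample rows (placeholder names and scores, not real results).
sample = [
    (date(2024, 4, 1), "model-a", 10.0),
    (date(2024, 6, 1), "model-b", 8.0),
    (date(2024, 8, 1), "model-c", 15.0),
    (date(2024, 12, 1), "model-d", 12.0),
    (date(2025, 3, 1), "model-e", 20.0),
]

for released, model, score in frontier_records(sample):
    print(released.isoformat(), model, score)
```

Each returned row would correspond to one pink dot; the frontier gain is the last record's score minus the first.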

Where models cluster

[Chart: score distribution · models per score bucket (0 to 100) · 0–10: 38 · 10–20: 33 · 20–30: 2 · no models above 30 · median 9.7 · benchgecko.ai]

Pearson r · original research

Correlation analysis

Benchmarks that track with MUSR

Pearson correlation across models scored on both benchmarks. Values closer to 1 mean scores on the two benchmarks rise together more consistently, so one is more predictive of the other.
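
As an illustration of the calculation described above, the sketch below restricts two score tables to the models present in both and computes Pearson r over those pairs. The helper names (`pearson_r`, `shared_correlation`) and the toy scores are assumptions made for this example; the page does not publish its own implementation.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)

def shared_correlation(scores_a, scores_b):
    """Return (number of shared models, Pearson r) for two {model: score} dicts."""
    shared = sorted(scores_a.keys() & scores_b.keys())
    xs = [scores_a[m] for m in shared]
    ys = [scores_b[m] for m in shared]
    return len(shared), pearson_r(xs, ys)

# Toy scores with placeholder model names (not real benchmark results).
bench_a = {"model-a": 1.0, "model-b": 2.0, "model-c": 3.0, "model-d": 4.0}
bench_b = {"model-b": 5.0, "model-c": 7.0, "model-d": 9.0, "model-e": 2.0}

n_shared, r = shared_correlation(bench_a, bench_b)
print(n_shared, round(r, 2))
```

With only a handful of shared models the coefficient is sensitive to individual outliers, so the shared-model count is worth reporting alongside r.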

73 models tested · sorted by score

# · Model · Provider · Score
1 · DeepSeek R1 Distill Qwen 14B · DeepSeek · 28.7
2 · Hermes 3 70B Instruct · nousresearch · 23.4
3 · Llama 3 8B Instruct · Meta · 19.9
4 · Qwen2-72B · Alibaba Qwen · 19.7
5 · Stable Beluga 2 · - · 18.6
6 · Llama 3.1 70B Instruct · Meta · 17.7
7 · Dolphin 2.9.1 Yi 1.5 34b · - · 17.0
8 · R1 Distill Qwen 32B · DeepSeek · 16.1
9 · Meta Llama 3 8B · Meta · 16.0
10 · Llama 3.3 70B Instruct · Meta · 15.6
11 · Gpt2 · OpenAI · 15.3
12 · WizardLM-2 8x22B · Microsoft · 14.5
13 · GLM 4 32B · z-ai · 14.2
14 · Phi 2 · Microsoft · 13.8
15 · Qwen2.5 Coder 32B Instruct · Alibaba Qwen · 13.7
16 · Qwen2 VL 7B Instruct · Alibaba · 13.6
17 · Qwen2.5 32B Instruct · Alibaba · 13.5
18 · Magnum v4 72B · anthracite-org · 13.4
19 · R1 Distill Llama 70B · DeepSeek · 13.3
20 · Phi 3 Mini 4k Instruct · Microsoft · 13.1
21 · Qwen2.5 72B Instruct Abliterated · - · 12.3
22 · Qwen2 1.5B Instruct · Alibaba · 12.0
23 · Qwen2.5 72B Instruct · Alibaba Qwen · 11.7
24 · Vicuna 7b V1.5 · - · 11.4
25 · Gemma 2 2b · Google DeepMind · 11.3
26 · Hermes 2 Pro - Llama-3 8B · nousresearch · 11.3
27 · Distilgpt2 · - · 11.2
28 · QwQ 32B · Alibaba Qwen · 11.1
29 · Gemma 2B · Google DeepMind · 11.0
30 · Mistral 7B V0.1 · Mistral AI · 10.7
31 · Pythia 160m · eleutherai · 10.7
32 · Qwen2.5 14B Instruct · Alibaba · 10.6
33 · Phi 4 · Microsoft · 10.1
34 · Phi 3.5 Mini Instruct · Microsoft · 10.1
35 · SmolLM2 135M · - · 10.0
36 · Gemma 2 9B · Google DeepMind · 9.7
37 · Yi 6B · - · 9.7
38 · Qwen2.5 Coder 7B Instruct · Alibaba Qwen · 9.5
39 · Gemma 2 27B · Google DeepMind · 9.1
40 · Llama 3.1 8B Instruct · Meta · 8.5
41 · Qwen2.5 7B Instruct · Alibaba Qwen · 8.4
42 · Mistral 7B Instruct V0.2 · Mistral AI · 7.6
43 · Qwen2.5 3B Instruct · Alibaba · 7.6
44 · Falcon-180B · TII · 7.5
45 · Qwen2 7B Instruct · Alibaba · 7.4
46 · Gemma 2 2b It · Google DeepMind · 7.1
47 · Qwen2.5 Coder 14B Instruct · Alibaba · 7.0
48 · Phi 4 Mini Instruct · Microsoft · 6.5
49 · Gpt2 Medium · OpenAI · 6.2
50 · Mistral 7B Instruct v0.1 · Mistral AI · 6.1
51 · Gpt2 Large · OpenAI · 5.7
52 · Qwen2 0.5B · Alibaba · 4.6
53 · TinyLlama 1.1B Chat V1.0 · - · 4.3
54 · INTELLECT-1 · - · 4.1
55 · Llama 3.2 3B Instruct (free) · Meta · 3.8
56 · Llama 2 7b Hf · Meta · 3.8
57 · SmolLM2 135M Instruct · - · 3.7
58 · DeepSeek R1 Distill Qwen 7B · DeepSeek · 3.5
59 · Phi-1.5 · Microsoft · 3.4
60 · Llama 2 7b Chat Hf · Meta · 3.3
61 · Qwen2.5 1.5B Instruct · Alibaba · 3.2
62 · DeepSeek R1 Distill Qwen 1.5B · DeepSeek · 3.0
63 · StarCoder 2 15B · - · 2.9
64 · MPT-30B · - · 2.9
65 · Gpt Neo 125m · eleutherai · 2.6
66 · Qwen2 0.5B Instruct · Alibaba · 2.4
67 · Llama 3.1 405B · Meta · 2.2
68 · LLaMA-13B · Meta · 2.0
69 · Llama 3.2 1B Instruct · Meta · 1.9
70 · Meta Llama 3 8B Instruct · Meta · 1.6
71 · Llama 3.2 3B Instruct · Meta · 1.4
72 · Qwen2.5 0.5B Instruct · Alibaba · 1.4
73 · DeepSeek R1 Distill Llama 8B · DeepSeek · 0.5

Pulled from the MUSR dataset · updated daily
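
The headline figures on this page (top score, median, minimum, top-5 spread, bucket counts) can be reproduced from the 73 scores listed above. The sketch below is one way to do that; it assumes the top-5 spread is a population standard deviation and that a score on a bucket boundary (for example 10.0) falls into the higher bucket, since both assumptions match the published numbers. The function names are illustrative.

```python
import statistics

# The 73 scores from the leaderboard above, in rank order.
SCORES = [
    28.7, 23.4, 19.9, 19.7, 18.6, 17.7, 17.0, 16.1, 16.0, 15.6,
    15.3, 14.5, 14.2, 13.8, 13.7, 13.6, 13.5, 13.4, 13.3, 13.1,
    12.3, 12.0, 11.7, 11.4, 11.3, 11.3, 11.2, 11.1, 11.0, 10.7,
    10.7, 10.6, 10.1, 10.1, 10.0, 9.7, 9.7, 9.5, 9.1, 8.5,
    8.4, 7.6, 7.6, 7.5, 7.4, 7.1, 7.0, 6.5, 6.2, 6.1,
    5.7, 4.6, 4.3, 4.1, 3.8, 3.8, 3.7, 3.5, 3.4, 3.3,
    3.2, 3.0, 2.9, 2.9, 2.6, 2.4, 2.2, 2.0, 1.9, 1.6,
    1.4, 1.4, 0.5,
]

def summarize(scores):
    """Compute the headline statistics from a list of scores."""
    ordered = sorted(scores, reverse=True)
    return {
        "count": len(ordered),
        "top": ordered[0],
        "min": ordered[-1],
        "median": statistics.median(ordered),
        # Assumption: spread = population standard deviation of the top five.
        "top5_sigma": round(statistics.pstdev(ordered[:5]), 1),
    }

def bucket_counts(scores, width=10):
    """Count scores per bucket, keyed by each bucket's lower bound."""
    counts = {}
    for score in scores:
        lower = int(score // width) * width
        counts[lower] = counts.get(lower, 0) + 1
    return counts

print(summarize(SCORES))
print(bucket_counts(SCORES))
```

Under these assumptions the output matches the page: 73 models, top 28.7, minimum 0.5, median 9.7, σ 3.7, and 38 / 33 / 2 models in the 0–10, 10–20, and 20–30 buckets.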

What does MUSR measure?

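MUSR (Multistep Soft Reasoning) tests multistep reasoning over long natural-language narratives such as murder mysteries. BenchGecko lists it under the knowledge category. 73 AI models have been tested on it, with scores ranging from 0.5 to 28.7 out of 100.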

Which model leads on MUSR?

DeepSeek R1 Distill Qwen 14B from DeepSeek leads MUSR with a score of 28.7. The median score across 73 tested models is 9.7.

Is MUSR saturated?

No · the top score is 28.7 out of 100 (29%). There is still meaningful room for improvement on MUSR.

Does MUSR predict performance on other benchmarks?

Yes · MUSR scores correlate with OpenBookQA at Pearson r = 0.86 across 9 shared models. Models that do well on MUSR tend to do well on OpenBookQA, though 9 models is a small sample.

How often is MUSR data refreshed?

BenchGecko pulls updates daily. New model scores on MUSR appear as soon as they are published by Epoch AI or the model provider.
