WeirdML

WeirdML tests models on unusual and adversarial machine learning tasks that require creative problem-solving beyond standard patterns.

Models Tested: 87
Top Score: 77.9
Average Score: 37.8
Rank  Organization      Score
1     Anthropic         77.9
2     OpenAI            72.2
3     OpenAI            72.2
4     Google DeepMind   72.1
5     Google            69.9
6     Anthropic         66.1
7     Anthropic         63.7
8     Google DeepMind   61.6
9     OpenAI            60.8
10    OpenAI            60.8
11    OpenAI            60.7
12    OpenAI            60.7
13    OpenAI            60.4
14    OpenAI            58.2
15    OpenAI            57.4
16    Google DeepMind   54.0
17    OpenAI            52.7
18    OpenAI            52.6
19    OpenAI            52.4
20    z-ai              48.2
21    z-ai              48.2
22    OpenAI            48.2
23    OpenAI            48.2
24    OpenAI            48.2
25    OpenAI            48.2
26    Anthropic         47.7
27    OpenAI            47.6
28    Anthropic         46.1
29    xAI               45.7
30    moonshotai        45.6
31    Anthropic         45.4
32    OpenAI            43.8
33    OpenAI            43.7
34    Anthropic         43.4
35    xAI               42.9
36    moonshotai        42.8
37    Anthropic         42.8
38    xAI               42.6
39    xAI               42.6
40    DeepSeek          41.6
41    Alibaba Qwen      41.2
42    Alibaba Qwen      41.2
43    Alibaba Qwen      41.0
44    Google DeepMind   41.0
45    z-ai              40.6
46    DeepSeek          39.5
47    OpenAI            39.4
48    moonshotai        39.4
49    OpenAI            39.0
50    Alibaba           38.7
51    DeepSeek          38.4
52    OpenAI            38.1
53    OpenAI            37.6
54    Alibaba Qwen      37.3
55    xAI               37.2
56    xAI               37.2
57    moonshotai        36.7
58    DeepSeek          36.5
59    OpenAI            36.3
60    Anthropic         31.0
61    Anthropic         30.7
62    OpenAI            25.1
63    Google            24.9
64    Meta              24.5
65    Anthropic         23.2
66    xAI               22.2
67    Google            22.2
68    Meta              21.4
69    OpenAI            19.0
70    OpenAI            18.0
71    Meta              14.4
72    Meta              14.4
73    OpenAI            12.4
74    OpenAI            12.4
75    OpenAI            12.4
76    OpenAI            12.4
77    OpenAI            11.8
78    OpenAI            11.8
79    Anthropic         10.2
80    Anthropic         9.8
81    Meta              9.0
82    Anthropic         7.1
83    OpenAI            3.5
84    OpenAI            3.5
85    OpenAI            3.5
86    Mistral AI        3.2
87    Meta              1.7
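
The three summary figures (Models Tested, Top Score, Average Score) appear to be a plain count, maximum, and arithmetic mean of the scores listed above; the 87 listed scores do average to 37.8 when rounded to one decimal. A minimal Python sketch of that calculation follows. The helper name `summarize` is hypothetical and is not part of any WeirdML API, and the sample uses only the five highest scores to keep the example short.

```python
from statistics import mean


def summarize(scores):
    """Return (models tested, top score, average score rounded to one decimal).

    Hypothetical helper for illustration; assumes the page reports a plain
    arithmetic mean rounded to one decimal place.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    return len(scores), max(scores), round(mean(scores), 1)


# The five highest scores from the leaderboard, used only as a small sample.
top_five = [77.9, 72.2, 72.2, 72.1, 69.9]
count, top, average = summarize(top_five)
print(f"Models Tested: {count}")
print(f"Top Score: {top}")
print(f"Average Score: {average}")
```

Passing the full list of 87 scores instead of the sample should give the figures shown at the top of the page.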