WeirdML
WeirdML tests models on unusual and adversarial machine learning tasks that require creative problem-solving beyond standard patterns.
The Frontier
Best score over time · one chart, every benchmark
Full rankings
70 models tested · sorted by score
Score distribution
Where models cluster
Correlated benchmarks
Pearson r · original research
Benchmarks that track with WeirdML
Pearson correlation across models scored on both benchmarks. Closer to 1 = strongly predictive.
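The correlation described above can be sketched as follows. This is a minimal illustration, not BenchGecko's actual pipeline: the function name `pearson_on_shared`, the dictionary layout, and the model names and scores in the example are all hypothetical placeholders. It assumes each benchmark is a mapping from model name to score and that only models present in both are compared.

```python
import math


def pearson_on_shared(a: dict[str, float], b: dict[str, float]) -> tuple[float, int]:
    """Return (Pearson r, number of shared models) for two score tables.

    Hypothetical helper: `a` and `b` map model name -> score on one benchmark.
    Only models that appear in both tables contribute to the correlation.
    """
    shared = sorted(a.keys() & b.keys())
    n = len(shared)
    if n < 2:
        raise ValueError("need at least two shared models")

    xs = [a[m] for m in shared]
    ys = [b[m] for m in shared]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("scores are constant on one benchmark; r is undefined")

    return cov / math.sqrt(var_x * var_y), n


# Made-up placeholder scores, NOT real leaderboard data.
benchmark_a = {"model-a": 20.0, "model-b": 45.0, "model-c": 70.0, "model-d": 33.0}
benchmark_b = {"model-a": 50.0, "model-b": 68.0, "model-c": 90.0, "model-e": 61.0}

r, n_shared = pearson_on_shared(benchmark_a, benchmark_b)
print(f"r = {r:.2f} across {n_shared} shared models")
```

Returning the shared-model count alongside r is a deliberate choice in this sketch: a high r computed from only a handful of models carries much less weight than the same r computed from dozens.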
Frequently asked
About WeirdML
What does WeirdML measure?
WeirdML tests models on unusual and adversarial machine learning tasks that require creative problem-solving beyond standard patterns. 70 AI models have been tested on it. Scores range from 1.7 to 79.3 out of 100.
Which model leads on WeirdML?
GPT-5.3-Codex from OpenAI leads WeirdML with a score of 79.3. The median score across 70 tested models is 40.2.
Is WeirdML saturated?
No. The top score is 79.3 out of 100 (79%), so there is still meaningful room for improvement on WeirdML.
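The arithmetic behind that answer is small enough to check directly. The helper below is a hypothetical sketch (the name `headroom` is not from the site), and it deliberately sets no saturation threshold, since the page does not state one. It only reports the share of the maximum reached and the points remaining.

```python
def headroom(top_score: float, max_score: float = 100.0) -> tuple[int, float]:
    """Return (percent of max reached, rounded; points remaining to max).

    Hypothetical helper for illustration; no saturation cutoff is assumed.
    """
    if max_score <= 0:
        raise ValueError("max_score must be positive")
    percent = round(top_score / max_score * 100)
    remaining = max_score - top_score
    return percent, remaining


# Top WeirdML score and max score as listed on this page.
percent, remaining = headroom(79.3, 100.0)
print(f"{percent}% of max reached, {remaining:.1f} points of headroom")
```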
Does WeirdML predict performance on other benchmarks?
Yes. WeirdML scores have a Pearson correlation of 0.93 with BBH across the 5 models scored on both benchmarks. Models that do well on WeirdML tend to do well on BBH, though 5 models is a small sample.
How often is WeirdML data refreshed?
BenchGecko pulls updates daily, so new model scores on WeirdML appear within a day of being published by Epoch AI or the model provider.
- Category: Code
- Max score: 100
- Models: 70
- Updated: 2026-03-05
Top on WeirdML
- GPT-5.3-Codex · 79.3
- Claude Opus 4.6 · 77.9
- GPT-5.2 · 72.2
- Gemini 3.1 Pro Preview · 72.1
- Gemini 3 Pro · 69.9
More code benchmarks
Same category · related evaluations