DeepResearch Bench
DeepResearch Bench evaluates AI on complex, multi-step research tasks that require gathering information, synthesizing it, and producing comprehensive analyses.
The Frontier
Best score over time · one chart, every benchmark
Full rankings
13 models tested · sorted by score
| # | Model | Score |
|---|---|---|
| 1 | GPT-5 | 55.1 |
| 2 | Claude Sonnet 4.5 | 52.6 |
| 3 | Gemini 2.5 Pro | 49.7 |
| 4 | Claude Opus 4.1 | 49.7 |
| 5 | Claude Opus 4 | 49.0 |
| 6 | | 47.9 |
| 7 | | 47.8 |
| 8 | | 46.6 |
| 9 | | 43.6 |
| 10 | | 35.1 |
| 11 | | 35.1 |
| 12 | | 29.3 |
| 13 | | 29.2 |
Score distribution
Where models cluster
Correlated benchmarks
Pearson r · original research
Benchmarks that track with DeepResearch Bench
Pearson correlation across models scored on both benchmarks. Closer to 1 = strongly predictive.
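As a rough illustration of how such a figure can be computed, the sketch below restricts two benchmarks to the models they share and applies the standard Pearson formula. The model names and scores are hypothetical placeholders, not values from this page, and the helper `pearson_r` is an illustrative name rather than anything BenchGecko publishes.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and of squared deviations from each mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


# Hypothetical scores keyed by model name (placeholders, not real data).
bench_a = {"model-a": 55.0, "model-b": 50.0, "model-c": 45.0, "model-d": 30.0}
bench_b = {"model-a": 40.0, "model-b": 35.0, "model-c": 30.0, "model-e": 10.0}

# Only models scored on both benchmarks enter the correlation.
shared = sorted(bench_a.keys() & bench_b.keys())
r = pearson_r([bench_a[m] for m in shared], [bench_b[m] for m in shared])
print(f"{len(shared)} shared models, r = {r:.2f}")
```

With only a handful of shared models, a single outlier can move r substantially, so the number of shared models is worth reading alongside the coefficient.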
Frequently asked
About DeepResearch Bench
What does DeepResearch Bench measure?
DeepResearch Bench evaluates AI on complex, multi-step research tasks that require gathering information, synthesizing it, and producing comprehensive analyses. 13 AI models have been tested on it, with scores ranging from 29.2 to 55.1 out of 100.
Which model leads on DeepResearch Bench?
GPT-5 from OpenAI leads DeepResearch Bench with a score of 55.1. The median score across 13 tested models is 47.8.
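The range and median quoted here can be rechecked from the 13 scores in the rankings table. A minimal sketch, assuming the scores are simply copied into a list by hand:

```python
from statistics import median

# The 13 scores from the rankings table, in rank order.
scores = [55.1, 52.6, 49.7, 49.7, 49.0, 47.9, 47.8,
          46.6, 43.6, 35.1, 35.1, 29.3, 29.2]

summary = {
    "count": len(scores),
    "min": min(scores),
    "max": max(scores),
    # With an odd count, the median is the middle (7th) value.
    "median": median(scores),
}
print(summary)
```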
Is DeepResearch Bench saturated?
No. The top score is 55.1 out of 100 (about 55%), so there is still meaningful room for improvement on DeepResearch Bench.
Does DeepResearch Bench predict performance on other benchmarks?
Yes. DeepResearch Bench scores correlate with Cybench at Pearson r = 0.96 across 6 shared models, so models that do well on DeepResearch Bench tend to do well on Cybench.
How often is DeepResearch Bench data refreshed?
BenchGecko pulls updates daily. New model scores on DeepResearch Bench appear as soon as they are published by Epoch AI or the model provider.
- Category: Knowledge
- Max score: 100
- Models: 13
- Updated: 2025-09-29
Top on DeepResearch Bench
- GPT-5 · 55.1
- Claude Sonnet 4.5 · 52.6
- Gemini 2.5 Pro · 49.7
- Claude Opus 4.1 · 49.7
- Claude Opus 4 · 49.0
More knowledge benchmarks
Same category · related evaluations