ANLI
ANLI (Adversarial NLI) · adversarially constructed natural language inference dataset where each round targets weaknesses found in previous model generations.
Full rankings
9 models tested · sorted by score
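| # | Model | Score |
|---|---|---|
| 1 | GPT-3.5 Turbo (older v0613) | 37.1 |
| 2 | phi-3-small 7.4B | 37.1 |
| 3 | Llama 3 8B Instruct | 36.0 |
| 4 | phi-3-medium 14B | 33.7 |
| 5 | Mixtral 8x7B Instruct | 32.8 |
| 6 | | 29.2 |
| 7 | | 23.1 |
| 8 | | 20.6 |
| 9 | | 13.8 |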
Correlated benchmarks
Pearson r · original research
Benchmarks that track with ANLI
Pearson correlation across models scored on both benchmarks. Closer to 1 = strongly predictive.
Frequently asked
About ANLI
What does ANLI measure?
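ANLI (Adversarial NLI) is an adversarially constructed natural language inference dataset in which each round targets weaknesses found in previous model generations. It tests whether a model can decide if a hypothesis is entailed by, contradicts, or is neutral toward a given premise. 9 AI models have been tested on it, and scores range from 13.8 to 37.1 out of 100.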
Which model leads on ANLI?
GPT-3.5 Turbo (older v0613) from OpenAI leads ANLI with a score of 37.1, tied with phi-3-small 7.4B at the same score. The median score across the 9 tested models is 32.8.
Is ANLI saturated?
No · the top score is 37.1 out of 100 (37%). There is still meaningful room for improvement on ANLI.
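The range, median, and headroom figures quoted in these answers can be reproduced from the nine scores in the rankings table. The following is a minimal sketch; the function name `summarize` and the `top_fraction` key are illustrative choices, not part of any BenchGecko tooling.

```python
from statistics import median

# Scores copied from the rankings table on this page (9 models).
scores = [37.1, 37.1, 36.0, 33.7, 32.8, 29.2, 23.1, 20.6, 13.8]
MAX_SCORE = 100  # "Max score" as listed in the page metadata


def summarize(scores, max_score):
    """Return the summary figures used in the FAQ answers."""
    top = max(scores)
    return {
        "count": len(scores),
        "min": min(scores),
        "max": top,
        "median": median(scores),         # middle value of the sorted scores
        "top_fraction": top / max_score,  # share of the maximum reached by the leader
    }


summary = summarize(scores, MAX_SCORE)
print(summary)
```

A top fraction well below 1 is what the "not saturated" answer rests on. Where to draw the saturation threshold is a judgment call that this sketch does not encode.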
Does ANLI predict performance on other benchmarks?
Yes · ANLI scores have a Pearson correlation of 0.87 with Winogrande across 9 shared models. Models that do well on ANLI tend to do well on Winogrande.
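For readers who want to see how a Pearson correlation across shared models is computed, here is a minimal sketch. The ANLI scores come from the rankings table. The second list is a hypothetical placeholder, not real Winogrande data, so the printed value will not match the 0.87 reported here. The function name `pearson_r` is also an illustrative choice.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the mean, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / sqrt(var_x * var_y)


# ANLI scores from the rankings table, one entry per model.
anli = [37.1, 37.1, 36.0, 33.7, 32.8, 29.2, 23.1, 20.6, 13.8]

# HYPOTHETICAL placeholder scores for a second benchmark, in the same model order.
# Replace with real per-model scores before drawing any conclusion.
other = [80.0, 78.0, 76.0, 75.0, 77.0, 70.0, 65.0, 66.0, 55.0]

r = pearson_r(anli, other)
print(f"Pearson r = {r:.2f}")
```

Both lists must be aligned so that position i refers to the same model. Models scored on only one of the two benchmarks are dropped before computing r.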
How often is ANLI data refreshed?
BenchGecko pulls updates daily. New model scores on ANLI appear as soon as they are published by Epoch AI or the model provider.
- Category: Reasoning
- Max score: 100
- Models: 9
- Updated: 2024-04-18
Top on ANLI
- GPT-3.5 Turbo (older v0613) · 37.1
- phi-3-small 7.4B · 37.1
- Llama 3 8B Instruct · 36.0
- phi-3-medium 14B · 33.7
- Mixtral 8x7B Instruct · 32.8
More reasoning benchmarks
Same category · related evaluations