Benchmark · Reasoning

ANLI

ANLI (Adversarial NLI) · an adversarially constructed natural language inference dataset in which each round targets weaknesses found in previous model generations.

Updated 2024-04-18

Models tested · 9

Top score · 37.1 · GPT-3.5 Turbo (older v0613)

Median · 32.8 · min 13.8

Top-5 spread · σ 1.8 · settled

The Frontier

Best score over time

Not enough score history on ANLI to compute a frontier yet · 0 frontier records.

Distribution

Where models cluster

Correlated benchmarks

Pearson r · original research

Correlation analysis

Benchmarks that track with ANLI

Pearson correlation across models scored on both benchmarks. Closer to 1 = strongly predictive.

Winogrande · Reasoning

OpenBookQA · Knowledge

HellaSwag · Reasoning

TriviaQA · Knowledge
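
The Pearson correlation described above can be sketched in a few lines. This is a minimal illustration, not BenchGecko's actual pipeline: `pearson_r` is a hypothetical helper, the ANLI scores are copied from the rankings on this page, and `other_scores` is a synthetic placeholder (a linear transform of the ANLI scores), since the page does not list the paired scores for the other benchmarks.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equally long lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


# ANLI scores as listed in the rankings on this page.
anli_scores = [37.1, 37.1, 36.0, 33.7, 32.8, 29.2, 23.1, 20.6, 13.8]

# Synthetic placeholder, NOT real benchmark data: a perfectly linear
# function of the ANLI scores, so r should come out as 1.
other_scores = [2 * s + 10 for s in anli_scores]

r = pearson_r(anli_scores, other_scores)
```

With real paired scores in place of `other_scores`, a value of `r` near 1 would mean models that rank high on ANLI also tend to rank high on the other benchmark. With only 9 shared models, such an estimate is sensitive to a single outlier.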

Full rankings

9 models tested · sorted by score

#	Model	Provider	Score	Price
1	GPT-3.5 Turbo (older v0613)	OpenAI	37.1	$1.00
2	phi-3-small 7.4B	Microsoft	37.1	—
3	Llama 3 8B Instruct	Meta	36.0	$0.03
4	phi-3-medium 14B	Microsoft	33.7	—
5	Mixtral 8x7B Instruct	Mistral AI	32.8	$0.54
6	phi-3-mini 3.8B	Microsoft	29.2	—
7	Gemma 2B	Google DeepMind	23.1	—
8	Mistral 7B V0.1	Mistral AI	20.6	—
9	Phi 2	Microsoft	13.8	—
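
The summary figures shown on this page (models tested, top score, median, minimum, top-5 spread) can be recomputed from the rankings above. The sketch below assumes the top-5 spread is a population standard deviation; the page does not state which estimator it uses, but that choice reproduces the displayed σ 1.8, whereas the sample standard deviation would round to 2.0. The variable names are illustrative only.

```python
from statistics import median, pstdev

# Scores copied from the rankings table on this page.
anli_scores = {
    "GPT-3.5 Turbo (older v0613)": 37.1,
    "phi-3-small 7.4B": 37.1,
    "Llama 3 8B Instruct": 36.0,
    "phi-3-medium 14B": 33.7,
    "Mixtral 8x7B Instruct": 32.8,
    "phi-3-mini 3.8B": 29.2,
    "Gemma 2B": 23.1,
    "Mistral 7B V0.1": 20.6,
    "Phi 2": 13.8,
}

scores = sorted(anli_scores.values(), reverse=True)

summary = {
    "models_tested": len(scores),
    "top_score": scores[0],
    "median": median(scores),
    "min": scores[-1],
    # Assumption: population standard deviation over the five best scores.
    "top5_sigma": round(pstdev(scores[:5]), 1),
}

# Models sharing the top score at the displayed precision.
leaders = [name for name, s in anli_scores.items() if s == scores[0]]
```

Here `leaders` contains two entries, because GPT-3.5 Turbo (older v0613) and phi-3-small 7.4B both show 37.1 at the precision the page reports.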

Frequently asked

Pulled from the ANLI dataset · updated daily

What does ANLI measure?

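ANLI measures natural language inference on adversarially constructed examples, where each round targets weaknesses found in previous model generations. It is listed as a reasoning benchmark in the BenchGecko catalog. 9 AI models have been tested on it, with scores ranging from 13.8 to 37.1 out of 100.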

Which model leads on ANLI?

GPT-3.5 Turbo (older v0613) from OpenAI is listed first on ANLI with a score of 37.1, level with phi-3-small 7.4B from Microsoft at the displayed precision. The median score across 9 tested models is 32.8.

Is ANLI saturated?

No · the top score is 37.1 out of 100 (37%). There is still meaningful room for improvement on ANLI.

Does ANLI predict performance on other benchmarks?

Yes · ANLI scores correlate 0.87 with Winogrande across 9 shared models. Models that do well on ANLI tend to do well on Winogrande.

How often is ANLI data refreshed?

BenchGecko pulls updates daily. New model scores on ANLI appear as soon as they are published by Epoch AI or the model provider.

