Claude 3.5 Sonnet
By Anthropic · Released 2024-01-01
Average score: 42.3
Input price: N/A
Output price: N/A
Context window: N/A
Type: text
Tested on 25 benchmarks with a 42.3% average. Top scores: Chatbot Arena Elo — Overall (1371.4, an Elo rating rather than a percentage), HELM — IFEval (85.6%), Aider — Code Editing (84.2%).
Benchmark scores
| Benchmark | Category | Score |
|---|---|---|
| Chatbot Arena Elo — Overall | arena | 1371.4 |
| HELM — IFEval | language | 85.6 |
| Aider — Code Editing | coding | 84.2 |
| MMLU | knowledge | 82.0 |
| Lech Mazur Writing | knowledge | 80.3 |
| HELM — WildBench | reasoning | 79.2 |
| HELM — MMLU-Pro | knowledge | 77.7 |
| GeoBench | knowledge | 62.0 |
| HELM — GPQA | knowledge | 56.5 |
| MATH level 5 | math | 51.7 |
| Aider polyglot | coding | 51.6 |
| CadEval | coding | 48.0 |
| VideoMME | multimodal | 46.7 |
| GPQA diamond | knowledge | 38.7 |
| Balrog | knowledge | 32.6 |
| WeirdML | coding | 31.0 |
| HELM — Omni-MATH | math | 27.6 |
| The Agent Company | agentic | 24.0 |
| Cybench | coding | 17.5 |
| SimpleBench | reasoning | 13.0 |
| Fortress | safety | 13.0 |
| OTIS Mock AIME 2024-2025 | math | 6.4 |
| GSO-Bench | coding | 4.6 |
| FrontierMath-2025-02-28-Private | math | 1.0 |
| FrontierMath-Tier-4-2025-07-01-Private | math | 0.1 |
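The 42.3 average quoted in the summary is not the mean of all 25 rows, since the Chatbot Arena Elo rating sits on a different scale from the other scores. The page does not state how the average is aggregated, so the following is only a minimal Python sketch under the assumption that it is an unweighted mean of the 24 percentage-scale scores. The names `scores` and `mean_excluding` are illustrative and do not come from the source.

```python
# Scores copied from the benchmark table; keys are the benchmark names.
scores = {
    "Chatbot Arena Elo — Overall": 1371.4,
    "HELM — IFEval": 85.6,
    "Aider — Code Editing": 84.2,
    "MMLU": 82.0,
    "Lech Mazur Writing": 80.3,
    "HELM — WildBench": 79.2,
    "HELM — MMLU-Pro": 77.7,
    "GeoBench": 62.0,
    "HELM — GPQA": 56.5,
    "MATH level 5": 51.7,
    "Aider polyglot": 51.6,
    "CadEval": 48.0,
    "VideoMME": 46.7,
    "GPQA diamond": 38.7,
    "Balrog": 32.6,
    "WeirdML": 31.0,
    "HELM — Omni-MATH": 27.6,
    "The Agent Company": 24.0,
    "Cybench": 17.5,
    "SimpleBench": 13.0,
    "Fortress": 13.0,
    "OTIS Mock AIME 2024-2025": 6.4,
    "GSO-Bench": 4.6,
    "FrontierMath-2025-02-28-Private": 1.0,
    "FrontierMath-Tier-4-2025-07-01-Private": 0.1,
}


def mean_excluding(scores, excluded):
    """Unweighted mean of all scores whose benchmark name is not in `excluded`."""
    kept = [value for name, value in scores.items() if name not in excluded]
    return sum(kept) / len(kept)


# Assumption: the Elo rating is left out because it is not a percentage.
average = mean_excluding(scores, {"Chatbot Arena Elo — Overall"})
print(round(average, 1))  # 42.3

# Including the Elo rating gives a very different figure.
print(round(mean_excluding(scores, set()), 1))  # 95.5
```

Under this assumption the 24 percentage-scale scores sum to 1015.0, which gives 1015.0 / 24 ≈ 42.3 and matches the reported average. Including the Elo rating would instead give about 95.5.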
Similar models
Google DeepMind: 42.2
Google DeepMind: 42.2
Anthropic: 42.1
Meta: 42.5