Humanity's Last Exam
Humanity's Last Exam (HLE) is a multi-modal academic benchmark of 2,500 questions spanning mathematics, the humanities, and the natural sciences, designed to test LLM capabilities at the frontier of human knowledge with unambiguous, verifiable solutions.
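For readers who want to inspect the benchmark itself, the questions are distributed as a public dataset. Below is a minimal Python sketch for loading it locally; the Hugging Face `datasets` library and the dataset id `cais/hle` are assumptions here (access may be gated behind the dataset's terms), and the record fields named in the comments are illustrative, not confirmed by this page.

```python
# Minimal sketch: load HLE for local inspection.
# ASSUMPTIONS: the `datasets` library is installed and the dataset
# id "cais/hle" with a "test" split is correct; record fields are
# illustrative.
from datasets import load_dataset

hle = load_dataset("cais/hle", split="test")
print(len(hle))   # the benchmark contains 2,500 questions
print(hle[0])     # one record: question text, answer, any attached image
```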
Progress Over Time
[Interactive timeline of model performance on Humanity's Last Exam, plotting open and proprietary models against a state-of-the-art frontier line.]
Humanity's Last Exam Leaderboard
59 models • 0 verified
| Rank | Organization | Params | Context | Input $/1M tok | Output $/1M tok |
|---|---|---|---|---|---|
| 1 | Anthropic | — | 1.0M | $5.00 | $25.00 |
| 2 | Google | — | 1.0M | $2.50 | $15.00 |
| 3 | Moonshot AI | 1.0T | — | — | — |
| 4 | xAI | — | — | — | — |
| 5 | Moonshot AI | 1.0T | 262K | $0.60 | $2.50 |
| 6 | Anthropic | — | 200K | $3.00 | $15.00 |
| 7 | Alibaba Cloud / Qwen Team | 27B | — | — | — |
| 8 | Alibaba Cloud / Qwen Team | 122B | 262K | $0.40 | $3.20 |
| 9 | Alibaba Cloud / Qwen Team | 35B | 262K | $0.25 | $2.00 |
| 10 | Google | — | — | — | — |
| 11 | Google | — | 1.0M | $0.50 | $3.00 |
| 12 | Zhipu AI | 358B | 205K | $0.60 | $2.20 |
| 13 | xAI | — | — | — | — |
| 14 | OpenAI | — | 1.0M | $2.50 | $15.00 |
| 15 | Baidu | — | — | — | — |
| 16 | OpenAI | — | 400K | $21.00 | $168.00 |
| 17 | OpenAI | — | 400K | $1.75 | $14.00 |
| 18 | DeepSeek | 685B | — | — | — |
| 19 | Alibaba Cloud / Qwen Team | 397B | 262K | $0.60 | $3.60 |
| 20 | OpenAI | — | 400K | $0.75 | $4.50 |
| 21 | Meituan | 560B | 128K | $0.30 | $1.20 |
| 22 | DeepSeek | 685B | — | — | — |
| 23 | OpenAI | — | 400K | $1.25 | $10.00 |
| 24 | OpenAI | — | 400K | $0.20 | $1.25 |
| 25 | — | 120B | 262K | $0.10 | $0.50 |
| 26 | Xiaomi | 309B | 256K | $0.10 | $0.30 |
| 27 | MiniMax | 230B | 1.0M | $0.30 | $1.20 |
| 28 | — | — | 1.0M | $1.25 | $10.00 |
| 29 | xAI | — | 2.0M | $0.20 | $0.50 |
| 30 | DeepSeek | 685B | — | — | — |
| 31 | Alibaba Cloud / Qwen Team | 235B | 262K | $0.30 | $3.00 |
| 32 | Google | — | 1.0M | $1.25 | $10.00 |
| 33 | DeepSeek | 671B | 131K | $0.50 | $2.15 |
| 34 | Zhipu AI | 357B | 131K | $0.55 | $2.19 |
| 35 | OpenAI | — | 400K | $0.25 | $2.00 |
| 36 | Google | — | 1.0M | $0.25 | $1.50 |
| 37 | DeepSeek | 671B | 164K | $0.27 | $1.00 |
| 38 | — | 32B | 262K | $0.06 | $0.24 |
| 39 | OpenAI | 117B | 131K | $0.09 | $0.45 |
| 40 | OpenAI | — | 200K | $2.00 | $8.00 |
| 40 | OpenAI | — | 200K | $1.10 | $4.40 |
| 42 | Zhipu AI | 30B | 128K | $0.07 | $0.40 |
| 42 | Zhipu AI | 355B | 131K | $0.40 | $1.60 |
| 44 | Alibaba Cloud / Qwen Team | 236B | 262K | $0.45 | $3.49 |
| 45 | MiniMax | 230B | 1.0M | $0.30 | $1.20 |
| 46 | Sarvam AI | 105B | — | — | — |
| 47 | Google | — | 1.0M | $0.30 | $2.50 |
| 48 | OpenAI | 21B | 131K | $0.05 | $0.20 |
| 49 | Zhipu AI | 106B | — | — | — |
| 50 | Mistral AI | 24B | — | — | — |
Showing entries 1-50 of 59.
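The cost columns above pair an input price with an output price; the values match the common convention of USD per million tokens, which the following sketch assumes. The per-question token counts are illustrative guesses, not measurements, so the result is only a back-of-the-envelope estimate of what one full HLE run might cost.

```python
# Back-of-the-envelope cost estimate for one HLE run.
# ASSUMPTIONS: prices are USD per 1M input / 1M output tokens;
# the token counts per question below are illustrative guesses.
N_QUESTIONS = 2500
AVG_INPUT_TOKENS = 500    # prompt + question (assumed)
AVG_OUTPUT_TOKENS = 2000  # reasoning + answer (assumed)

def run_cost(in_price_per_m: float, out_price_per_m: float) -> float:
    input_cost = N_QUESTIONS * AVG_INPUT_TOKENS / 1e6 * in_price_per_m
    output_cost = N_QUESTIONS * AVG_OUTPUT_TOKENS / 1e6 * out_price_per_m
    return input_cost + output_cost

# Example: the rank-1 entry's listed prices ($5.00 in / $25.00 out)
print(f"${run_cost(5.00, 25.00):.2f}")  # ≈ $131.25 under these assumptions
```

Because reasoning-heavy models emit far more tokens than they consume on a benchmark like this, the output price usually dominates the bill.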
FAQ
Common questions about Humanity's Last Exam
What is Humanity's Last Exam?
Humanity's Last Exam (HLE) is a multi-modal academic benchmark of 2,500 questions spanning mathematics, the humanities, and the natural sciences, with unambiguous, verifiable solutions that test LLM capabilities at the frontier of human knowledge.

Where can I read the HLE paper?
The Humanity's Last Exam paper is available at https://arxiv.org/abs/2501.14249 and details the benchmark's methodology, dataset creation, and evaluation criteria.

How are models ranked on the leaderboard?
The leaderboard ranks 59 AI models by their HLE scores. Claude Opus 4.6 by Anthropic currently leads with a score of 0.531, and the average score across all models is 0.242; the sketch below shows how these aggregates are computed.

What is the highest HLE score?
The highest score is 0.531, achieved by Claude Opus 4.6 from Anthropic.

How many models have been evaluated?
59 models have been evaluated on Humanity's Last Exam, all with self-reported rather than verified results.

What category does HLE fall under?
HLE is categorized under math and reasoning, and it evaluates multi-modal models.
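As referenced above, the FAQ's summary statistics are simple aggregates over the score column. A minimal sketch follows; the full score table is not reproduced on this page, so only the leading entry's value (0.531) and the quoted mean (0.242) come from the figures above.

```python
# Sketch: derive the FAQ's headline numbers from a {model: score} table.
def leaderboard_stats(scores: dict[str, float]) -> tuple[str, float, float]:
    """Return (leading model, top score, mean score)."""
    leader = max(scores, key=scores.get)       # highest-scoring model
    mean = sum(scores.values()) / len(scores)  # unweighted average
    return leader, scores[leader], mean

# With the full 59-entry table this should yield
# ("Claude Opus 4.6", 0.531, 0.242) per the figures quoted above.
```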