
Humanity's Last Exam

Humanity's Last Exam (HLE) is a multi-modal academic benchmark with 2,500 questions across mathematics, the humanities, and the natural sciences, designed to test LLM capabilities at the frontier of human knowledge; every question has an unambiguous, verifiable solution.

Paper: https://arxiv.org/abs/2501.14249
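
To poke at the benchmark yourself, the questions can be pulled down with the Hugging Face `datasets` library. Below is a minimal sketch; the `cais/hle` dataset identifier and the `question`/`answer` field names are assumptions based on the public release conventions, so verify the dataset id, schema, and any access terms on Hugging Face before relying on it.

```python
# Minimal sketch: inspect HLE questions via the Hugging Face `datasets` library.
# Assumptions: the benchmark is published as "cais/hle" with a "test" split and
# "question"/"answer" fields -- verify the dataset id and schema before use.
from datasets import load_dataset

hle = load_dataset("cais/hle", split="test")
print(len(hle))  # expected: 2,500 questions

sample = hle[0]
print(sample["question"][:200])  # assumed field name
print(sample["answer"])          # assumed field name
```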

Progress Over Time

Interactive timeline showing model performance evolution on Humanity's Last Exam, tracking the state-of-the-art frontier across open and proprietary models.

Humanity's Last Exam Leaderboard

62 models

Rank | Organization              | Params | Context | Cost per 1M tokens (input / output)
   1 | Anthropic                 | –      | 1.0M    | $5.00 / $25.00
   2 | –                         | –      | 1.0M    | $2.50 / $15.00
   3 | –                         | 1.0T   | –       | –
   4 | –                         | –      | –       | –
   5 | Moonshot AI               | 1.0T   | 262K    | $0.60 / $2.50
   6 | –                         | –      | 200K    | $3.00 / $15.00
   7 | Alibaba Cloud / Qwen Team | 27B    | –       | –
   8 | Alibaba Cloud / Qwen Team | 122B   | 262K    | $0.40 / $3.20
   9 | Alibaba Cloud / Qwen Team | 35B    | 262K    | $0.25 / $2.00
  10 | –                         | –      | –       | –
  11 | –                         | –      | 1.0M    | $0.50 / $3.00
  12 | Zhipu AI                  | 358B   | 205K    | $0.60 / $2.20
  13 | –                         | –      | –       | –
  14 | OpenAI                    | –      | 1.0M    | $2.50 / $15.00
  15 | –                         | –      | –       | –
  16 | –                         | –      | 400K    | $21.00 / $168.00
  17 | OpenAI                    | –      | 400K    | $1.75 / $14.00
  18 | –                         | 685B   | –       | –
  19 | Alibaba Cloud / Qwen Team | –      | –       | –
  20 | Alibaba Cloud / Qwen Team | 397B   | 262K    | $0.60 / $3.60
  21 | –                         | –      | 400K    | $0.75 / $4.50
  22 | Google                    | 31B    | –       | –
  23 | –                         | 560B   | 128K    | $0.30 / $1.20
  24 | –                         | 685B   | –       | –
  25 | OpenAI                    | –      | 400K    | $1.25 / $10.00
  26 | –                         | –      | 400K    | $0.20 / $1.25
  27 | –                         | 120B   | 262K    | $0.10 / $0.50
  28 | –                         | 309B   | 256K    | $0.10 / $0.30
  29 | –                         | 230B   | 1.0M    | $0.30 / $1.20
  30 | –                         | –      | 1.0M    | $1.25 / $10.00
  31 | –                         | –      | 2.0M    | $0.20 / $0.50
  32 | –                         | 685B   | –       | –
  33 | Alibaba Cloud / Qwen Team | 235B   | 262K    | $0.30 / $3.00
  34 | –                         | –      | 1.0M    | $1.25 / $10.00
  35 | –                         | 671B   | 131K    | $0.50 / $2.15
  36 | –                         | 25B    | –       | –
  36 | Zhipu AI                  | 357B   | 131K    | $0.55 / $2.19
  38 | –                         | –      | 400K    | $0.25 / $2.00
  39 | –                         | –      | 1.0M    | $0.25 / $1.50
  40 | –                         | 671B   | 164K    | $0.27 / $1.00
  41 | –                         | 32B    | 262K    | $0.06 / $0.24
  42 | –                         | 117B   | 131K    | $0.09 / $0.45
  43 | OpenAI                    | –      | 200K    | $2.00 / $8.00
  43 | OpenAI                    | –      | 200K    | $1.10 / $4.40
  45 | –                         | 30B    | 128K    | $0.07 / $0.40
  45 | Zhipu AI                  | 355B   | 131K    | $0.40 / $1.60
  47 | Alibaba Cloud / Qwen Team | 236B   | 262K    | $0.45 / $3.49
  48 | MiniMax                   | 230B   | 1.0M    | $0.30 / $1.20
  49 | Sarvam AI                 | 105B   | –       | –
  50 | –                         | –      | 1.0M    | $0.30 / $2.50
Showing models 1–50 of 62 (page 1 of 2).
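
The cost column above lists prices per million input and output tokens. To get a feel for what those numbers mean in practice, here is a rough sketch of the price of a single full HLE run; the per-question token counts are illustrative assumptions, not measured values.

```python
# Rough cost estimate for one HLE run, given per-million-token pricing
# (the "Cost" column above) and assumed token counts per question.
# The token counts below are illustrative guesses, not measured values.

def run_cost(input_price: float, output_price: float,
             in_tokens_per_q: int = 1_000, out_tokens_per_q: int = 4_000,
             n_questions: int = 2_500) -> float:
    """Return the estimated USD cost of answering every HLE question."""
    total_in = in_tokens_per_q * n_questions
    total_out = out_tokens_per_q * n_questions
    return (total_in / 1e6) * input_price + (total_out / 1e6) * output_price

# Example: the rank-1 model above, priced at $5.00 / $25.00 per 1M tokens.
print(f"${run_cost(5.00, 25.00):.2f}")  # $262.50 under these assumptions
```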

FAQ

Common questions about Humanity's Last Exam

What is Humanity's Last Exam?
Humanity's Last Exam (HLE) is a multi-modal academic benchmark with 2,500 questions across mathematics, the humanities, and the natural sciences, designed to test LLM capabilities at the frontier of human knowledge; every question has an unambiguous, verifiable solution.

Where can I find the Humanity's Last Exam paper?
The paper is available at https://arxiv.org/abs/2501.14249. It details the benchmark methodology, dataset creation, and evaluation criteria.

How do models perform on Humanity's Last Exam?
The leaderboard ranks 62 AI models by their performance on this benchmark. Claude Opus 4.6 by Anthropic currently leads with a score of 0.531; the average score across all models is 0.242.

What is the highest Humanity's Last Exam score?
The highest score is 0.531, achieved by Claude Opus 4.6 from Anthropic.

How many models have been evaluated on Humanity's Last Exam?
62 models have been evaluated. All 62 results are self-reported; none have been independently verified.

What categories does Humanity's Last Exam fall under?
Humanity's Last Exam is categorized under math and reasoning, and it evaluates multimodal models.
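
For concreteness, the scores quoted above are fractions in [0, 1]. Assuming they are plain accuracy (questions answered correctly divided by the 2,500 total, which is the usual reading for this leaderboard), they map back to approximate question counts as follows:

```python
# Map a leaderboard score back to an approximate number of correct answers,
# assuming the score is plain accuracy over the 2,500 HLE questions.
N_QUESTIONS = 2_500

def questions_correct(score: float, total: int = N_QUESTIONS) -> int:
    return round(score * total)

print(questions_correct(0.531))  # ~1328 of 2,500 for the top score
print(questions_correct(0.242))  # ~605 of 2,500 at the leaderboard average
```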