
Humanity's Last Exam

Humanity's Last Exam (HLE) is a multi-modal academic benchmark of 2,500 questions spanning mathematics, the humanities, and the natural sciences, designed to test LLM capabilities at the frontier of human knowledge using questions with unambiguous, verifiable solutions.

Paper: https://arxiv.org/abs/2501.14249
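
For readers who want to look at actual items, the dataset can typically be pulled from the Hugging Face Hub. A minimal sketch follows; the dataset ID (`cais/hle`), the split name, and the field names are assumptions, not something this page confirms:

```python
# Minimal sketch: peek at a few HLE questions via the Hugging Face Hub.
# ASSUMPTIONS: the dataset ID "cais/hle", the "test" split, and the
# "question"/"answer" field names are guesses -- check the paper's repo.
from datasets import load_dataset

hle = load_dataset("cais/hle", split="test")
for row in hle.select(range(3)):   # first three questions
    print(row["question"][:200])   # truncate long prompts
    print("answer:", row["answer"])
```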

Progress Over Time

[Interactive timeline showing model performance evolution on Humanity's Last Exam. Series: state-of-the-art frontier, open, proprietary.]

Humanity's Last Exam Leaderboard

59 models • 0 verified
| Rank | Organization | Params | Context | Input $/1M tok | Output $/1M tok | License |
|---|---|---|---|---|---|---|
| 1 | | | 1.0M | $5.00 | $25.00 | |
| 2 | | | 1.0M | $2.50 | $15.00 | |
| 3 | | 1.0T | | | | |
| 4 | | | | | | |
| 5 | Moonshot AI | 1.0T | 262K | $0.60 | $2.50 | |
| 6 | | | 200K | $3.00 | $15.00 | |
| 7 | Alibaba Cloud / Qwen Team | 27B | | | | |
| 8 | Alibaba Cloud / Qwen Team | 122B | 262K | $0.40 | $3.20 | |
| 9 | Alibaba Cloud / Qwen Team | 35B | 262K | $0.25 | $2.00 | |
| 10 | | | | | | |
| 11 | | | 1.0M | $0.50 | $3.00 | |
| 12 | Zhipu AI | 358B | 205K | $0.60 | $2.20 | |
| 13 | | | | | | |
| 14 | OpenAI | | 1.0M | $2.50 | $15.00 | |
| 15 | | | | | | |
| 16 | | | 400K | $21.00 | $168.00 | |
| 17 | OpenAI | | 400K | $1.75 | $14.00 | |
| 18 | | 685B | | | | |
| 19 | Alibaba Cloud / Qwen Team | 397B | 262K | $0.60 | $3.60 | |
| 20 | | | 400K | $0.75 | $4.50 | |
| 21 | | 560B | 128K | $0.30 | $1.20 | |
| 22 | | 685B | | | | |
| 23 | OpenAI | | 400K | $1.25 | $10.00 | |
| 24 | | | 400K | $0.20 | $1.25 | |
| 25 | | 120B | 262K | $0.10 | $0.50 | |
| 26 | | 309B | 256K | $0.10 | $0.30 | |
| 27 | | 230B | 1.0M | $0.30 | $1.20 | |
| 28 | | | 1.0M | $1.25 | $10.00 | |
| 29 | | | 2.0M | $0.20 | $0.50 | |
| 30 | | 685B | | | | |
| 31 | Alibaba Cloud / Qwen Team | 235B | 262K | $0.30 | $3.00 | |
| 32 | | | 1.0M | $1.25 | $10.00 | |
| 33 | | 671B | 131K | $0.50 | $2.15 | |
| 34 | Zhipu AI | 357B | 131K | $0.55 | $2.19 | |
| 35 | | | 400K | $0.25 | $2.00 | |
| 36 | | | 1.0M | $0.25 | $1.50 | |
| 37 | | 671B | 164K | $0.27 | $1.00 | |
| 38 | | 32B | 262K | $0.06 | $0.24 | |
| 39 | | 117B | 131K | $0.09 | $0.45 | |
| 40 | OpenAI | | 200K | $2.00 | $8.00 | |
| 40 | OpenAI | | 200K | $1.10 | $4.40 | |
| 42 | Zhipu AI | 355B | 131K | $0.40 | $1.60 | |
| 42 | | 30B | 128K | $0.07 | $0.40 | |
| 44 | Alibaba Cloud / Qwen Team | 236B | 262K | $0.45 | $3.49 | |
| 45 | MiniMax | 230B | 1.0M | $0.30 | $1.20 | |
| 46 | Sarvam AI | 105B | | | | |
| 47 | | | 1.0M | $0.30 | $2.50 | |
| 48 | | 21B | 131K | $0.05 | $0.20 | |
| 49 | Zhipu AI | 106B | | | | |
| 50 | | 24B | | | | |

Showing 1-50 of 59.
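
The two Cost figures per row are input and output prices per million tokens. As a rough illustration of how those prices translate into an evaluation bill, here is a back-of-the-envelope sketch; the per-question token counts are invented assumptions, not measurements:

```python
# Back-of-the-envelope cost of one full HLE run (2,500 questions).
# ASSUMPTIONS: the per-question token counts below are illustrative only.
QUESTIONS = 2_500
IN_TOKENS_PER_Q = 600     # prompt + question text (assumed)
OUT_TOKENS_PER_Q = 2_000  # reasoning + final answer (assumed)

def run_cost(in_price_per_m: float, out_price_per_m: float) -> float:
    """Total dollars, given $/1M-token input and output prices."""
    in_cost = QUESTIONS * IN_TOKENS_PER_Q / 1e6 * in_price_per_m
    out_cost = QUESTIONS * OUT_TOKENS_PER_Q / 1e6 * out_price_per_m
    return in_cost + out_cost

# Rank-1 pricing from the table: $5.00 in / $25.00 out per 1M tokens.
print(f"${run_cost(5.00, 25.00):.2f}")  # ~$132.50 under these assumptions
```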

FAQ

Common questions about Humanity's Last Exam

Humanity's Last Exam (HLE) is a multi-modal academic benchmark of 2,500 questions spanning mathematics, the humanities, and the natural sciences, designed to test LLM capabilities at the frontier of human knowledge using questions with unambiguous, verifiable solutions.
The Humanity's Last Exam paper is available at https://arxiv.org/abs/2501.14249. This paper provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.
The Humanity's Last Exam leaderboard ranks 59 AI models based on their performance on this benchmark. Currently, Claude Opus 4.6 by Anthropic leads with a score of 0.531. The average score across all models is 0.242.
The highest Humanity's Last Exam score is 0.531, achieved by Claude Opus 4.6 from Anthropic.
59 models have been evaluated on the Humanity's Last Exam benchmark, with 0 verified results and 59 self-reported results.
Humanity's Last Exam is categorized under math and reasoning, and the benchmark evaluates multi-modal models.
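
As a minimal sketch of where the summary numbers above come from, the leader and the average are simple aggregates over per-model scores; the names and values below are placeholders, not real leaderboard rows:

```python
# Illustrative only: placeholder models and scores, not actual data.
scores = {"model-a": 0.531, "model-b": 0.410, "model-c": 0.180}

leader = max(scores, key=scores.get)       # highest-scoring model
mean = sum(scores.values()) / len(scores)  # average across all models
print(f"leader: {leader} ({scores[leader]:.3f})")
print(f"mean score: {mean:.3f}")
```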