Humanity's Last Exam
Humanity's Last Exam (HLE) is a multi-modal academic benchmark of 2,500 questions spanning mathematics, the humanities, and the natural sciences, designed to test LLM capabilities at the frontier of human knowledge with unambiguous, verifiable solutions.
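For readers who want to inspect the benchmark itself, the questions are distributed as a public dataset. Below is a minimal Python sketch for loading it locally; the Hugging Face `datasets` library and the dataset id `cais/hle` are assumptions here (access may be gated behind the dataset's terms), and the record fields named in the comments are illustrative, not confirmed by this page.

```python
# Minimal sketch: load HLE for local inspection.
# ASSUMPTIONS: the `datasets` library is installed and the dataset
# id "cais/hle" with a "test" split is correct; record fields are
# illustrative.
from datasets import load_dataset

hle = load_dataset("cais/hle", split="test")
print(len(hle))   # the benchmark contains 2,500 questions
print(hle[0])     # one record: question text, answer, any attached image
```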
Progress Over Time
[Interactive timeline of model performance on Humanity's Last Exam, plotting open and proprietary models against a state-of-the-art frontier line.]
Humanity's Last Exam Leaderboard
59 models • 0 verified
| Rank | Organization | Params | Context | Input $/1M tok | Output $/1M tok |
|---|---|---|---|---|---|
| 1 | Anthropic | — | 1.0M | $5.00 | $25.00 |
| 2 | Google | — | 1.0M | $2.50 | $15.00 |
| 3 | Moonshot AI | 1.0T | — | — | — |
| 4 | xAI | — | — | — | — |
| 5 | Moonshot AI | 1.0T | 262K | $0.60 | $2.50 |
| 6 | Anthropic | — | 200K | $3.00 | $15.00 |
| 7 | Alibaba Cloud / Qwen Team | 27B | — | — | — |
| 8 | Alibaba Cloud / Qwen Team | 122B | 262K | $0.40 | $3.20 |
| 9 | Alibaba Cloud / Qwen Team | 35B | 262K | $0.25 | $2.00 |
| 10 | Google | — | — | — | — |
| 11 | Google | — | 1.0M | $0.50 | $3.00 |
| 12 | Zhipu AI | 358B | 205K | $0.60 | $2.20 |
| 13 | xAI | — | — | — | — |
| 14 | OpenAI | — | 1.0M | $2.50 | $15.00 |
| 15 | Baidu | — | — | — | — |
| 16 | OpenAI | — | 400K | $21.00 | $168.00 |
| 17 | OpenAI | — | 400K | $1.75 | $14.00 |
| 18 | DeepSeek | 685B | — | — | — |
| 19 | Alibaba Cloud / Qwen Team | 397B | 262K | $0.60 | $3.60 |
| 20 | OpenAI | — | 400K | $0.75 | $4.50 |
| 21 | Meituan | 560B | 128K | $0.30 | $1.20 |
| 22 | DeepSeek | 685B | — | — | — |
| 23 | OpenAI | — | 400K | $1.25 | $10.00 |
| 24 | OpenAI | — | 400K | $0.20 | $1.25 |
| 25 | — | 120B | 262K | $0.10 | $0.50 |
| 26 | Xiaomi | 309B | 256K | $0.10 | $0.30 |
| 27 | MiniMax | 230B | 1.0M | $0.30 | $1.20 |
| 28 | — | — | 1.0M | $1.25 | $10.00 |
| 29 | xAI | — | 2.0M | $0.20 | $0.50 |
| 30 | DeepSeek | 685B | — | — | — |
| 31 | Alibaba Cloud / Qwen Team | 235B | 262K | $0.30 | $3.00 |
| 32 | Google | — | 1.0M | $1.25 | $10.00 |
| 33 | DeepSeek | 671B | 131K | $0.50 | $2.15 |
| 34 | Zhipu AI | 357B | 131K | $0.55 | $2.19 |
| 35 | OpenAI | — | 400K | $0.25 | $2.00 |
| 36 | Google | — | 1.0M | $0.25 | $1.50 |
| 37 | DeepSeek | 671B | 164K | $0.27 | $1.00 |
| 38 | — | 32B | 262K | $0.06 | $0.24 |
| 39 | OpenAI | 117B | 131K | $0.09 | $0.45 |
| 40 | OpenAI | — | 200K | $2.00 | $8.00 |
| 40 | OpenAI | — | 200K | $1.10 | $4.40 |
| 42 | Zhipu AI | 30B | 128K | $0.07 | $0.40 |
| 42 | Zhipu AI | 355B | 131K | $0.40 | $1.60 |
| 44 | Alibaba Cloud / Qwen Team | 236B | 262K | $0.45 | $3.49 |
| 45 | MiniMax | 230B | 1.0M | $0.30 | $1.20 |
| 46 | Sarvam AI | 105B | — | — | — |
| 47 | Google | — | 1.0M | $0.30 | $2.50 |
| 48 | OpenAI | 21B | 131K | $0.05 | $0.20 |
| 49 | Zhipu AI | 106B | — | — | — |
| 50 | Mistral AI | 24B | — | — | — |
Showing entries 1-50 of 59.
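The cost columns above pair an input price with an output price; the values match the common convention of USD per million tokens, which the following sketch assumes. The per-question token counts are illustrative guesses, not measurements, so the result is only a back-of-the-envelope estimate of what one full HLE run might cost.

```python
# Back-of-the-envelope cost estimate for one HLE run.
# ASSUMPTIONS: prices are USD per 1M input / 1M output tokens;
# the token counts per question below are illustrative guesses.
N_QUESTIONS = 2500
AVG_INPUT_TOKENS = 500    # prompt + question (assumed)
AVG_OUTPUT_TOKENS = 2000  # reasoning + answer (assumed)

def run_cost(in_price_per_m: float, out_price_per_m: float) -> float:
    input_cost = N_QUESTIONS * AVG_INPUT_TOKENS / 1e6 * in_price_per_m
    output_cost = N_QUESTIONS * AVG_OUTPUT_TOKENS / 1e6 * out_price_per_m
    return input_cost + output_cost

# Example: the rank-1 entry's listed prices ($5.00 in / $25.00 out)
print(f"${run_cost(5.00, 25.00):.2f}")  # ≈ $131.25 under these assumptions
```

Because reasoning-heavy models emit far more tokens than they consume on a benchmark like this, the output price usually dominates the bill.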
FAQ
Common questions about Humanity's Last Exam
What is Humanity's Last Exam?
Humanity's Last Exam (HLE) is a multi-modal academic benchmark of 2,500 questions spanning mathematics, the humanities, and the natural sciences, with unambiguous, verifiable solutions that test LLM capabilities at the frontier of human knowledge.

Where can I read the HLE paper?
The Humanity's Last Exam paper is available at https://arxiv.org/abs/2501.14249 and details the benchmark's methodology, dataset creation, and evaluation criteria.

How are models ranked on the leaderboard?
The leaderboard ranks 59 AI models by their HLE scores. Claude Opus 4.6 by Anthropic currently leads with a score of 0.531, and the average score across all models is 0.242; the sketch below shows how these aggregates are computed.

What is the highest HLE score?
The highest score is 0.531, achieved by Claude Opus 4.6 from Anthropic.

How many models have been evaluated?
59 models have been evaluated on Humanity's Last Exam, all with self-reported rather than verified results.

What category does HLE fall under?
HLE is categorized under math and reasoning, and it evaluates multi-modal models.
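As referenced above, the FAQ's summary statistics are simple aggregates over the score column. A minimal sketch follows; the full score table is not reproduced on this page, so only the leading entry's value (0.531) and the quoted mean (0.242) come from the figures above.

```python
# Sketch: derive the FAQ's headline numbers from a {model: score} table.
def leaderboard_stats(scores: dict[str, float]) -> tuple[str, float, float]:
    """Return (leading model, top score, mean score)."""
    leader = max(scores, key=scores.get)       # highest-scoring model
    mean = sum(scores.values()) / len(scores)  # unweighted average
    return leader, scores[leader], mean

# With the full 59-entry table this should yield
# ("Claude Opus 4.6", 0.531, 0.242) per the figures quoted above.
```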