
Humanity's Last Exam

Humanity's Last Exam (HLE) is a multi-modal academic benchmark with 2,500 questions across mathematics, the humanities, and the natural sciences, designed to test LLM capabilities at the frontier of human knowledge; every question has an unambiguous, verifiable solution.

Paper: https://arxiv.org/abs/2501.14249
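
To poke at the benchmark yourself, the questions can be pulled down with the Hugging Face `datasets` library. Below is a minimal sketch; the `cais/hle` dataset identifier and the `question`/`answer` field names are assumptions based on the public release conventions, so verify the dataset id, schema, and any access terms on Hugging Face before relying on it.

```python
# Minimal sketch: inspect HLE questions via the Hugging Face `datasets` library.
# Assumptions: the benchmark is published as "cais/hle" with a "test" split and
# "question"/"answer" fields -- verify the dataset id and schema before use.
from datasets import load_dataset

hle = load_dataset("cais/hle", split="test")
print(len(hle))  # expected: 2,500 questions

sample = hle[0]
print(sample["question"][:200])  # assumed field name
print(sample["answer"])          # assumed field name
```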

Progress Over Time

Interactive timeline showing model performance evolution on Humanity's Last Exam, tracking the state-of-the-art frontier across open and proprietary models.

Humanity's Last Exam Leaderboard

62 models

Rank | Organization              | Params | Context | Cost per 1M tokens (input / output)
   1 | Anthropic                 | –      | 1.0M    | $5.00 / $25.00
   2 | –                         | –      | 1.0M    | $2.50 / $15.00
   3 | –                         | 1.0T   | –       | –
   4 | –                         | –      | –       | –
   5 | Moonshot AI               | 1.0T   | 262K    | $0.60 / $2.50
   6 | –                         | –      | 200K    | $3.00 / $15.00
   7 | Alibaba Cloud / Qwen Team | 27B    | –       | –
   8 | Alibaba Cloud / Qwen Team | 122B   | 262K    | $0.40 / $3.20
   9 | Alibaba Cloud / Qwen Team | 35B    | 262K    | $0.25 / $2.00
  10 | –                         | –      | –       | –
  11 | –                         | –      | 1.0M    | $0.50 / $3.00
  12 | Zhipu AI                  | 358B   | 205K    | $0.60 / $2.20
  13 | –                         | –      | –       | –
  14 | OpenAI                    | –      | 1.0M    | $2.50 / $15.00
  15 | –                         | –      | –       | –
  16 | –                         | –      | 400K    | $21.00 / $168.00
  17 | OpenAI                    | –      | 400K    | $1.75 / $14.00
  18 | –                         | 685B   | –       | –
  19 | Alibaba Cloud / Qwen Team | –      | –       | –
  20 | Alibaba Cloud / Qwen Team | 397B   | 262K    | $0.60 / $3.60
  21 | –                         | –      | 400K    | $0.75 / $4.50
  22 | Google                    | 31B    | –       | –
  23 | –                         | 560B   | 128K    | $0.30 / $1.20
  24 | –                         | 685B   | –       | –
  25 | OpenAI                    | –      | 400K    | $1.25 / $10.00
  26 | –                         | –      | 400K    | $0.20 / $1.25
  27 | –                         | 120B   | 262K    | $0.10 / $0.50
  28 | –                         | 309B   | 256K    | $0.10 / $0.30
  29 | –                         | 230B   | 1.0M    | $0.30 / $1.20
  30 | –                         | –      | 1.0M    | $1.25 / $10.00
  31 | –                         | –      | 2.0M    | $0.20 / $0.50
  32 | –                         | 685B   | –       | –
  33 | Alibaba Cloud / Qwen Team | 235B   | 262K    | $0.30 / $3.00
  34 | –                         | –      | 1.0M    | $1.25 / $10.00
  35 | –                         | 671B   | 131K    | $0.50 / $2.15
  36 | –                         | 25B    | –       | –
  36 | Zhipu AI                  | 357B   | 131K    | $0.55 / $2.19
  38 | –                         | –      | 400K    | $0.25 / $2.00
  39 | –                         | –      | 1.0M    | $0.25 / $1.50
  40 | –                         | 671B   | 164K    | $0.27 / $1.00
  41 | –                         | 32B    | 262K    | $0.06 / $0.24
  42 | –                         | 117B   | 131K    | $0.09 / $0.45
  43 | OpenAI                    | –      | 200K    | $2.00 / $8.00
  43 | OpenAI                    | –      | 200K    | $1.10 / $4.40
  45 | –                         | 30B    | 128K    | $0.07 / $0.40
  45 | Zhipu AI                  | 355B   | 131K    | $0.40 / $1.60
  47 | Alibaba Cloud / Qwen Team | 236B   | 262K    | $0.45 / $3.49
  48 | MiniMax                   | 230B   | 1.0M    | $0.30 / $1.20
  49 | Sarvam AI                 | 105B   | –       | –
  50 | –                         | –      | 1.0M    | $0.30 / $2.50
Showing models 1–50 of 62 (page 1 of 2).
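
The cost column above lists prices per million input and output tokens. To get a feel for what those numbers mean in practice, here is a rough sketch of the price of a single full HLE run; the per-question token counts are illustrative assumptions, not measured values.

```python
# Rough cost estimate for one HLE run, given per-million-token pricing
# (the "Cost" column above) and assumed token counts per question.
# The token counts below are illustrative guesses, not measured values.

def run_cost(input_price: float, output_price: float,
             in_tokens_per_q: int = 1_000, out_tokens_per_q: int = 4_000,
             n_questions: int = 2_500) -> float:
    """Return the estimated USD cost of answering every HLE question."""
    total_in = in_tokens_per_q * n_questions
    total_out = out_tokens_per_q * n_questions
    return (total_in / 1e6) * input_price + (total_out / 1e6) * output_price

# Example: the rank-1 model above, priced at $5.00 / $25.00 per 1M tokens.
print(f"${run_cost(5.00, 25.00):.2f}")  # $262.50 under these assumptions
```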

FAQ

Common questions about Humanity's Last Exam

What is Humanity's Last Exam?
Humanity's Last Exam (HLE) is a multi-modal academic benchmark with 2,500 questions across mathematics, the humanities, and the natural sciences, designed to test LLM capabilities at the frontier of human knowledge; every question has an unambiguous, verifiable solution.

Where can I find the Humanity's Last Exam paper?
The paper is available at https://arxiv.org/abs/2501.14249. It details the benchmark methodology, dataset creation, and evaluation criteria.

How do models perform on Humanity's Last Exam?
The leaderboard ranks 62 AI models by their performance on this benchmark. Claude Opus 4.6 by Anthropic currently leads with a score of 0.531; the average score across all models is 0.242.

What is the highest Humanity's Last Exam score?
The highest score is 0.531, achieved by Claude Opus 4.6 from Anthropic.

How many models have been evaluated on Humanity's Last Exam?
62 models have been evaluated. All 62 results are self-reported; none have been independently verified.

What categories does Humanity's Last Exam fall under?
Humanity's Last Exam is categorized under math and reasoning, and it evaluates multimodal models.
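
For concreteness, the scores quoted above are fractions in [0, 1]. Assuming they are plain accuracy (questions answered correctly divided by the 2,500 total, which is the usual reading for this leaderboard), they map back to approximate question counts as follows:

```python
# Map a leaderboard score back to an approximate number of correct answers,
# assuming the score is plain accuracy over the 2,500 HLE questions.
N_QUESTIONS = 2_500

def questions_correct(score: float, total: int = N_QUESTIONS) -> int:
    return round(score * total)

print(questions_correct(0.531))  # ~1328 of 2,500 for the top score
print(questions_correct(0.242))  # ~605 of 2,500 at the leaderboard average
```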