Humanity's Last Exam

Progress Over Time

Interactive timeline of model performance on Humanity's Last Exam (legend: state-of-the-art frontier, open, proprietary).

Humanity's Last Exam Leaderboard

88 models
Rank  Organization               Parameters  Context  Cost
1     -                          -           -        -
2     -                          -           -        -
3     -                          -           -        -
4     -                          -           1.0M     $5.00 / $25.00
5     Anthropic                  -           1.0M     $3.00 / $15.00
6     -                          -           -        -
7     ByteDance                  -           -        -
8     -                          -           1.0M     $5.00 / $25.00
8     Zhipu AI                   753B        1.0M     $0.95 / $3.00
10    ByteDance                  -           -        -
11    -                          -           1.0M     $5.00 / $25.00
12    Zhipu AI                   754B        200K     $1.40 / $4.40
13    OpenAI                     -           1.1M     $5.00 / $30.00
14    -                          -           1.0M     $2.50 / $15.00
15    -                          1.0T        -        -
16    -                          -           -        -
17    Moonshot AI                1.0T        -        -
18    -                          -           200K     $3.00 / $15.00
19    Alibaba Cloud / Qwen Team  27B         262K     $0.30 / $2.40
20    -                          1.6T        1.0M     $1.60 / $3.20
21    Alibaba Cloud / Qwen Team  122B        -        -
22    Alibaba Cloud / Qwen Team  35B         -        -
23    -                          -           -        -
24    -                          284B        1.0M     $0.10 / $0.20
25    -                          -           1.0M     $0.50 / $3.00
26    Zhipu AI                   358B        -        -
27    Alibaba Cloud / Qwen Team  -           1.0M     $1.25 / $3.75
28    -                          685B        -        -
29    -                          -           1.0M     $1.50 / $9.00
30    -                          -           -        -
31    OpenAI                     -           1.0M     $2.50 / $15.00
32    -                          -           -        -
33    -                          550B        -        -
34    -                          -           -        -
35    Moonshot AI                1.0T        262K     $0.75 / $3.50
36    Alibaba Cloud / Qwen Team  -           1.0M     $0.32 / $1.28
37    OpenAI                     -           400K     $1.75 / $14.00
38    -                          1.0T        1.0M     $0.43 / $0.87
39    -                          685B        -        -
40    Alibaba Cloud / Qwen Team  -           1.0M     $0.50 / $3.00
41    Alibaba Cloud / Qwen Team  397B        -        -
42    -                          -           400K     $0.75 / $4.50
43    -                          31B         262K     $0.13 / $0.38
44    -                          560B        -        -
45    -                          685B        -        -
46    OpenAI                     -           -        -
47    -                          -           400K     $0.20 / $1.25
48    Alibaba Cloud / Qwen Team  28B         262K     $0.60 / $3.60
49    -                          120B        -        -
50    -                          309B        -        -

Showing 1–50 of 88.
About this benchmark

What is Humanity's Last Exam?

Humanity's Last Exam (HLE) is a multi-modal academic benchmark with 2,500 questions across mathematics, humanities, and natural sciences, designed to test LLM capabilities at the frontier of human knowledge with unambiguous, verifiable solutions.

Humanity's Last Exam is a multimodal benchmark evaluating models on math, reasoning, and vision tasks. LLM Stats tracks 88 models on this benchmark, scored on a 0–1 scale. The current average is 0.298, and the leader scores 0.647.
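
The average and leader figures are simple aggregates over per-model scores. As a rough sketch, the snippet below computes them for only the four models this page lists by name under "Current leaders", so its mean is not the leaderboard-wide average over all 88 models. The `scores` dictionary and the `summarize` helper are illustrative names, not part of any LLM Stats tooling.

```python
from statistics import mean

# Scores (0-1 scale) for the four models named on this page; the other 84 are omitted.
scores = {
    "Claude Mythos Preview": 0.647,
    "Claude Fable 5": 0.645,
    "Muse Spark": 0.584,
    "GLM-5.2": 0.547,
}


def summarize(scores: dict[str, float]) -> tuple[str, float, float]:
    """Return (leader name, leader score, mean score)."""
    leader = max(scores, key=scores.get)
    return leader, scores[leader], mean(scores.values())


leader, top, avg = summarize(scores)
print(f"leader: {leader} at {top:.3f} ({top:.1%})")
print(f"mean of these four models: {avg:.3f}")
```

The same value can be shown either on the 0–1 scale or as a percentage, which is why 0.647 and 64.7% both appear on this page.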


Current leaders

Claude Mythos Preview from Anthropic currently leads the Humanity's Last Exam leaderboard with a score of 0.647 across 88 evaluated AI models.

1. Claude Mythos Preview (Anthropic): 64.7%
2. Claude Fable 5 (Anthropic): 64.5%
3. Muse Spark (Meta): 58.4%
Top open-weight model: GLM-5.2 (rank #8), 54.7%

Source paper

Title: Humanity's Last Exam
Authors: Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, and 1118 others
Published
Abstract

Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity's Last Exam (HLE), a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage. HLE consists of 2,500 questions across dozens of subjects, including mathematics, humanities, and the natural sciences. HLE is developed globally by subject-matter experts and consists of multiple-choice and short-answer questions suitable for automated grading. Each question has a known solution that is unambiguous and easily verifiable, but cannot be quickly answered via internet retrieval. State-of-the-art LLMs demonstrate low accuracy and calibration on HLE, highlighting a significant gap between current LLM capabilities and the expert human frontier on closed-ended academic questions. To inform research and policymaking upon a clear understanding of model capabilities, we publicly release HLE at https://lastexam.ai.
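
The abstract mentions calibration alongside accuracy. One common way to quantify calibration, shown here as a sketch and not necessarily the exact metric the paper reports, is a binned calibration error: group answers by the model's stated confidence and compare each bin's mean confidence with its accuracy. The function name, the bin count, and the sample data are assumptions made for illustration.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Binned ECE: weighted mean of |accuracy - mean confidence| per bin."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and the same length")
    bins: list[list[int]] = [[] for _ in range(n_bins)]
    for i, c in enumerate(confidences):
        # Clamp so that a confidence of exactly 1.0 lands in the last bin.
        bins[min(int(c * n_bins), n_bins - 1)].append(i)
    total = len(confidences)
    ece = 0.0
    for idx in bins:
        if not idx:
            continue  # empty bins contribute nothing
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        acc = sum(correct[i] for i in idx) / len(idx)
        ece += (len(idx) / total) * abs(acc - avg_conf)
    return ece


# Hypothetical data: a model that states 90% confidence but is right one time in four.
conf = [0.9, 0.9, 0.9, 0.9]
right = [True, False, False, False]
print(round(expected_calibration_error(conf, right), 3))
```

A value near 0 means stated confidence tracks actual accuracy; a large value, as in this overconfident example, means the two diverge.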

FAQ

Common questions about the Humanity's Last Exam benchmark and leaderboard.

What is the Humanity's Last Exam benchmark?

Humanity's Last Exam (HLE) is a multi-modal academic benchmark with 2,500 questions across mathematics, humanities, and natural sciences, designed to test LLM capabilities at the frontier of human knowledge with unambiguous, verifiable solutions.

What is the Humanity's Last Exam leaderboard?

The Humanity's Last Exam leaderboard ranks 88 AI models based on their performance on this benchmark. Currently, Claude Mythos Preview by Anthropic leads with a score of 0.647. The average score across all models is 0.298.

What is the highest Humanity's Last Exam score?

The highest Humanity's Last Exam score is 0.647, achieved by Claude Mythos Preview from Anthropic.

How many models are evaluated on Humanity's Last Exam?

88 models have been evaluated on the Humanity's Last Exam benchmark, with 0 verified results and 88 self-reported results.

Where can I find the Humanity's Last Exam paper?

The Humanity's Last Exam paper is available at https://arxiv.org/abs/2501.14249. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does Humanity's Last Exam cover?

Humanity's Last Exam is categorized under math, reasoning, and vision. The benchmark evaluates multimodal models.

What is the best open-source model on Humanity's Last Exam?

GLM-5.2 by Zhipu AI is the top-ranked open-source model on Humanity's Last Exam, with a score of 0.547 (rank #8).

How is Humanity's Last Exam scored?

Humanity's Last Exam is scored by accuracy, reported on a 0–1 scale. Higher scores indicate better performance.
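
As a minimal sketch of accuracy on a 0–1 scale, the snippet below assumes each question has already been graded as correct or incorrect. The function name `accuracy` and the sample grades are hypothetical and are not taken from the benchmark's own tooling.

```python
from typing import Sequence


def accuracy(graded: Sequence[bool]) -> float:
    """Return the fraction of correct answers, a value between 0 and 1."""
    if not graded:
        raise ValueError("need at least one graded answer")
    return sum(graded) / len(graded)


# Hypothetical grades for five questions (True = judged correct).
sample_grades = [True, False, True, True, False]
score = accuracy(sample_grades)
print(f"{score:.3f}")  # 0-1 scale
print(f"{score:.1%}")  # same value as a percentage
```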

How recent are the Humanity's Last Exam leaderboard results?

The Humanity's Last Exam leaderboard was last updated in July 2026 and currently includes 88 evaluated models.