What is the Humanity's Last Exam leaderboard?

The Humanity's Last Exam leaderboard ranks 77 AI models based on their performance on this benchmark. Currently, Claude Mythos Preview by Anthropic leads with a score of 0.647. The average score across all models is 0.285.

What is the highest Humanity's Last Exam score?

The highest Humanity's Last Exam score is 0.647, achieved by Claude Mythos Preview from Anthropic.

How many models are evaluated on Humanity's Last Exam?

77 models have been evaluated on the Humanity's Last Exam benchmark. All 77 results are self-reported; none have been independently verified.

Where can I find the Humanity's Last Exam paper?

The Humanity's Last Exam paper is available at https://arxiv.org/abs/2501.14249. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does Humanity's Last Exam cover?

Humanity's Last Exam is categorized under math, reasoning, and vision. Because the benchmark is multimodal, it evaluates models that can handle both text and images.

What is the best open-source model on Humanity's Last Exam?

GLM-5.1 by Zhipu AI is the top-ranked open-source model on Humanity's Last Exam, with a score of 0.523 (rank #7).

How is Humanity's Last Exam scored?

Humanity's Last Exam is scored by accuracy, the fraction of questions a model answers correctly, reported on a 0–1 scale. Higher scores indicate better performance.
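As a rough illustration of what an accuracy score on a 0–1 scale means, the sketch below computes the fraction of correct answers. It is a minimal sketch assuming simple normalized exact-match grading; the `accuracy` function and its whitespace/case normalization are hypothetical and are not the benchmark's official grading procedure.

```python
def accuracy(predictions: list[str], answers: list[str]) -> float:
    """Return the fraction of predictions matching the reference answers (0-1 scale).

    Hypothetical grading rule: exact match after trimming whitespace and
    ignoring case. The real benchmark's grading may differ.
    """
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot compute accuracy on an empty answer set")
    # Count how many normalized predictions equal their normalized reference.
    correct = sum(
        p.strip().lower() == a.strip().lower()
        for p, a in zip(predictions, answers)
    )
    return correct / len(answers)


# Two of three answers match, so accuracy is 2/3.
print(accuracy(["42", "Paris", "x"], ["42", "paris", "y"]))
```

Read this way, a leaderboard score of 0.647 corresponds to answering roughly 64.7% of the evaluated questions correctly.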

How recent are the Humanity's Last Exam leaderboard results?

The Humanity's Last Exam leaderboard was last updated in May 2026 and currently includes 77 evaluated models.


Humanity's Last Exam

Humanity's Last Exam (HLE) is a multimodal academic benchmark of 2,500 questions spanning mathematics, the humanities, and the natural sciences. It is designed to test LLM capabilities at the frontier of human knowledge, and each question has an unambiguous, verifiable answer.

Claude Mythos Preview from Anthropic currently leads the Humanity's Last Exam leaderboard of 77 evaluated AI models with a score of 0.647.
