AGIEval

Progress Over Time

[Interactive timeline: model performance on AGIEval over time, showing the state-of-the-art frontier for open and proprietary models]

AGIEval Leaderboard

10 models

Rank  Parameters  Organization
1     24B
2     14B
3     8B
4     70B         Nous Research
5     27B
6     9B
7     3B
8     8B
9     8B
10    21B

About this benchmark

What is AGIEval?

AGIEval is a human-centric benchmark that evaluates foundation models on standardized exams, including college entrance exams (Gaokao, SAT), law school admission tests (LSAT), math competitions, lawyer qualification tests, and civil service exams. It contains 20 tasks (18 multiple-choice, 2 cloze) designed to assess understanding, knowledge, reasoning, and calculation abilities in real-world academic and professional contexts.
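As a rough illustration of how multiple-choice and cloze tasks like these can be scored, the sketch below computes exact-match accuracy after light normalization. This is an assumption made for illustration, not the paper's evaluation code: the function names and the sample answers are hypothetical, and the official repository may normalize or match answers differently.

```python
def normalize(answer: str) -> str:
    """Strip whitespace and lowercase so trivial formatting differences are ignored."""
    return answer.strip().lower()


def accuracy(predictions: list[str], references: list[str]) -> float:
    """Return the fraction of predictions that exactly match their reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot score an empty task")
    correct = sum(
        normalize(pred) == normalize(ref)
        for pred, ref in zip(predictions, references)
    )
    return correct / len(references)


# Hypothetical multiple-choice task: answers are option letters.
mc_predictions = ["A", "c", "B", "D"]
mc_references = ["A", "C", "D", "D"]

# Hypothetical cloze task: answers are free-form strings.
cloze_predictions = ["42", " 3/4 "]
cloze_references = ["42", "0.75"]

mc_score = accuracy(mc_predictions, mc_references)           # 3 of 4 correct
cloze_score = accuracy(cloze_predictions, cloze_references)  # 1 of 2 correct
```

Note that plain exact match treats "3/4" and "0.75" as different answers; a real cloze scorer would likely need some form of numeric equivalence checking.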

AGIEval is a text benchmark evaluating models on legal, math, reasoning, and general tasks. LLM Stats tracks 10 models on this benchmark, scored on a 0–1 scale. The current average is 0.531, with the leader at 0.658.

Current leaders

Mistral Small 3 24B Base from Mistral AI currently leads the AGIEval leaderboard with a score of 0.658 across 10 evaluated AI models.

Source paper

Title
AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models
Authors
Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, and 5 others
Published
Abstract

Evaluating the general abilities of foundation models to tackle human-level tasks is a vital aspect of their development and application in the pursuit of Artificial General Intelligence (AGI). Traditional benchmarks, which rely on artificial datasets, may not accurately represent human-level capabilities. In this paper, we introduce AGIEval, a novel benchmark specifically designed to assess foundation model in the context of human-centric standardized exams, such as college entrance exams, law school admission tests, math competitions, and lawyer qualification tests. We evaluate several state-of-the-art foundation models, including GPT-4, ChatGPT, and Text-Davinci-003, using this benchmark. Impressively, GPT-4 surpasses average human performance on SAT, LSAT, and math competitions, attaining a 95% accuracy rate on the SAT Math test and a 92.5% accuracy on the English test of the Chinese national college entrance exam. This demonstrates the extraordinary performance of contemporary foundation models. In contrast, we also find that GPT-4 is less proficient in tasks that require complex reasoning or specific domain knowledge. Our comprehensive analyses of model capabilities (understanding, knowledge, reasoning, and calculation) reveal these models' strengths and limitations, providing valuable insights into future directions for enhancing their general capabilities. By concentrating on tasks pertinent to human cognition and decision-making, our benchmark delivers a more meaningful and robust evaluation of foundation models' performance in real-world scenarios. The data, code, and all model outputs are released in https://github.com/ruixiangcui/AGIEval.

FAQ

Common questions about the AGIEval benchmark and leaderboard.

What is the AGIEval leaderboard?

The AGIEval leaderboard ranks 10 AI models based on their performance on this benchmark. Currently, Mistral Small 3 24B Base by Mistral AI leads with a score of 0.658. The average score across all models is 0.531.
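The ranking and average described here are simple aggregates over per-model scores. The following minimal sketch shows one way to derive them; the model names and scores are made-up placeholders, not actual leaderboard entries, and `summarize` is a hypothetical helper.

```python
from statistics import mean


def summarize(scores: dict[str, float]) -> dict[str, object]:
    """Rank models by score (highest first) and compute the leaderboard average."""
    if not scores:
        raise ValueError("at least one model score is required")
    ranking = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    leader_name, leader_score = ranking[0]
    return {
        "ranking": [name for name, _ in ranking],
        "leader": leader_name,
        "leader_score": leader_score,
        "average": round(mean(scores.values()), 3),
    }


# Hypothetical scores on a 0-1 scale (placeholders, not real results).
example_scores = {"model-a": 0.60, "model-b": 0.45, "model-c": 0.30}
summary = summarize(example_scores)
```

Here `summary["leader"]` is `"model-a"` and `summary["average"]` is `0.45`.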

What is the highest AGIEval score?

The highest AGIEval score is 0.658, achieved by Mistral Small 3 24B Base from Mistral AI.

How many models are evaluated on AGIEval?

10 models have been evaluated on the AGIEval benchmark. All 10 results are self-reported; none are verified.

Where can I find the AGIEval paper?

The AGIEval paper is available at https://arxiv.org/abs/2304.06364. The paper details the methodology, dataset construction, and evaluation criteria.

Where can I find the AGIEval dataset?

The AGIEval dataset is available at https://github.com/ruixiangcui/AGIEval.

What categories does AGIEval cover?

AGIEval is categorized under legal, math, reasoning, and general. The benchmark evaluates text models.

What is the best open-source model on AGIEval?

Mistral Small 3 24B Base by Mistral AI is the top-ranked open-source model on AGIEval, with a score of 0.658 (rank #1).

How recent are the AGIEval leaderboard results?

The AGIEval leaderboard was last updated in July 2026 and currently includes 10 evaluated models.