AGIEval

A human-centric benchmark for evaluating foundation models on standardized exams including college entrance exams (Gaokao, SAT), law school admission tests (LSAT), math competitions, lawyer qualification tests, and civil service exams. Contains 20 tasks (18 multiple-choice, 2 cloze) designed to assess understanding, knowledge, reasoning, and calculation abilities in real-world academic and professional contexts.
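Scores on benchmarks of this kind are typically reported as per-task accuracy averaged across tasks. The sketch below illustrates that aggregation; it is a minimal, hypothetical example rather than the official AGIEval evaluation harness, and the task names, answer formats, and function names are assumptions.

```python
# Illustrative scoring sketch; not the official AGIEval evaluation code.
# Assumes each task is graded by exact-match accuracy and the benchmark
# score is the unweighted mean over tasks.

def task_accuracy(predictions, golds):
    """Exact-match accuracy for a single task."""
    assert len(predictions) == len(golds) and golds
    correct = sum(p.strip().upper() == g.strip().upper()
                  for p, g in zip(predictions, golds))
    return correct / len(golds)

def benchmark_score(per_task_results):
    """Unweighted average of per-task accuracies."""
    accuracies = [task_accuracy(preds, golds)
                  for preds, golds in per_task_results.values()]
    return sum(accuracies) / len(accuracies)

# Toy example with two hypothetical tasks.
results = {
    "sat-math": (["B", "C"], ["B", "D"]),            # accuracy 0.5
    "lsat-lr":  (["A", "A", "E"], ["A", "A", "E"]),  # accuracy 1.0
}
print(benchmark_score(results))  # 0.75
```

Cloze tasks, which expect a short free-form answer rather than a choice letter, would typically need string normalization before the exact-match comparison.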


Progress Over Time

Interactive timeline showing model performance evolution on AGIEval. The chart traces the state-of-the-art frontier and distinguishes open from proprietary models.

AGIEval Leaderboard

The leaderboard lists 10 models, with each model's context window, cost, and license shown alongside its AGIEval score.

FAQ

Common questions about AGIEval

What is AGIEval?
AGIEval is a human-centric benchmark for evaluating foundation models on standardized exams including college entrance exams (Gaokao, SAT), law school admission tests (LSAT), math competitions, lawyer qualification tests, and civil service exams. It contains 20 tasks (18 multiple-choice, 2 cloze) designed to assess understanding, knowledge, reasoning, and calculation abilities in real-world academic and professional contexts.

Where can I find the AGIEval paper?
The AGIEval paper is available at https://arxiv.org/abs/2304.06364. It provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.

Where can I find the AGIEval dataset?
The AGIEval dataset is available at https://github.com/ruixiangcui/AGIEval.
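For illustration, the snippet below sketches one way to read a task file from that repository, assuming the tasks are distributed as JSON Lines files with fields such as question, options, and label. The file name and field names here are assumptions, not a documented schema, so check the repository for the exact layout.

```python
# Hypothetical loader for one AGIEval task file. The path and field names
# ("question", "options", "label") are assumptions about a typical JSONL
# layout, not a documented schema; check the repository for exact names.
import json

def load_task(path):
    """Read a task stored as JSON Lines: one JSON object per line."""
    examples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                examples.append(json.loads(line))
    return examples

examples = load_task("sat-math.jsonl")  # hypothetical file name
first = examples[0]
print(first.get("question"), first.get("options"), first.get("label"))
```

Each parsed example can then be formatted into a prompt for the model under evaluation and scored against its gold label.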
Which model performs best on AGIEval?
The AGIEval leaderboard ranks 10 AI models based on their performance on this benchmark. Currently, Mistral Small 3 24B Base by Mistral AI leads with a score of 0.658. The average score across all models is 0.531.

What is the highest AGIEval score?
The highest AGIEval score is 0.658, achieved by Mistral Small 3 24B Base from Mistral AI.

How many models have been evaluated on AGIEval?
A total of 10 models have been evaluated on the AGIEval benchmark, with 0 verified results and 10 self-reported results.

What categories does AGIEval belong to?
AGIEval is categorized under general, legal, math, and reasoning. The benchmark evaluates text models.