AGIEval
A human-centric benchmark for evaluating foundation models on standardized exams including college entrance exams (Gaokao, SAT), law school admission tests (LSAT), math competitions, lawyer qualification tests, and civil service exams. Contains 20 tasks (18 multiple-choice, 2 cloze) designed to assess understanding, knowledge, reasoning, and calculation abilities in real-world academic and professional contexts.
Progress Over Time
[Interactive timeline showing model performance evolution on AGIEval, with a state-of-the-art frontier line and open vs. proprietary models distinguished.]
AGIEval Leaderboard
10 models
| Rank | Organization | Parameters | Context | Cost (input / output) | License |
|---|---|---|---|---|---|
| 1 | Mistral AI | 24B | — | — | — |
| 2 | Mistral AI | 14B | — | — | — |
| 3 | Mistral AI | 8B | — | — | — |
| 4 | Nous Research | 70B | — | — | — |
| 5 | Google | 27B | — | — | — |
| 6 | Google | 9B | — | — | — |
| 7 | Mistral AI | 3B | — | — | — |
| 8 | — | 8B | — | — | — |
| 9 | Mistral AI | 8B | 128K | $0.10 / $0.10 | — |
| 10 | Baidu | 21B | 128K | $0.40 / $4.00 | — |
FAQ
Common questions about AGIEval
What is AGIEval?
A human-centric benchmark for evaluating foundation models on standardized exams including college entrance exams (Gaokao, SAT), law school admission tests (LSAT), math competitions, lawyer qualification tests, and civil service exams. Contains 20 tasks (18 multiple-choice, 2 cloze) designed to assess understanding, knowledge, reasoning, and calculation abilities in real-world academic and professional contexts.
Where can I read the AGIEval paper?
The AGIEval paper is available at https://arxiv.org/abs/2304.06364. It details the benchmark's methodology, dataset construction, and evaluation criteria.
Where is the AGIEval dataset hosted?
The AGIEval dataset is available at https://github.com/ruixiangcui/AGIEval.
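The multiple-choice portion of the benchmark can be scored with a simple accuracy calculation. Below is a minimal sketch; the field names ("question", "options", "label") are assumptions based on the JSONL files in the dataset repository, and the two records are illustrative rather than taken from the real data.

```python
import json

# Illustrative AGIEval-style multiple-choice records in JSONL form.
# Field names are assumed from the dataset repository, not guaranteed.
sample_jsonl = """\
{"question": "2 + 2 = ?", "options": ["(A)3", "(B)4", "(C)5", "(D)6"], "label": "B"}
{"question": "Which is prime?", "options": ["(A)9", "(B)15", "(C)7", "(D)21"], "label": "C"}
"""

def accuracy(records, predict):
    """Fraction of records where predict(record) returns the gold label."""
    hits = sum(1 for rec in records if predict(rec) == rec["label"])
    return hits / len(records)

records = [json.loads(line) for line in sample_jsonl.splitlines()]

# Trivial baseline that always answers (A); a real evaluation would prompt a
# language model with the question and options and parse its chosen letter.
always_a = lambda rec: "A"
print(accuracy(records, always_a))  # prints 0.0 for these two records
```

A real harness would read each task's JSONL file, generate one prediction per question, and report per-task accuracy alongside the overall average, which is roughly what the leaderboard scores above summarize.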
Which model performs best on AGIEval?
The AGIEval leaderboard ranks 10 AI models by their performance on this benchmark. Currently, Mistral Small 3 24B Base by Mistral AI leads with a score of 0.658. The average score across all models is 0.531.
What is the highest AGIEval score?
The highest AGIEval score is 0.658, achieved by Mistral Small 3 24B Base from Mistral AI.
How many models have been evaluated on AGIEval?
10 models have been evaluated on the AGIEval benchmark; all 10 results are self-reported, and none have been independently verified.
What categories does AGIEval cover?
AGIEval is categorized under general, legal, math, and reasoning, and it evaluates text models.