SAT Math

SAT Math Leaderboard

1 model
1. OpenAI
About this benchmark

What is SAT Math?

SAT Math is a benchmark from AGIEval containing standardized mathematics questions from the College Board SAT examination. It is designed to evaluate the mathematical reasoning capabilities of foundation models using human-centric assessment methods.

SAT Math is a text benchmark evaluating models on math and reasoning tasks. LLM Stats tracks 1 model on this benchmark, scored on a 0–1 scale. With a single entry, the current average and the leading score coincide at 0.890 (0.9 when rounded to one decimal place).
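As a rough illustration of how the summary figures relate to one another, the sketch below computes the leader, the average, and the percentage form of a score from a small in-memory table. The `scores` dictionary and the `summarize` helper are hypothetical names introduced here for illustration only; they are not part of LLM Stats or AGIEval, and the single entry simply mirrors the score reported on this page.

```python
from statistics import mean

# Hypothetical in-memory leaderboard (assumption): one entry, taken from this page.
scores = {"GPT-4": 0.890}

def summarize(scores):
    """Return the leader, its score, the mean score, and the model count."""
    leader = max(scores, key=scores.get)  # model with the highest 0-1 score
    return {
        "leader": leader,
        "leader_score": scores[leader],
        "average": mean(list(scores.values())),
        "count": len(scores),
    }

summary = summarize(scores)

# A 0-1 score maps to a percentage by multiplying by 100.
print(f"{summary['leader']}: {summary['leader_score']:.1%}")

# Rounded to one decimal place, as in the short summary figure.
print(round(summary["average"], 1))
```

With only one model on the board, the average necessarily equals the leader's score, which is why both summary figures are identical.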

Current leaders

GPT-4 from OpenAI currently leads the SAT Math leaderboard with a score of 0.890 across 1 evaluated AI model.

1. GPT-4 (OpenAI): 89.0%

Source paper

Title
AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models
Authors
Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, and 5 others
Published
Abstract

Evaluating the general abilities of foundation models to tackle human-level tasks is a vital aspect of their development and application in the pursuit of Artificial General Intelligence (AGI). Traditional benchmarks, which rely on artificial datasets, may not accurately represent human-level capabilities. In this paper, we introduce AGIEval, a novel benchmark specifically designed to assess foundation models in the context of human-centric standardized exams, such as college entrance exams, law school admission tests, math competitions, and lawyer qualification tests. We evaluate several state-of-the-art foundation models, including GPT-4, ChatGPT, and Text-Davinci-003, using this benchmark. Impressively, GPT-4 surpasses average human performance on SAT, LSAT, and math competitions, attaining a 95% accuracy rate on the SAT Math test and a 92.5% accuracy on the English test of the Chinese national college entrance exam. This demonstrates the extraordinary performance of contemporary foundation models. In contrast, we also find that GPT-4 is less proficient in tasks that require complex reasoning or specific domain knowledge. Our comprehensive analyses of model capabilities (understanding, knowledge, reasoning, and calculation) reveal these models' strengths and limitations, providing valuable insights into future directions for enhancing their general capabilities. By concentrating on tasks pertinent to human cognition and decision-making, our benchmark delivers a more meaningful and robust evaluation of foundation models' performance in real-world scenarios. The data, code, and all model outputs are released in https://github.com/ruixiangcui/AGIEval.

FAQ

Common questions about the SAT Math benchmark and leaderboard.

What is the SAT Math benchmark?

SAT Math is a benchmark from AGIEval containing standardized mathematics questions from the College Board SAT examination. It is designed to evaluate the mathematical reasoning capabilities of foundation models using human-centric assessment methods.

What is the SAT Math leaderboard?

The SAT Math leaderboard ranks AI models by their performance on this benchmark; 1 model is currently listed. GPT-4 by OpenAI leads with a score of 0.890, which is also the average score because it is the only entry.

What is the highest SAT Math score?

The highest SAT Math score is 0.890, achieved by GPT-4 from OpenAI.

How many models are evaluated on SAT Math?

1 model has been evaluated on the SAT Math benchmark, with 0 verified results and 1 self-reported result.

Where can I find the SAT Math paper?

The source paper for SAT Math, "AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models", is available at https://arxiv.org/abs/2304.06364. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does SAT Math cover?

SAT Math is categorized under math and reasoning. The benchmark evaluates text models.

How recent are the SAT Math leaderboard results?

The SAT Math leaderboard was last updated in July 2026 and currently includes 1 evaluated model.