GDPval-MM

GDPval-MM Leaderboard

3 models

Rank 1: OpenAI; context 1.1M; cost $5.00 / $30.00
Rank 2: (details not shown)
Rank 3: 230B parameters; context 1.0M; cost $0.30 / $1.20
About this benchmark

What is GDPval-MM?

GDPval-MM is the multimodal variant of the GDPval benchmark, evaluating AI model performance on real-world economically valuable tasks that require processing and generating multimodal content including documents, slides, diagrams, spreadsheets, images, and other professional deliverables across diverse industries.

GDPval-MM is a multimodal benchmark covering multimodal, reasoning, finance, and general tasks. LLM Stats tracks 3 models on this benchmark, scored on a 0–1 scale. The current average is 0.754, with the leader at 0.849.

Current leaders

GPT-5.5 from OpenAI currently leads the GDPval-MM leaderboard with a score of 0.849, the highest of the 3 evaluated AI models.

1. GPT-5.5 (OpenAI): 84.9%
2. GPT-5.5 Pro (OpenAI): 82.3%
3. MiniMax M2.5 (MiniMax): 59.0%

Source paper

Title
GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks
Authors
Tejal Patwardhan, Rachel Dias, Elizabeth Proehl, Grace Kim, and 15 others
Published
Abstract

We introduce GDPval, a benchmark evaluating AI model capabilities on real-world economically valuable tasks. GDPval covers the majority of U.S. Bureau of Labor Statistics Work Activities for 44 occupations across the top 9 sectors contributing to U.S. GDP (Gross Domestic Product). Tasks are constructed from the representative work of industry professionals with an average of 14 years of experience. We find that frontier model performance on GDPval is improving roughly linearly over time, and that the current best frontier models are approaching industry experts in deliverable quality. We analyze the potential for frontier models, when paired with human oversight, to perform GDPval tasks cheaper and faster than unaided experts. We also demonstrate that increased reasoning effort, increased task context, and increased scaffolding improves model performance on GDPval. Finally, we open-source a gold subset of 220 tasks and provide a public automated grading service at evals.openai.com to facilitate future research in understanding real-world model capabilities.

FAQ

Common questions about the GDPval-MM benchmark and leaderboard.

What is the GDPval-MM benchmark?

GDPval-MM is the multimodal variant of the GDPval benchmark, evaluating AI model performance on real-world economically valuable tasks that require processing and generating multimodal content including documents, slides, diagrams, spreadsheets, images, and other professional deliverables across diverse industries.

What is the GDPval-MM leaderboard?

The GDPval-MM leaderboard ranks 3 AI models based on their performance on this benchmark. Currently, GPT-5.5 by OpenAI leads with a score of 0.849. The average score across all models is 0.754.
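
The average and the leader quoted above can be reproduced from the three scores listed under Current leaders. The following is a minimal Python sketch; the dictionary and variable names are illustrative and do not correspond to any LLM Stats API.

```python
# Scores on the 0-1 scale, as listed on the leaderboard.
scores = {
    "GPT-5.5": 0.849,
    "GPT-5.5 Pro": 0.823,
    "MiniMax M2.5": 0.590,
}

# Arithmetic mean across all evaluated models.
average = sum(scores.values()) / len(scores)

# The leader is the model with the highest score.
leader = max(scores, key=scores.get)

print(f"average = {average:.3f}")
print(f"leader  = {leader} ({scores[leader]:.3f})")
```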

What is the highest GDPval-MM score?

The highest GDPval-MM score is 0.849, achieved by GPT-5.5 from OpenAI.

How many models are evaluated on GDPval-MM?

Three models have been evaluated on the GDPval-MM benchmark. All 3 results are self-reported; none are verified.

Where can I find the GDPval-MM paper?

The source paper, "GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks", is available at https://arxiv.org/abs/2510.04374. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does GDPval-MM cover?

GDPval-MM is categorized under multimodal, reasoning, finance, and general. The benchmark evaluates multimodal models.

What's the difference between GDPval-MM and GDPval-AA?

GDPval-MM is a variant of GDPval-AA. See the GDPval-AA leaderboard for the broader benchmark and per-model comparison.

What is the best open-source model on GDPval-MM?

MiniMax M2.5 by MiniMax is the top-ranked open-source model on GDPval-MM, with a score of 0.590 (rank #3).

Which model offers the best value on GDPval-MM?

Among models scoring within 10% of the leader, GPT-5.5 from OpenAI is the cheapest, at $5.00 per million input tokens with a score of 0.849.
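
One way to read the selection rule above is sketched below in Python. It rests on several assumptions: "within 10% of the leader" is taken as a relative cutoff (score at least 90% of the top score), the function name `best_value` and the tuple layout are hypothetical, the $0.30 input price is paired with MiniMax M2.5 based on row order, and GPT-5.5 Pro's price is left as `None` because this page does not show it. Models without a known price are skipped, so the result only holds for the prices available here.

```python
# (name, score, input price in $ per million tokens).
# Hypothetical layout; None marks a price not shown on this page.
MODELS = [
    ("GPT-5.5", 0.849, 5.00),
    ("GPT-5.5 Pro", 0.823, None),
    ("MiniMax M2.5", 0.590, 0.30),  # price pairing assumed from row order
]


def best_value(models, tolerance=0.10):
    """Return the cheapest priced model scoring within `tolerance` of the leader."""
    top_score = max(score for _, score, _ in models)
    cutoff = top_score * (1 - tolerance)  # relative cutoff (an assumption)
    eligible = [
        (name, score, price)
        for name, score, price in models
        if score >= cutoff and price is not None
    ]
    if not eligible:
        return None
    # Cheapest input price among the eligible models.
    return min(eligible, key=lambda m: m[2])[0]


print(best_value(MODELS))
```

Widening `tolerance` admits lower-scoring models, so the answer depends directly on how "within 10%" is defined.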

How recent are the GDPval-MM leaderboard results?

The GDPval-MM leaderboard was last updated in June 2026 and currently includes 3 evaluated models.