MathVista
MathVista evaluates the mathematical reasoning of foundation models in visual contexts. It consists of 6,141 examples drawn from 28 existing multimodal datasets and 3 newly created datasets (IQTest, FunctionQA, and PaperQA), combining diverse mathematical and visual tasks to assess a model's ability to understand complex figures and perform rigorous reasoning.
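MathVista scoring compares a final answer extracted from the model's free-form response against the ground truth. The sketch below illustrates that comparison step only; `normalize` and the toy predictions are illustrative stand-ins, not the official MathVista scorer.

```python
# Hedged sketch of MathVista-style answer matching: a simplified
# exact-match comparison, not the benchmark's official implementation.

def normalize(answer: str) -> str:
    """Lowercase an answer and strip whitespace and a trailing period."""
    return answer.strip().lower().rstrip(".")

def exact_match_accuracy(predictions, ground_truths):
    """Fraction of predictions whose normalized form matches the reference."""
    assert len(predictions) == len(ground_truths)
    correct = sum(
        normalize(p) == normalize(g)
        for p, g in zip(predictions, ground_truths)
    )
    return correct / len(predictions)

# Toy example with made-up predictions and references (2 of 3 correct):
preds = ["4", "B", "12.5"]
golds = ["4", "C", "12.5"]
print(exact_match_accuracy(preds, golds))
```

In the real benchmark, answer extraction from chain-of-thought output is the harder step; this sketch assumes extraction has already happened.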
Progress Over Time
Interactive timeline showing model performance evolution on MathVista
MathVista Leaderboard
36 models
| Rank | Organization | Parameters | Context | Cost (input / output) |
|---|---|---|---|---|
| 1 | OpenAI | — | 200K | $2.00 / $8.00 |
| 2 | OpenAI | — | 200K | $1.10 / $4.40 |
| 3 | StepFun | 10B | — | — |
| 4 | Moonshot AI | — | — | — |
| 5 | Meta | 400B | 1.0M | $0.17 / $0.60 |
| 6 | OpenAI | — | 1.0M | $0.40 / $1.60 |
| 7 | OpenAI | — | 128K | $75.00 / $150.00 |
| 8 | OpenAI | — | 1.0M | $2.00 / $8.00 |
| 9 | OpenAI | — | 200K | $15.00 / $60.00 |
| 10 | Alibaba Cloud / Qwen Team | 73B | — | — |
| 11 | Meta | 109B | 10.0M | $0.08 / $0.30 |
| 12 | Mistral AI | 124B | 128K | $2.00 / $6.00 |
| 13 | xAI | — | 128K | $2.00 / $10.00 |
| 14 | xAI | — | — | — |
| 14 | Google | — | 2.1M | $2.50 / $10.00 |
| 16 | Alibaba Cloud / Qwen Team | 7B | — | — |
| 17 | Anthropic | — | 200K | $3.00 / $15.00 |
| 18 | Mistral AI | 24B | — | — |
| 19 | Google | — | 1.0M | $0.15 / $0.60 |
| 20 | OpenAI | — | 128K | $2.50 / $10.00 |
| 21 | DeepSeek | 27B | 129K | — |
| 22 | Microsoft | 6B | 128K | $0.05 / $0.10 |
| 23 | OpenAI | — | 128K | $2.50 / $10.00 |
| 24 | DeepSeek | 16B | — | — |
| 25 | Mistral AI | 12B | 128K | $0.15 / $0.15 |
| 26 | — | 90B | 128K | $0.35 / $0.40 |
| 27 | OpenAI | — | 128K | $0.15 / $0.60 |
| 28 | OpenAI | — | 1.0M | $0.10 / $0.40 |
| 29 | Google | 8B | 1.0M | $0.07 / $0.30 |
| 30 | DeepSeek | 3B | — | — |
| 31 | xAI | — | — | — |
| 31 | xAI | — | — | — |
| 33 | — | 11B | 128K | $0.05 / $0.05 |
| 34 | Google | — | 33K | $0.50 / $1.50 |
| 35 | Microsoft | 4B | — | — |
| 36 | OpenAI | — | 16K | $0.50 / $1.50 |
FAQ
Common questions about MathVista
**What is MathVista?**
MathVista evaluates the mathematical reasoning of foundation models in visual contexts. It consists of 6,141 examples drawn from 28 existing multimodal datasets and 3 newly created datasets (IQTest, FunctionQA, and PaperQA), combining diverse mathematical and visual tasks to assess a model's ability to understand complex figures and perform rigorous reasoning.
**Where can I find the MathVista paper?**
The MathVista paper is available at https://arxiv.org/abs/2310.02255. It details the benchmark's methodology, dataset construction, and evaluation criteria.
**How are models ranked on the MathVista leaderboard?**
The leaderboard ranks 36 AI models by their benchmark score. Currently, o3 by OpenAI leads with a score of 0.868; the average score across all models is 0.632.
**What is the highest MathVista score?**
The highest MathVista score is 0.868, achieved by o3 from OpenAI.
**How many models have been evaluated on MathVista?**
36 models have been evaluated on the MathVista benchmark, with 0 verified results and 34 self-reported results.
**What categories does MathVista cover?**
MathVista is categorized under vision, math, and multimodal, and is used to evaluate multimodal models.