
MathVista

MathVista evaluates the mathematical reasoning of foundation models in visual contexts. It consists of 6,141 examples derived from 28 existing multimodal datasets and 3 newly created datasets (IQTest, FunctionQA, and PaperQA), combining challenges from diverse mathematical and visual tasks to assess models' ability to understand complex figures and perform rigorous reasoning.
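MathVista answers are typically scored by first extracting a short answer from the model's free-form response and then comparing it to the gold answer. A minimal sketch of that comparison step, with illustrative helper names (this is not the official evaluation code):

```python
# Sketch of exact-match style scoring for a MathVista-like benchmark:
# normalize an extracted answer, then compare it to the gold answer.
# Helper names and normalization rules are illustrative assumptions.

def normalize_answer(ans: str) -> str:
    """Lowercase, trim, and strip a trailing period; canonicalize numbers."""
    ans = ans.strip().lower().rstrip(".")
    # Treat "7.0" and "7" as the same numeric answer when possible.
    try:
        num = float(ans)
        return str(int(num)) if num == int(num) else str(num)
    except ValueError:
        return ans

def is_correct(predicted: str, gold: str) -> bool:
    """True when the two answers match after normalization."""
    return normalize_answer(predicted) == normalize_answer(gold)

def accuracy(pairs) -> float:
    """Fraction of (predicted, gold) pairs that match after normalization."""
    return sum(is_correct(p, g) for p, g in pairs) / len(pairs)
```

A free-form response like " 4. " would then count as a match for the gold answer "4", while "12" against "13" would not.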

Paper: https://arxiv.org/abs/2310.02255

Progress Over Time

[Interactive timeline of model performance on MathVista over time, showing the state-of-the-art frontier and distinguishing open from proprietary models.]

MathVista Leaderboard

36 models
[Leaderboard table not recoverable from extraction. Each row listed a model's organization, parameter count, context window, input/output token pricing, and license, but model names and scores did not survive. Organizations represented include OpenAI, Moonshot AI, Alibaba Cloud / Qwen Team, Mistral AI, and DeepSeek.]

FAQ

Common questions about MathVista

What is MathVista?
MathVista evaluates the mathematical reasoning of foundation models in visual contexts. It consists of 6,141 examples derived from 28 existing multimodal datasets and 3 newly created datasets (IQTest, FunctionQA, and PaperQA), combining challenges from diverse mathematical and visual tasks to assess models' ability to understand complex figures and perform rigorous reasoning.

Where can I find the MathVista paper?
The MathVista paper is available at https://arxiv.org/abs/2310.02255. It provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.

How does the MathVista leaderboard work?
The MathVista leaderboard ranks 36 AI models based on their performance on this benchmark. Currently, o3 by OpenAI leads with a score of 0.868. The average score across all models is 0.632.

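The summary statistics above (top model, average score) are straightforward to compute from a score table. A small sketch with placeholder scores (the values below are illustrative, not the real leaderboard data):

```python
# Hypothetical leaderboard summary: the model names and scores here are
# illustrative placeholders, not actual MathVista results.
scores = {
    "model-a": 0.868,  # hypothetical leader
    "model-b": 0.712,
    "model-c": 0.540,
}

# Leader is the key with the highest score; average is the mean score.
leader = max(scores, key=scores.get)
average = sum(scores.values()) / len(scores)
```

With these placeholder values, `leader` is `"model-a"` and `average` rounds to 0.707.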
What is the highest MathVista score?
The highest MathVista score is 0.868, achieved by o3 from OpenAI.

How many models have been evaluated on MathVista?
36 models have been evaluated on the MathVista benchmark, with 34 of the results self-reported and none independently verified.

What categories does MathVista fall under?
MathVista is categorized under vision, math, and multimodal. It is primarily used to evaluate multimodal models.