MathVista

[Figure: Progress Over Time. Interactive timeline of model performance on MathVista. Legend: state-of-the-art frontier, open, proprietary.]

MathVista Leaderboard

39 models
Rank  Organization               Parameters  Context  Cost
1     ByteDance                  -           -        -
2     -                          -           -        -
3     OpenAI                     -           -        -
4     OpenAI                     -           -        -
5     -                          10B         -        -
6     -                          218B        -        -
7     Moonshot AI                -           -        -
8     -                          400B        -        -
9     -                          -           1.0M     $0.40 / $1.60
10    OpenAI                     -           -        -
11    OpenAI                     -           1.0M     $2.00 / $8.00
12    OpenAI                     -           -        -
13    Alibaba Cloud / Qwen Team  73B         -        -
14    -                          109B        -        -
15    Mistral AI                 124B        -        -
16    -                          -           -        -
17    -                          -           -        -
17    -                          -           -        -
19    Alibaba Cloud / Qwen Team  7B          -        -
20    -                          -           -        -
21    -                          24B         -        -
22    -                          -           -        -
23    OpenAI                     -           128K     $2.50 / $10.00
24    DeepSeek                   27B         -        -
25    -                          6B          -        -
26    OpenAI                     -           128K     $2.50 / $10.00
27    -                          16B         -        -
28    Mistral AI                 12B         -        -
29    -                          90B         -        -
30    -                          -           -        -
31    -                          -           1.0M     $0.10 / $0.40
32    -                          8B          -        -
33    -                          3B          -        -
34    -                          -           -        -
34    -                          -           -        -
36    -                          11B         -        -
37    -                          -           -        -
38    -                          4B          -        -
39    -                          -           16K      $0.50 / $1.50
About this benchmark

What is MathVista?

MathVista evaluates mathematical reasoning of foundation models in visual contexts. It consists of 6,141 examples derived from 28 existing multimodal datasets and 3 newly created datasets (IQTest, FunctionQA, and PaperQA), combining challenges from diverse mathematical and visual tasks to assess models' ability to understand complex figures and perform rigorous reasoning.

MathVista is a multimodal benchmark spanning math and vision tasks. LLM Stats tracks 39 models on it, scored on a 0–1 scale. The current average is 0.650, and the leader scores 0.907.

Current leaders

Seed 2.1 Pro from ByteDance currently leads the MathVista leaderboard with a score of 0.907 across 39 evaluated AI models.

Rank  Model           Organization  Score
1     Seed 2.1 Pro    ByteDance     90.7%
2     Seed 2.1 Turbo  ByteDance     90.5%
3     o3              OpenAI        86.8%

Top open-weight model: Step3-VL-10B, rank #5, 84.0%
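
The prose on this page reports scores on a 0–1 scale (for example 0.907), while the leader rows above show percentages (90.7%). A minimal sketch of the conversion, using only the four scores listed in this section. The `leaders` dictionary and the `to_fraction` helper are illustrative names, not part of any LLM Stats API.

```python
# Scores copied from the leader rows above, expressed as percentages.
leaders = {
    "Seed 2.1 Pro": 90.7,
    "Seed 2.1 Turbo": 90.5,
    "o3": 86.8,
    "Step3-VL-10B": 84.0,
}


def to_fraction(percent: float) -> float:
    """Convert a percentage score to the 0-1 scale, rounded to 3 decimals."""
    return round(percent / 100.0, 3)


# Same models, now on the 0-1 scale used in the page's prose.
fractions = {name: to_fraction(p) for name, p in leaders.items()}

# The model with the highest score among the four listed here.
top_model = max(fractions, key=fractions.get)

for name, score in fractions.items():
    print(f"{name}: {score:.3f}")
print("leader:", top_model)
```

Rounding to three decimals matches the precision the page uses for scores such as 0.907 and 0.840.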

Source paper

Title: MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts
Authors: Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, and 6 others
Published: October 2023 (arXiv:2310.02255)
Abstract

Large Language Models (LLMs) and Large Multimodal Models (LMMs) exhibit impressive problem-solving skills in many tasks and domains, but their ability in mathematical reasoning in visual contexts has not been systematically studied. To bridge this gap, we present MathVista, a benchmark designed to combine challenges from diverse mathematical and visual tasks. It consists of 6,141 examples, derived from 28 existing multimodal datasets involving mathematics and 3 newly created datasets (i.e., IQTest, FunctionQA, and PaperQA). Completing these tasks requires fine-grained, deep visual understanding and compositional reasoning, which all state-of-the-art foundation models find challenging. With MathVista, we have conducted a comprehensive, quantitative evaluation of 12 prominent foundation models. The best-performing GPT-4V model achieves an overall accuracy of 49.9%, substantially outperforming Bard, the second-best performer, by 15.1%. Our in-depth analysis reveals that the superiority of GPT-4V is mainly attributed to its enhanced visual perception and mathematical reasoning. However, GPT-4V still falls short of human performance by 10.4%, as it often struggles to understand complex figures and perform rigorous reasoning. This significant gap underscores the critical role that MathVista will play in the development of general-purpose AI agents capable of tackling mathematically intensive and visually rich real-world tasks. We further explore the new ability of self-verification, the application of self-consistency, and the interactive chatbot capabilities of GPT-4V, highlighting its promising potential for future research. The project is available at https://mathvista.github.io/.
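
The abstract gives GPT-4V's accuracy and its gaps to Bard and to human performance, but does not state the other two figures directly. A short sketch of the implied values, assuming the 15.1% and 10.4% gaps are absolute percentage-point differences rather than relative ones. The variable names are illustrative.

```python
# Figures quoted in the abstract (all in percent).
gpt4v_accuracy = 49.9   # overall accuracy of GPT-4V
gap_over_bard = 15.1    # GPT-4V's lead over Bard, assumed to be percentage points
gap_to_human = 10.4     # GPT-4V's shortfall versus humans, assumed percentage points

# Implied accuracies under the percentage-point assumption.
bard_accuracy = round(gpt4v_accuracy - gap_over_bard, 1)
human_accuracy = round(gpt4v_accuracy + gap_to_human, 1)

print(f"Implied Bard accuracy:  {bard_accuracy}%")
print(f"Implied human accuracy: {human_accuracy}%")
```

If the gaps were instead relative differences, the implied figures would differ, so the values printed here hold only under the stated assumption.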

FAQ

Common questions about the MathVista benchmark and leaderboard.

What is the MathVista leaderboard?

The MathVista leaderboard ranks 39 AI models based on their performance on this benchmark. Currently, Seed 2.1 Pro by ByteDance leads with a score of 0.907. The average score across all models is 0.650.

What is the highest MathVista score?

The highest MathVista score is 0.907, achieved by Seed 2.1 Pro from ByteDance.

How many models are evaluated on MathVista?

The MathVista leaderboard covers 39 evaluated models. Of their results, 37 are self-reported and none are verified.

Where can I find the MathVista paper?

The MathVista paper is available at https://arxiv.org/abs/2310.02255. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does MathVista cover?

MathVista is categorized under math, multimodal, and vision. The benchmark evaluates multimodal models.

What is the best open-source model on MathVista?

Step3-VL-10B by StepFun is the top-ranked open-source model on MathVista, with a score of 0.840 (rank #5).

How recent are the MathVista leaderboard results?

The MathVista leaderboard was last updated in July 2026 and currently includes 39 evaluated models.