MathVista
MathVista Leaderboard
| Rank | Organization | Parameters | Context | Cost | License | |
|---|---|---|---|---|---|---|
| 1 | ByteDance | — | — | — | ||
| 2 | ByteDance | — | — | — | ||
| 3 | OpenAI | — | — | — | ||
| 4 | OpenAI | — | — | — | ||
| 5 | StepFun | 10B | — | — | ||
| 6 | Cohere | 218B | — | — | ||
| 7 | Moonshot AI | — | — | — | ||
| 8 | Meta | 400B | — | — | ||
| 9 | OpenAI | — | 1.0M | $0.40 / $1.60 | ||
| 10 | OpenAI | — | — | — | ||
| 11 | OpenAI | — | 1.0M | $2.00 / $8.00 | ||
| 12 | OpenAI | — | — | — | ||
| 13 | Alibaba Cloud / Qwen Team | 73B | — | — | ||
| 14 | Meta | 109B | — | — | ||
| 15 | Mistral AI | 124B | — | — | ||
| 16 | xAI | — | — | — | ||
| 17 | Google | — | — | — | ||
| 17 | xAI | — | — | — | ||
| 19 | Alibaba Cloud / Qwen Team | 7B | — | — | ||
| 20 | Anthropic | — | — | — | ||
| 21 | Mistral AI | 24B | — | — | ||
| 22 | Google | — | — | — | ||
| 23 | OpenAI | — | 128K | $2.50 / $10.00 | ||
| 24 | DeepSeek | 27B | — | — | ||
| 25 | Microsoft | 6B | — | — | ||
| 26 | OpenAI | — | 128K | $2.50 / $10.00 | ||
| 27 | DeepSeek | 16B | — | — | ||
| 28 | Mistral AI | 12B | — | — | ||
| 29 | | 90B | — | — | ||
| 30 | OpenAI | — | — | — | ||
| 31 | OpenAI | — | 1.0M | $0.10 / $0.40 | ||
| 32 | Google | 8B | — | — | ||
| 33 | DeepSeek | 3B | — | — | ||
| 34 | xAI | — | — | — | ||
| 34 | xAI | — | — | — | ||
| 36 | | 11B | — | — | ||
| 37 | Google | — | — | — | ||
| 38 | Microsoft | 4B | — | — | ||
| 39 | OpenAI | — | 16K | $0.50 / $1.50 | ||
What is MathVista?
MathVista evaluates mathematical reasoning of foundation models in visual contexts. It consists of 6,141 examples derived from 28 existing multimodal datasets and 3 newly created datasets (IQTest, FunctionQA, and PaperQA), combining challenges from diverse mathematical and visual tasks to assess models' ability to understand complex figures and perform rigorous reasoning.
MathVista is a multimodal benchmark covering math and vision tasks. LLM Stats tracks 39 models on this benchmark, scored on a 0–1 scale. The current average is about 0.7, and the leading score is 0.907.
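As a rough illustration of the 0–1 scale, the sketch below assumes a score is simply accuracy, that is, the fraction of examples answered correctly. This matches how the source paper reports results as percentages, but the exact scoring pipeline behind this leaderboard is not documented here. The function name `accuracy_score` and the sample counts are hypothetical.

```python
def accuracy_score(num_correct: int, num_total: int) -> float:
    """Return the fraction of correct answers on a 0-1 scale.

    Assumption: the leaderboard score is plain accuracy.
    """
    if num_total <= 0:
        raise ValueError("num_total must be positive")
    if not 0 <= num_correct <= num_total:
        raise ValueError("num_correct must be between 0 and num_total")
    return num_correct / num_total


# Hypothetical counts, chosen only to illustrate the scale.
score = accuracy_score(907, 1000)
print(f"{score:.3f}")  # 0.907 on the 0-1 scale
print(f"{score:.1%}")  # 90.7% when written as a percentage
```

Under this assumption, a leaderboard score of 0.907 reads as 90.7% accuracy.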
Current leaders
Among the 39 evaluated AI models, Seed 2.1 Pro from ByteDance currently leads the MathVista leaderboard with a score of 0.907.
Source paper
- Title: MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts
- Authors: Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, and 6 others
- Published: arXiv:2310.02255
Abstract
Large Language Models (LLMs) and Large Multimodal Models (LMMs) exhibit impressive problem-solving skills in many tasks and domains, but their ability in mathematical reasoning in visual contexts has not been systematically studied. To bridge this gap, we present MathVista, a benchmark designed to combine challenges from diverse mathematical and visual tasks. It consists of 6,141 examples, derived from 28 existing multimodal datasets involving mathematics and 3 newly created datasets (i.e., IQTest, FunctionQA, and PaperQA). Completing these tasks requires fine-grained, deep visual understanding and compositional reasoning, which all state-of-the-art foundation models find challenging. With MathVista, we have conducted a comprehensive, quantitative evaluation of 12 prominent foundation models. The best-performing GPT-4V model achieves an overall accuracy of 49.9%, substantially outperforming Bard, the second-best performer, by 15.1%. Our in-depth analysis reveals that the superiority of GPT-4V is mainly attributed to its enhanced visual perception and mathematical reasoning. However, GPT-4V still falls short of human performance by 10.4%, as it often struggles to understand complex figures and perform rigorous reasoning. This significant gap underscores the critical role that MathVista will play in the development of general-purpose AI agents capable of tackling mathematically intensive and visually rich real-world tasks. We further explore the new ability of self-verification, the application of self-consistency, and the interactive chatbot capabilities of GPT-4V, highlighting its promising potential for future research. The project is available at https://mathvista.github.io/.
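The gaps quoted in the abstract can be turned into implied scores with simple arithmetic. The sketch below assumes both gaps (15.1% and 10.4%) are percentage-point differences rather than relative percentages, and it converts GPT-4V's accuracy to the 0–1 scale used on this page. All variable names are illustrative.

```python
# Figures quoted in the abstract (accuracy, in percent).
gpt4v_accuracy = 49.9
lead_over_bard = 15.1   # assumed to be percentage points
gap_to_human = 10.4     # assumed to be percentage points

# Implied scores for the second-best model and for human performance.
bard_accuracy = round(gpt4v_accuracy - lead_over_bard, 1)
human_accuracy = round(gpt4v_accuracy + gap_to_human, 1)

# The same GPT-4V result on the leaderboard's 0-1 scale.
gpt4v_unit_scale = round(gpt4v_accuracy / 100, 3)

print(f"Bard (implied):  {bard_accuracy:.1f}%")   # 34.8%
print(f"Human (implied): {human_accuracy:.1f}%")  # 60.3%
print(f"GPT-4V on 0-1 scale: {gpt4v_unit_scale:.3f}")  # 0.499
```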
FAQ
Common questions about the MathVista benchmark and leaderboard.