MathVista-Mini

Name: MathVista-Mini Leaderboard — AI Model Scores
Creator: LLM Stats
License: https://llm-stats.com/legal/terms-of-service

MathVista-Mini Leaderboard

23 models

Rank	Model / Organization	Parameters	Context	Cost per 1M tokens (input / output)
1	Kimi K2.5 Moonshot AI	1.0T	—	—
2	Qwen3.5-27B Alibaba Cloud / Qwen Team	27B	262K	$0.30 / $2.40
3	Qwen3.5-122B-A10B Alibaba Cloud / Qwen Team	122B	—	—
3	Qwen3.6-27B Alibaba Cloud / Qwen Team	28B	262K	$0.60 / $3.60
5	Qwen3.6-35B-A3B Alibaba Cloud / Qwen Team	35B	—	—
6	Qwen3.5-35B-A3B Alibaba Cloud / Qwen Team	35B	—	—
7	Qwen3 VL 32B Thinking Alibaba Cloud / Qwen Team	33B	—	—
8	Qwen3 VL 235B A22B Thinking Alibaba Cloud / Qwen Team	236B	—	—
9	Qwen3 VL 235B A22B Instruct Alibaba Cloud / Qwen Team	236B	—	—
10	Qwen3 VL 32B Instruct Alibaba Cloud / Qwen Team	33B	—	—
11	Qwen3 VL 30B A3B Thinking Alibaba Cloud / Qwen Team	31B	—	—
12	Qwen3 VL 8B Thinking Alibaba Cloud / Qwen Team	9B	—	—
13	Qwen3 VL 30B A3B Instruct Alibaba Cloud / Qwen Team	31B	—	—
14	Qwen3 VL 4B Thinking Alibaba Cloud / Qwen Team	4B	262K	$0.10 / $1.00
15	Qwen3 VL 8B Instruct Alibaba Cloud / Qwen Team	9B	—	—
16	Qwen2.5 VL 72B Instruct Alibaba Cloud / Qwen Team	72B	—	—
17	Qwen2.5 VL 32B Instruct Alibaba Cloud / Qwen Team	34B	—	—
18	Qwen3 VL 4B Instruct Alibaba Cloud / Qwen Team	4B	262K	$0.10 / $0.60
19	Qwen2-VL-72B-Instruct Alibaba Cloud / Qwen Team	73B	—	—
20	Qwen2.5 VL 7B Instruct Alibaba Cloud / Qwen Team	8B	—	—
21	Gemma 3 27B Google	27B	—	—
22	Gemma 3 12B Google	12B	—	—
23	Gemma 3 4B Google	4B	—	—

About this benchmark

What is MathVista-Mini?

MathVista-Mini is a smaller version of the MathVista benchmark that evaluates mathematical reasoning in visual contexts. It consists of examples derived from multimodal datasets involving mathematics, combining challenges from diverse mathematical and visual tasks to assess foundation models' ability to solve problems requiring both visual understanding and mathematical reasoning.

MathVista-Mini is a multimodal benchmark listed under the math, multimodal, and vision categories. LLM Stats tracks 23 models on it, with scores reported on a 0–1 scale. The current average is 0.786, and the leader scores 0.901.
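The page reports the same scores both as 0–1 fractions (0.901) and as percentages (90.1%). A minimal sketch of the conversion follows; the helper name `to_percent` is hypothetical, and the three scores are the ones listed under "Current leaders".

```python
def to_percent(score: float) -> str:
    """Format a 0-1 leaderboard score as a percentage with one decimal place."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0 and 1, got {score}")
    return f"{score:.1%}"


# Scores of the three current leaders, on the 0-1 scale.
top_three = {
    "Kimi K2.5": 0.901,
    "Qwen3.5-27B": 0.878,
    "Qwen3.5-122B-A10B": 0.874,
}

for name, score in top_three.items():
    print(f"{name}: {to_percent(score)}")
```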

Compare leaders on the best AI for math, best AI for multimodal and best AI for vision leaderboards.

Current leaders

Of the 23 evaluated AI models, Kimi K2.5 from Moonshot AI currently leads the MathVista-Mini leaderboard with a score of 0.901.

Kimi K2.5 (Moonshot AI): 90.1%

Qwen3.5-27B (Alibaba Cloud / Qwen Team): 87.8%

Qwen3.5-122B-A10B (Alibaba Cloud / Qwen Team): 87.4%

Source paper

Title: MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts
Authors: Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, and 6 others
Published: October 3, 2023
arXiv: 2310.02255

Abstract

Large Language Models (LLMs) and Large Multimodal Models (LMMs) exhibit impressive problem-solving skills in many tasks and domains, but their ability in mathematical reasoning in visual contexts has not been systematically studied. To bridge this gap, we present MathVista, a benchmark designed to combine challenges from diverse mathematical and visual tasks. It consists of 6,141 examples, derived from 28 existing multimodal datasets involving mathematics and 3 newly created datasets (i.e., IQTest, FunctionQA, and PaperQA). Completing these tasks requires fine-grained, deep visual understanding and compositional reasoning, which all state-of-the-art foundation models find challenging. With MathVista, we have conducted a comprehensive, quantitative evaluation of 12 prominent foundation models. The best-performing GPT-4V model achieves an overall accuracy of 49.9%, substantially outperforming Bard, the second-best performer, by 15.1%. Our in-depth analysis reveals that the superiority of GPT-4V is mainly attributed to its enhanced visual perception and mathematical reasoning. However, GPT-4V still falls short of human performance by 10.4%, as it often struggles to understand complex figures and perform rigorous reasoning. This significant gap underscores the critical role that MathVista will play in the development of general-purpose AI agents capable of tackling mathematically intensive and visually rich real-world tasks. We further explore the new ability of self-verification, the application of self-consistency, and the interactive chatbot capabilities of GPT-4V, highlighting its promising potential for future research. The project is available at https://mathvista.github.io/.
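The abstract gives GPT-4V's accuracy along with its lead over Bard and its gap to human performance, so the other two figures can be derived. The sketch below assumes the quoted differences are percentage points rather than relative percentages; the variable names are illustrative only.

```python
# Figures quoted in the abstract, in percent.
gpt4v_accuracy = 49.9
lead_over_bard = 15.1   # assumed to be percentage points
gap_to_human = 10.4     # assumed to be percentage points

# Implied accuracies under the percentage-point reading.
bard_accuracy = round(gpt4v_accuracy - lead_over_bard, 1)
human_accuracy = round(gpt4v_accuracy + gap_to_human, 1)

print(f"Bard (implied):  {bard_accuracy}%")
print(f"Human (implied): {human_accuracy}%")
```

Under that reading, Bard comes out at 34.8% and human performance at 60.3%.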

FAQ

Common questions about the MathVista-Mini benchmark and leaderboard.

What is the MathVista-Mini leaderboard?

The MathVista-Mini leaderboard ranks 23 AI models based on their performance on this benchmark. Currently, Kimi K2.5 by Moonshot AI leads with a score of 0.901. The average score across all models is 0.786.

What is the highest MathVista-Mini score?

The highest MathVista-Mini score is 0.901, achieved by Kimi K2.5 from Moonshot AI.

How many models are evaluated on MathVista-Mini?

23 models have been evaluated on the MathVista-Mini benchmark, with 0 verified results and 23 self-reported results.

Where can I find the MathVista-Mini paper?

The MathVista paper, which introduces the benchmark that MathVista-Mini is drawn from, is available at https://arxiv.org/abs/2310.02255. It details the methodology, dataset construction, and evaluation criteria.

What categories does MathVista-Mini cover?

MathVista-Mini is categorized under math, multimodal, and vision. The benchmark evaluates multimodal models.

What is the best open-source model on MathVista-Mini?

Kimi K2.5 by Moonshot AI is the top-ranked open-source model on MathVista-Mini, with a score of 0.901 (rank #1).

Which model offers the best value on MathVista-Mini?

Among models scoring within 10% of the leader, Qwen3.5-27B from Alibaba Cloud / Qwen Team is the cheapest, at $0.30 per million input tokens with a score of 0.878.
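The selection rule described above can be sketched as follows. This is not LLM Stats' actual implementation: it assumes "within 10% of the leader" means a score of at least 90% of the leader's score, and it skips models with no listed price. `Entry` and `best_value` are hypothetical names, and the data covers only the three models whose scores appear on this page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entry:
    name: str
    score: float                  # 0-1 scale
    input_price: Optional[float]  # USD per 1M input tokens; None if unlisted


def best_value(entries: list[Entry], tolerance: float = 0.10) -> Optional[Entry]:
    """Return the cheapest priced entry scoring within `tolerance` of the leader."""
    if not entries:
        return None
    leader_score = max(e.score for e in entries)
    # Assumption: "within 10%" is relative to the leader's score.
    threshold = leader_score * (1 - tolerance)
    candidates = [
        e for e in entries
        if e.score >= threshold and e.input_price is not None
    ]
    return min(candidates, key=lambda e: e.input_price, default=None)


entries = [
    Entry("Kimi K2.5", 0.901, None),
    Entry("Qwen3.5-27B", 0.878, 0.30),
    Entry("Qwen3.5-122B-A10B", 0.874, None),
]

winner = best_value(entries)
print(winner.name if winner else "no priced model near the leader")
```

With these three rows the function returns Qwen3.5-27B, since it is the only model in range with a listed price.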

How recent are the MathVista-Mini leaderboard results?

The MathVista-Mini leaderboard was last updated in July 2026 and currently includes 23 evaluated models.