VQAv2

Progress Over Time

Timeline chart of model performance on VQAv2. Legend: state-of-the-art frontier, open models, proprietary models.

VQAv2 Leaderboard

3 models
Rank  Model          Organization  Parameters
1     Pixtral Large  Mistral AI    124B
2     Pixtral-12B    Mistral AI    12B
3     (not listed)   (not listed)  390B
About this benchmark

What is VQAv2?

VQAv2 is a balanced Visual Question Answering dataset that addresses language bias by providing complementary images for each question, forcing models to rely on visual understanding rather than language priors. It contains approximately twice the number of image-question pairs compared to the original VQA dataset.
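
The effect of balancing can be illustrated with a minimal sketch. The questions, image identifiers, and the `language_only_predict` helper below are hypothetical toy constructions, not part of the dataset or its tooling, and plain exact-match scoring is used for simplicity. The point is only that a model which ignores the image must give the same answer to both images of a complementary pair, so it can get at most one of the two right.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    question: str
    image_id: str  # hypothetical identifier, for illustration only
    answer: str


# Hypothetical toy data: every question is paired with two similar images
# whose correct answers differ (a "complementary pair").
balanced = [
    Example("Is the umbrella open?", "img_001", "yes"),
    Example("Is the umbrella open?", "img_002", "no"),
    Example("What color is the bus?", "img_003", "red"),
    Example("What color is the bus?", "img_004", "blue"),
]


def language_only_predict(question, train):
    """Ignore the image: return the most common training answer for this question."""
    counts = Counter(ex.answer for ex in train if ex.question == question)
    return counts.most_common(1)[0][0]


def exact_match_accuracy(examples, train):
    """Fraction of examples where the image-blind prediction equals the answer."""
    correct = sum(
        language_only_predict(ex.question, train) == ex.answer for ex in examples
    )
    return correct / len(examples)


# The blind model answers each pair identically, so it is right once per pair.
blind_accuracy = exact_match_accuracy(balanced, balanced)
print(blind_accuracy)  # 0.5
```

On an unbalanced dataset, where most images for a given question share one answer, the same image-blind strategy could score much higher, which is the bias the balanced construction is meant to remove.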

VQAv2 is a multimodal benchmark covering reasoning, image-to-text, and vision tasks. LLM Stats tracks 3 models on this benchmark, scored on a 0–1 scale. The current average is 0.792, with the leader at 0.809.
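
For context on the 0–1 scale, VQA-style accuracy is commonly described per question as min(number of human annotators who gave the predicted answer / 3, 1), averaged over all questions. The sketch below implements that simplified form only. It omits the answer normalization and annotator-subset averaging performed by the official evaluation script, and this page does not state whether the self-reported scores listed here followed exactly this protocol. The function name and sample answers are hypothetical.

```python
def vqa_accuracy(predicted, human_answers):
    """Simplified VQA accuracy for one question.

    An answer counts as fully correct if at least 3 annotators gave it;
    fewer matches earn proportional partial credit.
    """
    matches = sum(1 for answer in human_answers if answer == predicted)
    return min(matches / 3.0, 1.0)


# Hypothetical set of 10 annotator answers for a single question.
human_answers = ["yes"] * 2 + ["no"] * 8

full_credit = vqa_accuracy("no", human_answers)       # 8 matches, capped at 1.0
partial_credit = vqa_accuracy("yes", human_answers)   # 2 matches, 2/3 credit
no_credit = vqa_accuracy("maybe", human_answers)      # 0 matches, 0.0
```

A benchmark-level score is then the mean of these per-question values, which is why results fall between 0 and 1.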


Current leaders

Pixtral Large from Mistral AI currently leads the 3 models evaluated on VQAv2 with a score of 0.809.

1. Pixtral Large (Mistral AI): 80.9%
2. Pixtral-12B (Mistral AI): 78.6%

Source paper

Title
Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
Authors
Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and 1 other
Published
Abstract

Problems at the intersection of vision and language are of significant importance both as challenging research questions and for the rich set of applications they enable. However, inherent structure in our world and bias in our language tend to be a simpler signal for learning than visual modalities, resulting in models that ignore visual information, leading to an inflated sense of their capability. We propose to counter these language priors for the task of Visual Question Answering (VQA) and make vision (the V in VQA) matter! Specifically, we balance the popular VQA dataset by collecting complementary images such that every question in our balanced dataset is associated with not just a single image, but rather a pair of similar images that result in two different answers to the question. Our dataset is by construction more balanced than the original VQA dataset and has approximately twice the number of image-question pairs. Our complete balanced dataset is publicly available at www.visualqa.org as part of the 2nd iteration of the Visual Question Answering Dataset and Challenge (VQA v2.0). We further benchmark a number of state-of-art VQA models on our balanced dataset. All models perform significantly worse on our balanced dataset, suggesting that these models have indeed learned to exploit language priors. This finding provides the first concrete empirical evidence for what seems to be a qualitative sense among practitioners. Finally, our data collection protocol for identifying complementary images enables us to develop a novel interpretable model, which in addition to providing an answer to the given (image, question) pair, also provides a counter-example based explanation. Specifically, it identifies an image that is similar to the original image, but it believes has a different answer to the same question. This can help in building trust for machines among their users.

FAQ

Common questions about the VQAv2 benchmark and leaderboard.

What is the VQAv2 benchmark?

VQAv2 is a balanced Visual Question Answering dataset that addresses language bias by providing complementary images for each question, forcing models to rely on visual understanding rather than language priors. It contains approximately twice the number of image-question pairs compared to the original VQA dataset.

What is the VQAv2 leaderboard?

The VQAv2 leaderboard ranks 3 AI models based on their performance on this benchmark. Currently, Pixtral Large by Mistral AI leads with a score of 0.809. The average score across all models is 0.792.
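
The figures on this page can be cross-checked with a little arithmetic. The sketch below uses only the two scores listed under "Current leaders" plus the reported average. It assumes the average is an unweighted mean over all 3 models, which the page does not confirm, so the third value is a rough implied estimate subject to rounding, not a reported result.

```python
# Scores as listed on this page (0-1 scale). The third model's name and
# score are not shown, so only two entries are known.
listed = {"Pixtral Large": 0.809, "Pixtral-12B": 0.786}
reported_mean = 0.792
model_count = 3

# Leader and percentage formatting, matching the 80.9% / 78.6% display.
leader = max(listed, key=listed.get)
as_percent = {name: f"{score * 100:.1f}%" for name, score in listed.items()}

# Assumption: reported_mean is an unweighted mean over all three models.
# The result is approximate because every input is already rounded.
implied_third = round(reported_mean * model_count - sum(listed.values()), 3)

print(leader, as_percent, implied_third)
```

Under that assumption the unlisted model would score roughly 0.78, which is consistent with it ranking third.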

What is the highest VQAv2 score?

The highest VQAv2 score is 0.809, achieved by Pixtral Large from Mistral AI.

How many models are evaluated on VQAv2?

Three models have been evaluated on the VQAv2 benchmark. All 3 results are self-reported; none are verified.

Where can I find the VQAv2 paper?

The VQAv2 paper is available at https://arxiv.org/abs/1612.00837. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does VQAv2 cover?

VQAv2 is categorized under multimodal, reasoning, image-to-text, and vision. It is used to evaluate multimodal models.

What is the best open-source model on VQAv2?

Pixtral Large by Mistral AI is the top-ranked open-source model on VQAv2, with a score of 0.809 (rank #1).

How recent are the VQAv2 leaderboard results?

The VQAv2 leaderboard was last updated in June 2026 and currently includes 3 evaluated models.