VQAv2

VQAv2 is a balanced Visual Question Answering dataset that addresses language bias by providing complementary images for each question, forcing models to rely on visual understanding rather than language priors. It contains roughly twice as many image-question pairs as the original VQA dataset.
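For reference, the official VQAv2 evaluation scores each prediction against the ten human answers collected per question using a consensus metric: a prediction gets full credit when at least three annotators gave the same answer, partial credit otherwise. The sketch below illustrates that metric; it omits the official answer-normalization steps (lowercasing, article and punctuation stripping, number-word mapping), so treat it as an approximation of the reference implementation rather than a drop-in replacement.

```python
def vqa_accuracy(predicted: str, human_answers: list[str]) -> float:
    """Consensus accuracy in the style of the official VQA evaluation.

    A prediction earns full credit when at least 3 of the 10 human
    annotators gave the same answer, partial credit otherwise, and the
    score is averaged over the leave-one-annotator-out subsets.
    Official answer normalization is omitted here for brevity.
    """
    scores = []
    for i in range(len(human_answers)):
        # Drop annotator i, count exact matches among the remaining answers.
        subset = human_answers[:i] + human_answers[i + 1:]
        matches = sum(a == predicted for a in subset)
        scores.append(min(matches / 3.0, 1.0))
    return sum(scores) / len(scores)


# Example: four of ten annotators answered "2", so predicting "2" scores 1.0.
print(vqa_accuracy("2", ["2"] * 4 + ["3"] * 6))
```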

Paper

https://arxiv.org/abs/1612.00837

Progress Over Time

[Interactive timeline showing model performance evolution on VQAv2, with the state-of-the-art frontier and an open vs. proprietary legend.]

VQAv2 Leaderboard

3 models • 0 verified
Rank  Model          Organization  Params  Context  Cost
1     Pixtral Large  Mistral AI    124B    128K     $2.00 / $6.00
2     -              Mistral AI    12B     128K     $0.15 / $0.15
3     -              -             90B     128K     $0.35 / $0.40

FAQ

Common questions about VQAv2

What is VQAv2?
VQAv2 is a balanced Visual Question Answering dataset that addresses language bias by providing complementary images for each question, forcing models to rely on visual understanding rather than language priors. It contains roughly twice as many image-question pairs as the original VQA dataset.

Where can I find the VQAv2 paper?
The VQAv2 paper is available at https://arxiv.org/abs/1612.00837. It provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.

How are models ranked on the VQAv2 leaderboard?
The VQAv2 leaderboard ranks 3 AI models by their performance on this benchmark. Currently, Pixtral Large by Mistral AI leads with a score of 0.809. The average score across all models is 0.792.

What is the highest score on VQAv2?
The highest VQAv2 score is 0.809, achieved by Pixtral Large from Mistral AI.

How many models have been evaluated on VQAv2?
Three models have been evaluated on the VQAv2 benchmark, with 0 verified results and 3 self-reported results.

Which categories does VQAv2 belong to?
VQAv2 is categorized under image-to-text, multimodal, reasoning, and vision, and it evaluates multimodal models.