VQAv2
VQAv2 is a balanced Visual Question Answering dataset that addresses language bias by providing complementary images for each question, forcing models to rely on visual understanding rather than language priors. It contains approximately twice the number of image-question pairs compared to the original VQA dataset.
VQAv2 Leaderboard
3 models • 0 verified
| Rank | Model | Provider | Params | Context | Cost (input / output) | License |
|---|---|---|---|---|---|---|
| 1 | Pixtral Large | Mistral AI | 124B | 128K | $2.00 / $6.00 | |
| 2 | | Mistral AI | 12B | 128K | $0.15 / $0.15 | |
| 3 | | | 90B | 128K | $0.35 / $0.40 | |
FAQ
Common questions about VQAv2
What is VQAv2?
VQAv2 is a balanced Visual Question Answering dataset that addresses language bias by providing complementary images for each question, forcing models to rely on visual understanding rather than language priors. It contains approximately twice the number of image-question pairs compared to the original VQA dataset.
Where can I find the VQAv2 paper?
The VQAv2 paper is available at https://arxiv.org/abs/1612.00837. It describes the benchmark methodology, dataset creation, and evaluation criteria in detail.
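The evaluation criterion defined in the VQAv2 paper is the standard VQA accuracy metric: each question has 10 human-annotated answers, a predicted answer scores min(matches / 3, 1) against each leave-one-annotator-out subset, and the final score averages those subsets. A minimal sketch of that metric (the `vqa_accuracy` helper name is mine; the official evaluation script additionally normalizes answers, e.g. lowercasing and stripping articles and punctuation, which is omitted here):

```python
def vqa_accuracy(prediction: str, human_answers: list[str]) -> float:
    """VQA accuracy: average over all leave-one-annotator-out subsets
    of min(#annotators in the subset who gave this answer / 3, 1)."""
    accs = []
    for i in range(len(human_answers)):
        # Drop one annotator, count agreement among the remaining nine.
        others = human_answers[:i] + human_answers[i + 1:]
        matches = sum(a == prediction for a in others)
        accs.append(min(matches / 3.0, 1.0))
    return sum(accs) / len(accs)

# An answer matching 3+ annotators in every subset scores full credit:
print(vqa_accuracy("cat", ["cat"] * 10))            # 1.0
# With only 2 of 10 annotators agreeing, credit is partial:
print(vqa_accuracy("cat", ["cat"] * 2 + ["dog"] * 8))
```

The min(·/3, 1) clamp means an answer given by at least three humans counts as fully correct, which makes the metric robust to inter-annotator disagreement.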
How does the VQAv2 leaderboard work?
The VQAv2 leaderboard ranks 3 AI models based on their performance on this benchmark. Currently, Pixtral Large by Mistral AI leads with a score of 0.809. The average score across all models is 0.792.
What is the highest score on VQAv2?
The highest VQAv2 score is 0.809, achieved by Pixtral Large from Mistral AI.
How many models have been evaluated on VQAv2?
3 models have been evaluated on the VQAv2 benchmark, with 0 verified results and 3 self-reported results.
What categories does VQAv2 fall under?
VQAv2 is categorized under image to text, multimodal, reasoning, and vision; it is used to evaluate multimodal models.