TextVQA

TextVQA contains 45,336 questions on 28,408 images that require reading and reasoning about text in the image to answer. It was introduced to benchmark VQA models' ability to read and reason about text within images, a capability that is especially important for assistive technologies serving visually impaired users. The dataset fills a gap: earlier VQA datasets either contained few text-based questions or were too small.
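TextVQA is scored with the standard soft VQA accuracy metric: each question has ten human-annotated answers, and a prediction counts as fully correct when at least three annotators gave it. A minimal sketch of that metric (the official evaluator also normalizes punctuation, articles, and number words, which is omitted here):

```python
def vqa_accuracy(prediction: str, human_answers: list[str]) -> float:
    """Soft VQA accuracy: min(#matching annotators / 3, 1),
    averaged over the ten leave-one-out subsets of annotators."""
    pred = prediction.strip().lower()
    answers = [a.strip().lower() for a in human_answers]
    scores = []
    for i in range(len(answers)):
        # score the prediction against the other nine annotators
        others = answers[:i] + answers[i + 1:]
        matches = sum(a == pred for a in others)
        scores.append(min(matches / 3.0, 1.0))
    return sum(scores) / len(scores)

# Ten annotator answers; "stop" was given by 2 of them
refs = ["stop"] * 2 + ["go"] * 8
print(round(vqa_accuracy("stop", refs), 3))  # → 0.6
```

The leave-one-out averaging makes the metric robust to a single disagreeing annotator, which matters for TextVQA because OCR-style answers often vary slightly in phrasing.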

Paper: https://arxiv.org/abs/1904.08920

Progress Over Time

[Interactive timeline showing model performance evolution on TextVQA; models are marked Open or Proprietary and plotted against the state-of-the-art frontier.]

TextVQA Leaderboard

15 models (rank · organization · parameters · context window · cost, USD per 1M input / output tokens):

1. Alibaba Cloud / Qwen Team · 73B
2. Alibaba Cloud / Qwen Team · 8B
3. Alibaba Cloud / Qwen Team · 7B
4. DeepSeek · 27B · 129K
5. 16B
6. Amazon · 73B · 300K · $0.80 / $3.20
8. Amazon · 300K · $0.06 / $0.24
9. 106B · 128K · $0.05 / $0.10
11. 90B · 128K · $0.35 / $0.40
12. 4B
13. 12B · 131K · $0.05 / $0.10
14. 27B · 131K · $0.10 / $0.20
15. 4B · 131K · $0.02 / $0.04
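The cost column lists prices in USD per million input / output tokens. As a rough sketch of how such figures translate into per-request cost (the prices below match one of the table's entries; the token counts are hypothetical):

```python
def request_cost(input_tokens: int, output_tokens: int,
                 price_in: float, price_out: float) -> float:
    """Estimate the USD cost of one request, given per-1M-token prices."""
    return input_tokens / 1e6 * price_in + output_tokens / 1e6 * price_out

# A $0.80 / $3.20 model, with a 2,000-token prompt and a 500-token answer
print(round(request_cost(2000, 500, 0.80, 3.20), 4))  # → 0.0032
```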

FAQ

Common questions about TextVQA

What is TextVQA?
TextVQA contains 45,336 questions on 28,408 images that require reading and reasoning about text in the image to answer. It was introduced to benchmark VQA models' ability to read and reason about text within images, a capability that is especially important for assistive technologies serving visually impaired users. The dataset fills a gap: earlier VQA datasets either contained few text-based questions or were too small.
Where can I find the TextVQA paper?
The TextVQA paper is available at https://arxiv.org/abs/1904.08920. It describes the benchmark methodology, dataset creation, and evaluation criteria in detail.
How are models ranked on the TextVQA leaderboard?
The TextVQA leaderboard ranks 15 AI models by their performance on this benchmark. Currently, Qwen2-VL-72B-Instruct by Alibaba Cloud / Qwen Team leads with a score of 0.855; the average score across all models is 0.770.
What is the highest TextVQA score?
The highest TextVQA score is 0.855, achieved by Qwen2-VL-72B-Instruct from Alibaba Cloud / Qwen Team.
How many models have been evaluated on TextVQA?
15 models have been evaluated on the TextVQA benchmark; all 15 results are self-reported, and none have been independently verified.
What categories does TextVQA fall under?
TextVQA is categorized under vision, image-to-text, and multimodal; it evaluates multimodal models.