TextVQA
TextVQA contains 45,336 questions on 28,408 images that require reasoning about text to answer. It was introduced to benchmark VQA models' ability to read and reason about text within images, a capability that matters particularly for assistive technologies for visually impaired users. The dataset addresses a gap in earlier VQA datasets, which either had few text-based questions or were too small.
Progress Over Time
[Interactive timeline: model performance on TextVQA over time, with the state-of-the-art frontier and open vs. proprietary models marked]
TextVQA Leaderboard
15 models
| Rank | Organization | Parameters | Context | Cost | License |
|---|---|---|---|---|---|
| 1 | Alibaba Cloud / Qwen Team | 73B | — | — | |
| 2 | Alibaba Cloud / Qwen Team | 8B | — | — | |
| 3 | Alibaba Cloud / Qwen Team | 7B | — | — | |
| 4 | DeepSeek | 27B | 129K | — | |
| 5 | DeepSeek | 16B | — | — | |
| 6 | Amazon | — | 300K | $0.80 / $3.20 | |
| 7 | DeepSeek | 3B | — | — | |
| 8 | Amazon | — | 300K | $0.06 / $0.24 | |
| 9 | xAI | — | — | — | |
| 10 | Microsoft | 6B | 128K | $0.05 / $0.10 | |
| 11 | | 90B | 128K | $0.35 / $0.40 | |
| 12 | Microsoft | 4B | — | — | |
| 13 | Google | 12B | 131K | $0.05 / $0.10 | |
| 14 | Google | 27B | 131K | $0.10 / $0.20 | |
| 15 | Google | 4B | 131K | $0.02 / $0.04 | |
FAQ
Common questions about TextVQA
The TextVQA paper is available at https://arxiv.org/abs/1904.08920. This paper provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.
The TextVQA leaderboard ranks 15 AI models based on their performance on this benchmark. Currently, Qwen2-VL-72B-Instruct by Alibaba Cloud / Qwen Team leads with a score of 0.855. The average score across all models is 0.770.
The highest TextVQA score is 0.855, achieved by Qwen2-VL-72B-Instruct from Alibaba Cloud / Qwen Team.
All 15 models evaluated on the TextVQA benchmark have self-reported results; none of the results are verified.
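The summary figures quoted in these answers (the leading model, the average score, and the verified vs. self-reported counts) are simple aggregates over the leaderboard entries. The following is a minimal sketch of that arithmetic, not the site's actual implementation. The `Entry` type, the `summarize` function, and the model names and scores are hypothetical placeholders, not real leaderboard data.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Entry:
    """One leaderboard row (hypothetical structure, for illustration only)."""
    model: str      # model name
    score: float    # benchmark score in [0, 1]
    verified: bool  # False means the score is self-reported


def summarize(entries):
    """Return the leader, top score, mean score, and verification counts."""
    # Rank by score, highest first, as a leaderboard would.
    ranked = sorted(entries, key=lambda e: e.score, reverse=True)
    return {
        "leader": ranked[0].model,
        "top_score": ranked[0].score,
        "average": round(mean(e.score for e in entries), 3),
        "verified": sum(e.verified for e in entries),
        "self_reported": sum(not e.verified for e in entries),
    }


# Placeholder entries: names and scores are invented for this example.
entries = [
    Entry("model-a", 0.80, verified=False),
    Entry("model-b", 0.70, verified=False),
    Entry("model-c", 0.75, verified=True),
]

summary = summarize(entries)
print(summary)
```

Note that `summarize` assumes at least one entry; an empty list would raise an error when selecting the leader.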
TextVQA is categorized under vision, image to text, and multimodal. The benchmark evaluates multimodal models.