TextVQA

TextVQA Leaderboard

15 models
Rank  Organization               Parameters
1     Alibaba Cloud / Qwen Team  73B
2     Alibaba Cloud / Qwen Team  8B
3     Alibaba Cloud / Qwen Team  7B
4     DeepSeek                   27B
5     -                          16B
6     Amazon                     -
7     -                          3B
8     Amazon                     -
9     -                          -
10    -                          6B
11    -                          90B
12    -                          4B
13    -                          12B
14    -                          27B
15    -                          4B

About this benchmark

What is TextVQA?

TextVQA contains 45,336 questions on 28,408 images that require reasoning about text to answer. It was introduced to benchmark VQA models' ability to read and reason about text within images, particularly for assistive technologies for visually impaired users. The dataset addresses a gap: existing VQA datasets either had few text-based questions or were too small.

TextVQA is a multimodal benchmark categorized under image to text, multimodal, and vision tasks. LLM Stats tracks 15 models on this benchmark, scored on a 0–1 scale. The current average is 0.770, with the leader at 0.855.
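
The page does not say how the 0–1 score is computed. As a rough illustration only, the sketch below assumes the simplified VQA soft-accuracy rule that is commonly used for TextVQA-style evaluation: a prediction earns min(matches / 3, 1) credit against the human answers collected for a question, and the benchmark score is the mean over questions. The function names, the lowercase/strip normalization, and the sample answers are hypothetical; official evaluation scripts apply more normalization and may average over annotator subsets.

```python
def vqa_soft_accuracy(prediction: str, human_answers: list[str]) -> float:
    """Simplified VQA accuracy (assumed rule): min(number of matching humans / 3, 1)."""
    normalized = prediction.strip().lower()
    matches = sum(1 for answer in human_answers if answer.strip().lower() == normalized)
    return min(matches / 3.0, 1.0)


def benchmark_score(predictions: list[str], references: list[list[str]]) -> float:
    """Mean per-question soft accuracy, giving a value on a 0-1 scale."""
    scores = [vqa_soft_accuracy(p, refs) for p, refs in zip(predictions, references)]
    return sum(scores) / len(scores)


# Hypothetical annotations for two questions (10 human answers each).
references = [
    ["stop"] * 7 + ["stop sign"] * 3,
    ["coca cola"] * 2 + ["coke"] * 8,
]
predictions = ["Stop", "coca cola"]

print(benchmark_score(predictions, references))
```

Under this rule an answer given by at least three annotators earns full credit, which is why a score such as 0.855 is usually also reported as 85.5%.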

Current leaders

Qwen2-VL-72B-Instruct from Alibaba Cloud / Qwen Team currently leads the TextVQA leaderboard with a score of 0.855, the highest of the 15 evaluated models.

1. Qwen2-VL-72B-Instruct, Alibaba Cloud / Qwen Team, 85.5%
2. Qwen2.5 VL 7B Instruct, Alibaba Cloud / Qwen Team, 84.9%
3. Qwen2.5-Omni-7B, Alibaba Cloud / Qwen Team, 84.4%

Source paper

Title
Towards VQA Models That Can Read
Authors
Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, and 4 others
Published
Abstract

Studies have shown that a dominant class of questions asked by visually impaired users on images of their surroundings involves reading text in the image. But today's VQA models can not read! Our paper takes a first step towards addressing this problem. First, we introduce a new "TextVQA" dataset to facilitate progress on this important problem. Existing datasets either have a small proportion of questions about text (e.g., the VQA dataset) or are too small (e.g., the VizWiz dataset). TextVQA contains 45,336 questions on 28,408 images that require reasoning about text to answer. Second, we introduce a novel model architecture that reads text in the image, reasons about it in the context of the image and the question, and predicts an answer which might be a deduction based on the text and the image or composed of the strings found in the image. Consequently, we call our approach Look, Read, Reason & Answer (LoRRA). We show that LoRRA outperforms existing state-of-the-art VQA models on our TextVQA dataset. We find that the gap between human performance and machine performance is significantly larger on TextVQA than on VQA 2.0, suggesting that TextVQA is well-suited to benchmark progress along directions complementary to VQA 2.0.

FAQ

Common questions about the TextVQA benchmark and leaderboard.

What is the TextVQA benchmark?

TextVQA contains 45,336 questions on 28,408 images that require reasoning about text to answer. It was introduced to benchmark VQA models' ability to read and reason about text within images, particularly for assistive technologies for visually impaired users. The dataset addresses a gap: existing VQA datasets either had few text-based questions or were too small.

What is the TextVQA leaderboard?

The TextVQA leaderboard ranks 15 AI models based on their performance on this benchmark. Currently, Qwen2-VL-72B-Instruct by Alibaba Cloud / Qwen Team leads with a score of 0.855. The average score across all models is 0.770.

What is the highest TextVQA score?

The highest TextVQA score is 0.855, achieved by Qwen2-VL-72B-Instruct from Alibaba Cloud / Qwen Team.

How many models are evaluated on TextVQA?

15 models have been evaluated on the TextVQA benchmark, with 0 verified results and 15 self-reported results.

Where can I find the TextVQA paper?

The TextVQA paper is available at https://arxiv.org/abs/1904.08920. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does TextVQA cover?

TextVQA is categorized under image to text, multimodal, and vision. The benchmark evaluates multimodal models.

What is the best open-source model on TextVQA?

Qwen2-VL-72B-Instruct by Alibaba Cloud / Qwen Team is the top-ranked open-source model on TextVQA, with a score of 0.855 (rank #1).

How recent are the TextVQA leaderboard results?

The TextVQA leaderboard was last updated in July 2026 and currently includes 15 evaluated models.