OCRBench_V2
OCRBench_V2 Leaderboard
| Rank | Organization | Parameters | Context | Cost | License |
|---|---|---|---|---|---|
| 1 | Alibaba Cloud / Qwen Team | — | 1.0M | $0.32 / $1.28 | |
| 2 | Amazon | — | — | — | |
| 3 | ByteDance | — | — | — | |
| 4 | ByteDance | — | — | — | |
| 5 | Amazon | — | — | — | |
| 6 | Alibaba Cloud / Qwen Team | 7B | — | — | |
| 7 | Amazon | — | 1.0M | $0.30 / $2.50 | |
What is OCRBench_V2?
OCRBench v2 is an enhanced large-scale bilingual benchmark for evaluating Large Multimodal Models on visual text localization and reasoning, with 10,000 human-verified question-answering pairs across 8 core OCR capabilities.
OCRBench_V2 is a multimodal benchmark that evaluates models on image-to-text and vision tasks. LLM Stats tracks 7 models on this benchmark, scored on a 0–1 scale. The current average is about 0.6, and the leader scores 0.671.
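The leaderboard reports scores on a 0–1 scale, while the source paper quotes scores out of 100. The sketch below converts between the two, assuming the leaderboard value is simply the paper-style score divided by 100. That is an assumption: this page does not state how scores are normalized. The function names are hypothetical and exist only for illustration.

```python
def to_unit_scale(score_out_of_100: float) -> float:
    """Convert a 0-100 score to a 0-1 score (assumed linear rescaling)."""
    if not 0 <= score_out_of_100 <= 100:
        raise ValueError("score must be between 0 and 100")
    return score_out_of_100 / 100


def to_percent_scale(unit_score: float) -> float:
    """Convert a 0-1 score to a 0-100 score, rounded to one decimal place."""
    if not 0 <= unit_score <= 1:
        raise ValueError("score must be between 0 and 1")
    return round(unit_score * 100, 1)


# Leader score quoted on this page, expressed on the paper's 0-100 scale.
leader_percent = to_percent_scale(0.671)
print(leader_percent)

# The paper's "below 50 out of 100" threshold, expressed on the 0-1 scale.
threshold_unit = to_unit_scale(50)
print(threshold_unit)
```

Under that assumption, a leaderboard score can be compared directly with the "below 50" threshold that the paper's abstract mentions.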
Compare leaders on the best AI for image to text and best AI for vision leaderboards.
Current leaders
Qwen3.7-Plus from Alibaba Cloud / Qwen Team currently leads the OCRBench_V2 leaderboard with a score of 0.671, the highest among the 7 evaluated models.
Source paper
- Title: OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning
- Authors: Ling Fu, Zhebin Kuang, Jiajun Song, Mingxin Huang, and 20 others
- Published: arXiv 2501.00321
Abstract
Scoring the Optical Character Recognition (OCR) capabilities of Large Multimodal Models (LMMs) has witnessed growing interest. Existing benchmarks have highlighted the impressive performance of LMMs in text recognition; however, their abilities in certain challenging tasks, such as text localization, handwritten content extraction, and logical reasoning, remain underexplored. To bridge this gap, we introduce OCRBench v2, a large-scale bilingual text-centric benchmark with currently the most comprehensive set of tasks (4x more tasks than the previous multi-scene benchmark OCRBench), the widest coverage of scenarios (31 diverse scenarios), and thorough evaluation metrics, with 10,000 human-verified question-answering pairs and a high proportion of difficult samples. Moreover, we construct a private test set with 1,500 manually annotated images. The consistent evaluation trends observed across both public and private test sets validate the OCRBench v2's reliability. After carefully benchmarking state-of-the-art LMMs, we find that most LMMs score below 50 (100 in total) and suffer from five-type limitations, including less frequently encountered text recognition, fine-grained perception, layout perception, complex element parsing, and logical reasoning. The project website is at: https://99franklin.github.io/ocrbench_v2/
FAQ
Common questions about the OCRBench_V2 benchmark and leaderboard.