ChartQA
ChartQA is a large-scale benchmark comprising 9.6K human-written questions and 23.1K questions generated from human-written chart summaries, designed to evaluate models' abilities in visual and logical reasoning over charts.
Progress Over Time
[Interactive timeline omitted: model performance evolution on ChartQA, with a state-of-the-art frontier and open vs. proprietary model markers.]
ChartQA Leaderboard
24 models
| # | Organization | Params | Context | Cost (input / output) |
|---|---|---|---|---|
| 1 | Anthropic | — | 200K | $3.00 / $15.00 |
| 2 | Meta | 400B | 1.0M | $0.17 / $0.60 |
| 3 | Alibaba Cloud / Qwen Team | 72B | — | — |
| 4 | Amazon | — | 300K | $0.80 / $3.20 |
| 5 | Meta | 109B | 10.0M | $0.08 / $0.30 |
| 6 | Alibaba Cloud / Qwen Team | 73B | — | — |
| 7 | Mistral AI | 124B | 128K | $2.00 / $6.00 |
| 8 | Mistral AI | 24B | — | — |
| 9 | Alibaba Cloud / Qwen Team | 8B | — | — |
| 10 | Amazon | — | 300K | $0.06 / $0.24 |
| 11 | DeepSeek | 27B | — | — |
| 12 | OpenAI | — | 128K | $2.50 / $10.00 |
| 13 | — | 90B | 128K | $0.35 / $0.40 |
| 14 | Alibaba Cloud / Qwen Team | 7B | — | — |
| 15 | DeepSeek | 16B | — | — |
| 16 | — | 11B | 128K | $0.05 / $0.05 |
| 17 | Mistral AI | 12B | 128K | $0.15 / $0.15 |
| 17 | Microsoft | 4B | — | — |
| 19 | Microsoft | 6B | 128K | $0.05 / $0.10 |
| 20 | DeepSeek | 3B | — | — |
| 21 | Google | 27B | 131K | $0.10 / $0.20 |
| 22 | xAI | — | — | — |
| 23 | Google | 12B | 131K | $0.05 / $0.10 |
| 24 | Google | 4B | 131K | $0.02 / $0.04 |
FAQ
Common questions about ChartQA
What is ChartQA?
ChartQA is a large-scale benchmark comprising 9.6K human-written questions and 23.1K questions generated from human-written chart summaries, designed to evaluate models' abilities in visual and logical reasoning over charts.
Where can I find the ChartQA paper?
The ChartQA paper is available at https://arxiv.org/abs/2203.10244. It details the benchmark methodology, dataset creation, and evaluation criteria.
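The ChartQA paper scores answers with a "relaxed accuracy" metric: a numeric prediction counts as correct if it is within 5% of the gold value, while non-numeric answers require an exact (case-insensitive) string match. Below is a minimal sketch of that logic; the reference implementation in the authors' repository may differ in details such as percent-sign and comma handling.

```python
def relaxed_accuracy(pred: str, target: str, tolerance: float = 0.05) -> bool:
    """ChartQA-style relaxed match.

    Numeric answers may deviate up to `tolerance` (default 5%) from the
    gold value; all other answers need an exact, case-insensitive match.
    """
    try:
        # Treat both strings as numbers, tolerating a trailing percent sign.
        p = float(pred.strip().rstrip("%"))
        t = float(target.strip().rstrip("%"))
        if t == 0:
            return p == t
        return abs(p - t) / abs(t) <= tolerance
    except ValueError:
        # Fall back to exact string matching for non-numeric answers.
        return pred.strip().lower() == target.strip().lower()
```

A benchmark score such as 0.908 is then the fraction of questions for which this check returns `True`.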
How are models ranked on ChartQA?
The ChartQA leaderboard ranks 24 AI models by benchmark score. Claude 3.5 Sonnet by Anthropic currently leads with a score of 0.908; the average score across all models is 0.842.
What is the highest ChartQA score?
The highest ChartQA score is 0.908, achieved by Claude 3.5 Sonnet from Anthropic.
How many models have been evaluated on ChartQA?
24 models have been evaluated on the ChartQA benchmark. All 24 results are self-reported; none have been independently verified.
What categories does ChartQA fall under?
ChartQA is categorized under multimodal, reasoning, and vision; it evaluates multimodal models.