ChartQA

ChartQA is a large-scale benchmark comprising 9.6K human-written questions and 23.1K questions generated from human-written chart summaries, designed to evaluate models' abilities in visual and logical reasoning over charts.
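For readers who want to inspect the data directly, the sketch below iterates over a few test examples. It assumes the community mirror HuggingFaceM4/ChartQA on the Hugging Face Hub and its query/label field names; both are assumptions that should be checked against the dataset card.

```python
# Minimal sketch: peek at a few ChartQA test examples.
# Assumes the "HuggingFaceM4/ChartQA" Hub mirror and its field names.
from datasets import load_dataset

chartqa = load_dataset("HuggingFaceM4/ChartQA", split="test")

for example in chartqa.select(range(3)):
    print("Question:", example["query"])  # human- or machine-written question
    print("Answer:  ", example["label"])  # gold answer(s)
    # example["image"] holds the chart itself as a PIL image
```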

Paper: https://arxiv.org/abs/2203.10244

Progress Over Time

[Interactive timeline showing model performance evolution on ChartQA, tracing the state-of-the-art frontier and distinguishing open from proprietary models.]

ChartQA Leaderboard

24 models
Rank  Model              Organization               Params  Context  Cost (in / out)
1     Claude 3.5 Sonnet  Anthropic                  —       200K     $3.00 / $15.00
2     —                  —                          400B    1.0M     $0.17 / $0.60
3     —                  Alibaba Cloud / Qwen Team  72B     —        —
4     —                  Amazon                     —       300K     $0.80 / $3.20
5     —                  —                          109B    10.0M    $0.08 / $0.30
6     —                  Alibaba Cloud / Qwen Team  73B     —        —
7     —                  Mistral AI                 124B    128K     $2.00 / $6.00
8     —                  —                          24B     —        —
9     —                  Alibaba Cloud / Qwen Team  8B      —        —
10    —                  Amazon                     —       300K     $0.06 / $0.24
11    —                  DeepSeek                   27B     —        —
12    —                  OpenAI                     —       128K     $2.50 / $10.00
13    —                  —                          90B     128K     $0.35 / $0.40
14    —                  Alibaba Cloud / Qwen Team  7B      —        —
15    —                  —                          16B     —        —
16    —                  —                          11B     128K     $0.05 / $0.05
17    —                  Mistral AI                 12B     128K     $0.15 / $0.15
17    —                  —                          4B      —        —
19    —                  —                          6B      128K     $0.05 / $0.10
20    —                  —                          3B      —        —
21    —                  —                          27B     131K     $0.10 / $0.20
22    —                  —                          —       —        —
23    —                  —                          12B     131K     $0.05 / $0.10
24    —                  —                          4B      131K     $0.02 / $0.04

FAQ

Common questions about ChartQA

What is ChartQA?
ChartQA is a large-scale benchmark comprising 9.6K human-written questions and 23.1K questions generated from human-written chart summaries, designed to evaluate models' abilities in visual and logical reasoning over charts.

Where can I find the ChartQA paper?
The ChartQA paper is available at https://arxiv.org/abs/2203.10244. It provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.

Which model leads the ChartQA leaderboard?
The ChartQA leaderboard ranks 24 AI models by their performance on this benchmark. Currently, Claude 3.5 Sonnet by Anthropic leads with a score of 0.908. The average score across all models is 0.842.

What is the highest ChartQA score?
The highest ChartQA score is 0.908, achieved by Claude 3.5 Sonnet from Anthropic.

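Scores like these are conventionally computed with the relaxed accuracy metric from the ChartQA paper: a numeric prediction counts as correct if it is within 5% of the gold value, while textual answers require an exact match. Below is a minimal sketch of that rule, simplified to single gold answers; the function name and string handling are illustrative, not the official scorer.

```python
def relaxed_accuracy(prediction: str, target: str, tolerance: float = 0.05) -> bool:
    """ChartQA-style relaxed match: numeric answers may deviate by up to
    5% from the gold value; everything else must match exactly
    (case-insensitive). Simplified sketch, not the official scorer."""
    try:
        pred_num = float(prediction.strip().rstrip("%"))
        target_num = float(target.strip().rstrip("%"))
        if target_num == 0:
            return pred_num == 0
        return abs(pred_num - target_num) / abs(target_num) <= tolerance
    except ValueError:
        # Non-numeric answers fall back to exact string matching.
        return prediction.strip().lower() == target.strip().lower()

# Example: 35.4 is within 5% of the gold value 36, so it counts as correct.
assert relaxed_accuracy("35.4", "36") is True
assert relaxed_accuracy("30", "36") is False
```
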
How many models have been evaluated on ChartQA?
24 models have been evaluated on the ChartQA benchmark, with 0 verified results and 24 self-reported results.

How is ChartQA categorized?
ChartQA is categorized under multimodal, reasoning, and vision; it evaluates multimodal (vision-language) models.