CharXiv-R
CharXiv-R is the reasoning component of the CharXiv benchmark, focusing on complex reasoning questions that require synthesizing information across visual chart elements. It evaluates multimodal large language models on their ability to understand and reason about scientific charts from arXiv papers through various reasoning tasks.
Progress Over Time

[Interactive timeline: model performance evolution on CharXiv-R, showing the state-of-the-art frontier with open vs. proprietary models.]
CharXiv-R Leaderboard
30 models
| Rank | Model / Organization | Params | Context | Cost (input / output) |
|---|---|---|---|---|
| 1 | Anthropic | — | — | $25.00 / $125.00 |
| 2 | Muse Spark (New) / Meta | — | — | — |
| 3 | OpenAI | — | 400K | $1.75 / $14.00 |
| 4 | Alibaba Cloud / Qwen Team | — | — | — |
| 5 | Google | — | — | — |
| 6 | OpenAI | — | 400K | $1.25 / $10.00 |
| 7 | Google | — | 1.0M | $0.50 / $3.00 |
| 8 | Alibaba Cloud / Qwen Team | 27B | 262K | $0.30 / $2.40 |
| 9 | OpenAI | — | 200K | $2.00 / $8.00 |
| 10 | Moonshot AI | 1.0T | 262K | $0.60 / $2.50 |
| 10 | Alibaba Cloud / Qwen Team | 35B | 262K | $0.25 / $2.00 |
| 12 | Anthropic | — | 1.0M | $5.00 / $25.00 |
| 13 | Alibaba Cloud / Qwen Team | 122B | 262K | $0.40 / $3.20 |
| 14 | Google | — | 1.0M | $0.25 / $1.50 |
| 15 | OpenAI | — | 200K | $1.10 / $4.40 |
| 16 | Alibaba Cloud / Qwen Team | 236B | 262K | $0.45 / $3.49 |
| 17 | Alibaba Cloud / Qwen Team | 33B | — | — |
| 18 | Alibaba Cloud / Qwen Team | 33B | — | — |
| 19 | Alibaba Cloud / Qwen Team | 236B | 262K | $0.30 / $1.49 |
| 20 | OpenAI | — | 128K | $2.50 / $10.00 |
| 21 | OpenAI | — | 1.0M | $0.40 / $1.60 |
| 22 | OpenAI | — | 1.0M | $2.00 / $8.00 |
| 23 | Alibaba Cloud / Qwen Team | 31B | 262K | $0.20 / $1.00 |
| 24 | OpenAI | — | 128K | $75.00 / $150.00 |
| 25 | Alibaba Cloud / Qwen Team | 9B | 262K | $0.18 / $2.09 |
| 26 | Alibaba Cloud / Qwen Team | 4B | 262K | $0.10 / $1.00 |
| 27 | Alibaba Cloud / Qwen Team | 31B | 262K | $0.20 / $0.70 |
| 28 | Alibaba Cloud / Qwen Team | 9B | 262K | $0.08 / $0.50 |
| 29 | OpenAI | — | 1.0M | $0.10 / $0.40 |
| 30 | Alibaba Cloud / Qwen Team | 4B | 262K | $0.10 / $0.60 |
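The pricing column can be turned into a per-request cost estimate. The sketch below assumes the two dollar figures are input and output prices per 1M tokens (a common pricing convention, not stated explicitly on this page); the token counts in the example are hypothetical.

```python
# Sketch: estimating the dollar cost of one request from leaderboard pricing.
# Assumption: the "Cost" column lists input / output prices per 1M tokens.

def request_cost(input_tokens: int, output_tokens: int,
                 price_in: float, price_out: float) -> float:
    """Return the cost in dollars of a single request."""
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000

# Hypothetical example using the rank-8 row ($0.30 / $2.40):
# a 50K-token chart-reasoning prompt with a 2K-token answer.
cost = request_cost(50_000, 2_000, 0.30, 2.40)
print(f"${cost:.4f}")  # → $0.0198
```

This makes it easy to see why the input/output split matters: for chart reasoning, where prompts (images plus context) are typically much longer than answers, the input price dominates the bill.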
FAQ
Common questions about CharXiv-R
**What is CharXiv-R?**
CharXiv-R is the reasoning component of the CharXiv benchmark. It poses complex reasoning questions that require synthesizing information across visual chart elements, evaluating multimodal large language models on scientific charts drawn from arXiv papers.
**Where can I read the CharXiv-R paper?**
The CharXiv-R paper is available at https://arxiv.org/abs/2406.18521. It details the benchmark methodology, dataset creation, and evaluation criteria.
**How do models rank on CharXiv-R?**
The CharXiv-R leaderboard ranks 30 AI models by their performance on this benchmark. Currently, Claude Mythos Preview by Anthropic leads with a score of 0.932; the average score across all models is 0.673.
**What is the highest CharXiv-R score?**
The highest CharXiv-R score is 0.932, achieved by Claude Mythos Preview from Anthropic.
**How many models have been evaluated on CharXiv-R?**
30 models have been evaluated on the CharXiv-R benchmark, with 0 verified results and 30 self-reported results.
**What categories does CharXiv-R fall under?**
CharXiv-R is categorized under vision, multimodal, and reasoning, and it evaluates multimodal models.