ARC-C

The AI2 Reasoning Challenge (ARC) Challenge Set is a multiple-choice question-answering benchmark containing grade-school level science questions that require advanced reasoning capabilities. ARC-C specifically contains questions that were answered incorrectly by both retrieval-based and word co-occurrence algorithms, making it a particularly challenging subset designed to test commonsense reasoning abilities in AI systems.
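For concreteness, here is a minimal sketch of what an ARC-Challenge item looks like, assuming the Hugging Face allenai/ai2_arc dataset layout (a question string, a choices dict of labels and texts, and a gold answerKey); this is an illustrative snippet, not part of the official benchmark tooling.

```python
# Minimal sketch: load the ARC-Challenge test split and print one item,
# assuming the Hugging Face "allenai/ai2_arc" dataset layout.
from datasets import load_dataset

arc_c = load_dataset("allenai/ai2_arc", "ARC-Challenge", split="test")

example = arc_c[0]
print(example["question"])
for label, text in zip(example["choices"]["label"], example["choices"]["text"]):
    print(f"  ({label}) {text}")
print("gold answer:", example["answerKey"])
```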

Paper: https://arxiv.org/abs/1803.05457

Progress Over Time

[Interactive timeline of model performance on ARC-C over time, marking the state-of-the-art frontier and distinguishing open from proprietary models.]

ARC-C Leaderboard

33 models
[Leaderboard table of 33 ranked models, with columns for parameter count, context window, input/output cost, and license.]

FAQ

Common questions about ARC-C

What is ARC-C?
The AI2 Reasoning Challenge (ARC) Challenge Set is a multiple-choice question-answering benchmark containing grade-school level science questions that require advanced reasoning capabilities. ARC-C specifically contains questions that were answered incorrectly by both retrieval-based and word co-occurrence algorithms, making it a particularly challenging subset designed to test commonsense reasoning abilities in AI systems.

Where can I find the ARC-C paper?
The ARC-C paper is available at https://arxiv.org/abs/1803.05457. It provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.

How are models ranked on the ARC-C leaderboard?
The leaderboard ranks 33 AI models by their score on this benchmark. Currently, Llama 3.1 405B Instruct by Meta leads with a score of 0.969. The average score across all models is 0.761. A sketch of the accuracy metric behind these scores follows the FAQ.

What is the highest ARC-C score?
The highest ARC-C score is 0.969, achieved by Llama 3.1 405B Instruct from Meta.

How many models have been evaluated on ARC-C?
33 models have been evaluated on the ARC-C benchmark, with 0 verified results and 33 self-reported results.

What categories does ARC-C fall under?
ARC-C is categorized under general and reasoning. The benchmark evaluates text models.
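As referenced above, the sketch below shows one common way an accuracy score such as 0.969 can be computed: count a question as correct only when the predicted choice label matches the gold answerKey exactly. The `predict` function is a hypothetical stand-in for whatever model is being evaluated, and this is an assumption about the metric rather than the leaderboard's exact evaluation harness.

```python
# Hedged sketch of exact-match accuracy over ARC-C answer keys.
# `predict` is a hypothetical callable: (question, choices) -> choice label.
def arc_c_accuracy(examples, predict):
    correct = sum(
        1 for ex in examples
        if predict(ex["question"], ex["choices"]) == ex["answerKey"]
    )
    return correct / len(examples)

# Toy usage with a dummy predictor that always answers "A":
toy = [{
    "question": "Which gas do plants absorb from the air?",
    "choices": {"label": ["A", "B"], "text": ["carbon dioxide", "oxygen"]},
    "answerKey": "A",
}]
print(arc_c_accuracy(toy, lambda question, choices: "A"))  # 1.0
```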