
CRUXEval-Input-CoT

CRUXEval input prediction task with Chain of Thought (CoT) prompting. Part of the CRUXEval benchmark for code reasoning, understanding, and execution evaluation. Given a Python function and its expected output, the task is to predict the appropriate input using chain-of-thought reasoning. Consists of 800 Python functions (3-13 lines) designed to evaluate code comprehension and reasoning capabilities.
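To make the task concrete, here is a hypothetical item in the style of CRUXEval input prediction (not drawn from the actual dataset): the model sees a short Python function and an expected output, and must reason step by step to an input that reproduces that output.

```python
# Hypothetical CRUXEval-style item (illustrative, not from the dataset).
def f(lst):
    # Keep only the even numbers, then double each one.
    return [x * 2 for x in lst if x % 2 == 0]

# Task: find an input ?? such that f(??) == [4, 8].
# Chain-of-thought: the output [4, 8] means the surviving even elements
# were [2, 4] before doubling, so any list whose even elements are
# 2 followed by 4 works, e.g. [1, 2, 3, 4].
candidate = [1, 2, 3, 4]
assert f(candidate) == [4, 8]
```

Scoring simply checks whether the assertion passes, so many distinct inputs (here, `[2, 4]` or `[1, 2, 3, 4]`) can count as correct.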

Paper

Progress Over Time

Interactive timeline showing model performance evolution on CRUXEval-Input-CoT


CRUXEval-Input-CoT Leaderboard

1 model • 0 verified
1. Qwen2.5-Coder 7B Instruct — Alibaba Cloud / Qwen Team — 7B — score 0.565

FAQ

Common questions about CRUXEval-Input-CoT

What is CRUXEval-Input-CoT?
CRUXEval input prediction task with Chain of Thought (CoT) prompting. Part of the CRUXEval benchmark for code reasoning, understanding, and execution evaluation. Given a Python function and its expected output, the task is to predict an appropriate input using chain-of-thought reasoning. Consists of 800 Python functions (3-13 lines) designed to evaluate code comprehension and reasoning capabilities.

Where can I find the CRUXEval paper?
The CRUXEval paper is available at https://arxiv.org/abs/2401.03065. It provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.

How are models ranked on the leaderboard?
The CRUXEval-Input-CoT leaderboard ranks models by their score on this benchmark. Currently, Qwen2.5-Coder 7B Instruct by Alibaba Cloud / Qwen Team leads with a score of 0.565; as the only listed model, it also sets the average.

What is the highest score?
The highest CRUXEval-Input-CoT score is 0.565, achieved by Qwen2.5-Coder 7B Instruct from Alibaba Cloud / Qwen Team.

How many models have been evaluated?
One model has been evaluated on the CRUXEval-Input-CoT benchmark, with 0 verified results and 1 self-reported result.

What category does this benchmark belong to?
CRUXEval-Input-CoT is categorized under reasoning. The benchmark evaluates text models.