CruxEval-O
CruxEval-O is the output prediction task of the CRUXEval benchmark, which evaluates code reasoning, understanding, and execution capabilities. It consists of 800 short Python functions (3-13 lines each); given a function and an input, the model must predict the resulting output. The task goes beyond code generation, testing whether a model understands how a program actually behaves when executed.
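To make the task concrete, here is a sketch of what an output-prediction item looks like: a short function, a fixed input, and an output the model must supply so the assertion passes. The function and input below are invented for illustration and are not an actual benchmark entry.

```python
# Illustrative CruxEval-O-style item (hypothetical, not from the benchmark).
# The model is shown the function and the input, and must predict the output
# that makes the assertion pass.

def f(text):
    # Collapse runs of repeated characters, keeping the first of each run.
    result = []
    for ch in text:
        if not result or result[-1] != ch:
            result.append(ch)
    return "".join(result)

# Output prediction: the model must fill in the right-hand side.
assert f("aabbccdd") == "abcd"
```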
Progress Over Time
Timeline of model performance on CruxEval-O over time, distinguishing open and proprietary models and marking the state-of-the-art frontier.
CruxEval-O Leaderboard
1 model • 0 verified
| Rank | Model | Organization | Parameters | Score | Context | Cost | License |
|---|---|---|---|---|---|---|---|
| 1 | Codestral-22B | Mistral AI | 22B | 0.513 | — | — | — |
FAQ
Common questions about CruxEval-O
What is CruxEval-O?
CruxEval-O is the output prediction task of the CRUXEval benchmark: given one of 800 short Python functions (3-13 lines) and an input, the model must predict the output. It tests code execution reasoning rather than code generation.
Where can I find the CruxEval-O paper?
The CRUXEval paper, which introduces the CruxEval-O task, is available at https://arxiv.org/abs/2401.03065. It describes the benchmark methodology, dataset construction, and evaluation criteria.
How are models ranked on the CruxEval-O leaderboard?
The CruxEval-O leaderboard currently ranks 1 AI model. Codestral-22B by Mistral AI leads with a score of 0.513, which is also the average score across all listed models.
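A score such as 0.513 can be read as the fraction of the 800 tasks whose output was predicted correctly. The sketch below illustrates that idea under simplifying assumptions; the function name `score` and the data layout are made up here, and the official CRUXEval harness uses its own evaluation code.

```python
# Minimal sketch of output-prediction scoring (assumed, not the official harness):
# the score is the fraction of tasks whose predicted output equals the actual
# result of running the function on its input.

def score(tasks, predictions):
    """tasks: list of (function, input) pairs; predictions: predicted outputs."""
    correct = 0
    for (func, arg), predicted in zip(tasks, predictions):
        if func(arg) == predicted:
            correct += 1
    return correct / len(tasks)

# Example: one task, one correct prediction -> score of 1.0.
tasks = [(lambda s: s.upper(), "abc")]
predictions = ["ABC"]
print(score(tasks, predictions))  # 1.0
```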
What is the highest CruxEval-O score?
The highest CruxEval-O score is 0.513, achieved by Codestral-22B from Mistral AI.
How many models have been evaluated on CruxEval-O?
1 model has been evaluated on the CruxEval-O benchmark, with 0 verified and 1 self-reported result.
What category does CruxEval-O fall under?
CruxEval-O is categorized under reasoning and evaluates text models.