AutoLogi

AutoLogi is an automated method for synthesizing open-ended logic puzzles to evaluate reasoning abilities of Large Language Models. The benchmark addresses limitations of existing multiple-choice reasoning evaluations by featuring program-based verification and controllable difficulty levels. It includes 1,575 English and 883 Chinese puzzles, enabling more reliable evaluation that better distinguishes models' reasoning capabilities across languages.
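The key idea behind program-based verification is that each open-ended puzzle ships with a checker program, so a model's free-form answer is validated against the puzzle's constraints rather than matched against a single gold string. The sketch below is purely illustrative: the puzzle, the function name, and the answer format are assumptions, not the actual AutoLogi format.

```python
# Hypothetical sketch of program-based verification for a toy puzzle:
# seat A, B, C, D in a row so that A sits left of B
# and C is not adjacent to D.

def verify_seating(arrangement):
    """Return True iff the candidate arrangement satisfies
    every constraint of the (hypothetical) puzzle."""
    pos = {person: i for i, person in enumerate(arrangement)}
    return pos["A"] < pos["B"] and abs(pos["C"] - pos["D"]) > 1

# A model's answer is parsed into a structure and checked directly;
# any arrangement meeting the constraints counts as correct.
candidate = ["A", "C", "B", "D"]
print(verify_seating(candidate))  # prints True
```

Because the checker accepts every valid arrangement, this style of evaluation avoids the guessing floor and answer-leakage issues of multiple-choice formats.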

Paper: https://arxiv.org/abs/2502.16906


AutoLogi Leaderboard

2 models evaluated

Rank  Model             Organization  Size  Context  Cost (input / output)
1     Kimi K2 Instruct  Moonshot AI   1.0T  200K     $0.50 / $0.50

FAQ

Common questions about AutoLogi

What is AutoLogi?
AutoLogi is an automated method for synthesizing open-ended logic puzzles to evaluate the reasoning abilities of Large Language Models. The benchmark addresses limitations of existing multiple-choice reasoning evaluations by featuring program-based verification and controllable difficulty levels. It includes 1,575 English and 883 Chinese puzzles, enabling more reliable evaluation that better distinguishes models' reasoning capabilities across languages.

Where can I read the AutoLogi paper?
The AutoLogi paper is available at https://arxiv.org/abs/2502.16906. It provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.

Which model leads the AutoLogi leaderboard?
The AutoLogi leaderboard ranks 2 AI models by their performance on this benchmark. Currently, Kimi K2 Instruct by Moonshot AI leads with a score of 0.895, which is also the average score across all models.

What is the highest AutoLogi score?
The highest AutoLogi score is 0.895, achieved by Kimi K2 Instruct from Moonshot AI.

How many models have been evaluated?
2 models have been evaluated on the AutoLogi benchmark, with 0 verified results and 2 self-reported results.

What does AutoLogi evaluate?
AutoLogi is categorized under reasoning. The benchmark evaluates text models with multilingual support.