AutoLogi
AutoLogi is an automated method for synthesizing open-ended logic puzzles to evaluate the reasoning abilities of large language models. The benchmark addresses limitations of existing multiple-choice reasoning evaluations by featuring program-based verification and controllable difficulty levels. It includes 1,575 English and 883 Chinese puzzles, enabling more reliable evaluation that better distinguishes models' reasoning capabilities across languages.
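The idea behind program-based verification can be illustrated with a minimal sketch: instead of comparing a model's answer against a multiple-choice key, an executable checker tests the answer against the puzzle's constraints. The toy puzzle, constraint functions, and `verify` helper below are hypothetical illustrations, not the benchmark's actual code:

```python
# Hypothetical sketch of program-based verification: a model's open-ended
# answer (here, a seating arrangement) is accepted or rejected by running
# executable constraint checks rather than matching an answer key.

def verify(arrangement):
    """Return True iff the candidate arrangement satisfies all constraints."""
    people = ["Alice", "Bob", "Carol", "Dave"]
    # Format check: the answer must be a permutation of the four people.
    if sorted(arrangement) != sorted(people):
        return False
    constraints = [
        # Alice sits somewhere to the left of Bob.
        lambda a: a.index("Alice") < a.index("Bob"),
        # Carol is not seated at either end.
        lambda a: a.index("Carol") not in (0, len(a) - 1),
    ]
    return all(check(arrangement) for check in constraints)

print(verify(["Alice", "Carol", "Bob", "Dave"]))  # True: all constraints hold
print(verify(["Bob", "Carol", "Alice", "Dave"]))  # False: Alice is right of Bob
```

Under this scheme, difficulty is controllable by adding or tightening constraint functions, and any well-formed answer can be graded automatically.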
Progress Over Time
[Interactive timeline showing model performance evolution on AutoLogi, with a state-of-the-art frontier line and markers distinguishing open from proprietary models]
AutoLogi Leaderboard
2 models
| Rank | Model | Organization | Params | Context | Cost | License | Score |
|---|---|---|---|---|---|---|---|
| 1 | Kimi K2 Instruct | Moonshot AI | 1.0T | 200K | $0.50 / $0.50 | — | 0.895 |
| 1 | — | Moonshot AI | 1.0T | — | — | — | 0.895 |
FAQ
Common questions about AutoLogi
What is AutoLogi?
AutoLogi is an automated method for synthesizing open-ended logic puzzles to evaluate the reasoning abilities of large language models. The benchmark addresses limitations of existing multiple-choice reasoning evaluations by featuring program-based verification and controllable difficulty levels. It includes 1,575 English and 883 Chinese puzzles, enabling more reliable evaluation that better distinguishes models' reasoning capabilities across languages.
Where can I find the AutoLogi paper?
The AutoLogi paper is available at https://arxiv.org/abs/2502.16906. It details the benchmark's methodology, dataset construction, and evaluation criteria.
How are models ranked on the AutoLogi leaderboard?
The AutoLogi leaderboard ranks 2 AI models by their benchmark scores. Kimi K2 Instruct by Moonshot AI currently leads with a score of 0.895, which is also the average score across all models.
What is the highest AutoLogi score?
The highest AutoLogi score is 0.895, achieved by Kimi K2 Instruct from Moonshot AI.
How many models have been evaluated on AutoLogi?
2 models have been evaluated on the AutoLogi benchmark; both results are self-reported rather than independently verified.
What does AutoLogi evaluate?
AutoLogi is categorized under reasoning. The benchmark evaluates text models, with multilingual (English and Chinese) support.