AutoLogi


Progress Over Time

[Interactive timeline: model performance evolution on AutoLogi. Legend: state-of-the-art frontier, open models, proprietary models.]

AutoLogi Leaderboard

2 models
Context   Cost   License
1
Moonshot AI
Moonshot AI
1.0T
11.0T
About this benchmark

What is AutoLogi?

AutoLogi is an automated method for synthesizing open-ended logic puzzles to evaluate reasoning abilities of Large Language Models. The benchmark addresses limitations of existing multiple-choice reasoning evaluations by featuring program-based verification and controllable difficulty levels. It includes 1,575 English and 883 Chinese puzzles, enabling more reliable evaluation that better distinguishes models' reasoning capabilities across languages.
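
To illustrate what program-based verification of an open-ended puzzle can look like, here is a minimal sketch. The seating puzzle, the names in PEOPLE, and the functions format_ok and verify are all hypothetical and are not taken from the AutoLogi dataset; the point is only that a program can accept any answer that satisfies the constraints, rather than matching one multiple-choice letter.

```python
from typing import Callable, Dict, List

# Hypothetical puzzle (not from AutoLogi): seat four people in seats 1-4.
PEOPLE = ["Ann", "Ben", "Cal", "Dee"]

Arrangement = Dict[str, int]  # person -> seat number


def format_ok(answer: Arrangement) -> bool:
    """Check that every person appears once and seats 1..4 are each used once."""
    return set(answer) == set(PEOPLE) and set(answer.values()) == {1, 2, 3, 4}


# Each constraint is a predicate over a well-formed arrangement.
CONSTRAINTS: List[Callable[[Arrangement], bool]] = [
    lambda a: a["Ann"] < a["Ben"],            # Ann sits to the left of Ben
    lambda a: abs(a["Cal"] - a["Dee"]) == 1,  # Cal and Dee are adjacent
    lambda a: a["Ben"] != 4,                  # Ben is not in the last seat
]


def verify(answer: Arrangement) -> bool:
    """Accept any arrangement that is well formed and meets every constraint."""
    return format_ok(answer) and all(check(answer) for check in CONSTRAINTS)
```

Under these assumed constraints more than one arrangement passes, for example both `{"Ann": 1, "Ben": 2, "Cal": 3, "Dee": 4}` and `{"Ann": 1, "Ben": 2, "Cal": 4, "Dee": 3}`, so a model cannot score by guessing among a handful of fixed options.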

AutoLogi is a text benchmark that evaluates models on reasoning tasks. LLM Stats tracks 2 models on this benchmark, scored on a 0–1 scale. The current average is 0.895, which is also the leading score.


Current leaders

Kimi K2 Instruct and Kimi K2-Instruct-0905, both from Moonshot AI, are tied at the top of the AutoLogi leaderboard with a score of 0.895 (89.5%). They are the only 2 models evaluated so far.

Rank  Model                  Organization  Score
1     Kimi K2 Instruct       Moonshot AI   89.5%
1     Kimi K2-Instruct-0905  Moonshot AI   89.5%

Source paper

Title: AutoLogi: Automated Generation of Logic Puzzles for Evaluating Reasoning Abilities of Large Language Models
Authors: Qin Zhu, Fei Huang, Runyu Peng, Keming Lu, and 5 others

Abstract

While logical reasoning evaluation of Large Language Models (LLMs) has attracted significant attention, existing benchmarks predominantly rely on multiple-choice formats that are vulnerable to random guessing, leading to overestimated performance and substantial performance fluctuations. To obtain more accurate assessments of models' reasoning capabilities, we propose an automated method for synthesizing open-ended logic puzzles, and use it to develop a bilingual benchmark, AutoLogi. Our approach features program-based verification and controllable difficulty levels, enabling more reliable evaluation that better distinguishes models' reasoning abilities. Extensive evaluation of eight modern LLMs shows that AutoLogi can better reflect true model capabilities, with performance scores spanning from 35% to 73% compared to the narrower range of 21% to 37% on the source multiple-choice dataset. Beyond benchmark creation, this synthesis method can generate high-quality training data by incorporating program verifiers into the rejection sampling process, enabling systematic enhancement of LLMs' reasoning capabilities across diverse datasets.
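
The abstract's closing sentence describes putting program verifiers inside a rejection sampling process to produce training data. A minimal sketch of such a loop follows. The toy verifier and the random generator that stands in for an LLM are both hypothetical, and the paper's actual pipeline is not reproduced here.

```python
import random
from typing import Callable, List, Tuple

Answer = Tuple[int, ...]


def verifier(answer: Answer) -> bool:
    """Toy program verifier: a permutation of 1..3 in which 1 appears before 3."""
    return sorted(answer) == [1, 2, 3] and answer.index(1) < answer.index(3)


def toy_generator(rng: random.Random) -> Answer:
    """Stand-in for sampling a model response: returns a random ordering."""
    items = [1, 2, 3]
    rng.shuffle(items)
    return tuple(items)


def rejection_sample(
    generate: Callable[[random.Random], Answer],
    verify: Callable[[Answer], bool],
    n_samples: int,
    rng: random.Random,
) -> List[Answer]:
    """Draw n_samples candidates and keep only those the verifier accepts."""
    kept: List[Answer] = []
    for _ in range(n_samples):
        candidate = generate(rng)
        if verify(candidate):
            kept.append(candidate)
    return kept


if __name__ == "__main__":
    accepted = rejection_sample(toy_generator, verifier, 20, random.Random(0))
    print(f"kept {len(accepted)} of 20 candidates")
```

In this sketch the verifier is the only filter: every kept sample is guaranteed to pass it, and rejected samples are simply discarded rather than repaired.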

FAQ

Common questions about the AutoLogi benchmark and leaderboard.

What is the AutoLogi benchmark?

AutoLogi is an automated method for synthesizing open-ended logic puzzles to evaluate reasoning abilities of Large Language Models. The benchmark addresses limitations of existing multiple-choice reasoning evaluations by featuring program-based verification and controllable difficulty levels. It includes 1,575 English and 883 Chinese puzzles, enabling more reliable evaluation that better distinguishes models' reasoning capabilities across languages.
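
The answer above mentions controllable difficulty but does not say how it is controlled. One plausible knob, assumed here purely for illustration, is the number of constraints a puzzle imposes: each added constraint shrinks the set of acceptable answers. The sketch below counts valid answers for a hypothetical ordering puzzle as constraints are added one at a time.

```python
from itertools import permutations
from typing import Callable, List, Sequence, Tuple

Order = Tuple[str, ...]
ITEMS = ("A", "B", "C", "D")

# Hypothetical constraint pool over orderings of four items (not from AutoLogi).
CONSTRAINT_POOL: List[Callable[[Order], bool]] = [
    lambda o: o.index("A") < o.index("B"),            # A comes before B
    lambda o: o[0] != "C",                            # C is not first
    lambda o: abs(o.index("C") - o.index("D")) == 1,  # C and D are adjacent
]


def count_solutions(constraints: Sequence[Callable[[Order], bool]]) -> int:
    """Count the orderings of ITEMS that satisfy every given constraint."""
    return sum(
        1
        for order in permutations(ITEMS)
        if all(check(order) for check in constraints)
    )


def solution_counts_by_level() -> List[int]:
    """Number of valid answers when using the first k constraints, k = 0..len(pool)."""
    return [
        count_solutions(CONSTRAINT_POOL[:k])
        for k in range(len(CONSTRAINT_POOL) + 1)
    ]
```

Because constraints are only ever added, the counts returned by `solution_counts_by_level()` can never increase from one level to the next, which gives a simple, checkable notion of "harder" under this assumption.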

What is the AutoLogi leaderboard?

The AutoLogi leaderboard ranks 2 AI models based on their performance on this benchmark. Currently, Kimi K2 Instruct by Moonshot AI leads with a score of 0.895. The average score across all models is 0.895.
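
Both entries on this page share the same score, so any ranking has to handle ties. The sketch below uses the two scores listed here and applies standard competition ranking, where equal scores share a rank. That tie rule and the function name rank_with_ties are assumptions for illustration, not a description of how LLM Stats computes its table.

```python
from statistics import mean
from typing import Dict, List, Tuple

# Scores as listed on this page (0-1 scale).
SCORES: Dict[str, float] = {
    "Kimi K2 Instruct": 0.895,
    "Kimi K2-Instruct-0905": 0.895,
}


def rank_with_ties(scores: Dict[str, float]) -> List[Tuple[int, str, float]]:
    """Sort by score, highest first; models with equal scores share a rank."""
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    ranked: List[Tuple[int, str, float]] = []
    for position, (name, score) in enumerate(ordered, start=1):
        if ranked and ranked[-1][2] == score:
            rank = ranked[-1][0]  # tie: reuse the previous rank
        else:
            rank = position
        ranked.append((rank, name, score))
    return ranked


if __name__ == "__main__":
    for rank, name, score in rank_with_ties(SCORES):
        print(f"{rank}  {name}  {score:.1%}")
    print(f"average: {mean(SCORES.values()):.3f}")
```

With only two identical scores, the average equals the top score, which is why the leaderboard reports 0.895 for both.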

What is the highest AutoLogi score?

The highest AutoLogi score is 0.895, achieved by Kimi K2 Instruct from Moonshot AI.

How many models are evaluated on AutoLogi?

Two models have been evaluated on the AutoLogi benchmark. Both results are self-reported; none is verified.

Where can I find the AutoLogi paper?

The AutoLogi paper is available at https://arxiv.org/abs/2502.16906. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does AutoLogi cover?

AutoLogi is categorized under reasoning. It is a text benchmark with puzzles in two languages, English and Chinese.

What is the best open-source model on AutoLogi?

Kimi K2 Instruct by Moonshot AI is the top-ranked open-source model on AutoLogi, with a score of 0.895 (rank #1).

How recent are the AutoLogi leaderboard results?

The AutoLogi leaderboard was last updated in July 2026 and currently includes 2 evaluated models.