ZebraLogic


Progress Over Time

[Interactive timeline: model performance on ZebraLogic over time. Legend: state-of-the-art frontier, open models, proprietary models.]

ZebraLogic Leaderboard

8 models

Rank  Organization               Parameters  Context  Cost  License
1     Alibaba Cloud / Qwen Team  236B
2                                560B
3     Alibaba Cloud / Qwen Team  235B
4                                560B
5     Moonshot AI                1.0T
5                                1.0T
7                                456B
8                                456B
About this benchmark

What is ZebraLogic?

ZebraLogic is an evaluation framework for assessing large language models' logical reasoning capabilities through logic grid puzzles derived from constraint satisfaction problems (CSPs). The benchmark consists of 1,000 programmatically generated puzzles with controllable and quantifiable complexity, revealing a 'curse of complexity' where model accuracy declines significantly as problem complexity grows.
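
To make the idea of a logic grid puzzle as a constraint satisfaction problem concrete, here is a minimal sketch. The three-house puzzle, its clues, and the function names are hypothetical illustrations, not items from the ZebraLogic dataset, and brute-force enumeration is used only because the puzzle is tiny.

```python
from itertools import permutations

# Hypothetical 3-house puzzle, invented for illustration only.
NAMES = ("Alice", "Bob", "Carol")
PETS = ("cat", "dog", "fish")


def satisfies_clues(names, pets):
    """names[i] and pets[i] describe house i, counted from the left (0-based)."""
    alice = names.index("Alice")
    bob = names.index("Bob")
    carol = names.index("Carol")
    return (
        pets[alice] == "cat"                # Clue 1: Alice owns the cat.
        and pets.index("dog") == bob + 1    # Clue 2: the dog lives directly right of Bob.
        and carol == 2                      # Clue 3: Carol lives in the rightmost house.
    )


def solve():
    """Enumerate every assignment and keep those that satisfy all clues."""
    return [
        list(zip(names, pets))
        for names in permutations(NAMES)
        for pets in permutations(PETS)
        if satisfies_clues(names, pets)
    ]


solutions = solve()
print(solutions)
```

Each attribute (name, pet) is a permutation of its values across the houses, and each clue is a constraint that rules out some assignments. A well-formed puzzle leaves exactly one assignment standing, which is what the solver returns here.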

ZebraLogic is a text-only reasoning benchmark. LLM Stats tracks 8 models on it, scored on a 0–1 scale. The current average is 0.903, and the leader scores 0.973.
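
As a rough illustration of how a score on the 0–1 scale can arise, the sketch below assumes the score is the fraction of puzzles whose predicted grid matches the reference exactly. That scoring rule, the function name `puzzle_accuracy`, and the toy grids are assumptions made for this example; the page itself does not state how scores are computed.

```python
def puzzle_accuracy(predictions, references):
    """Fraction of puzzles whose predicted grid equals the reference grid exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("no puzzles to score")
    solved = sum(pred == ref for pred, ref in zip(predictions, references))
    return solved / len(references)


# Hypothetical toy data: each grid maps a house number to (name, pet).
references = [
    {1: ("Alice", "cat"), 2: ("Bob", "dog")},
    {1: ("Bob", "fish"), 2: ("Alice", "cat")},
    {1: ("Carol", "dog"), 2: ("Bob", "cat")},
    {1: ("Alice", "fish"), 2: ("Carol", "dog")},
]
predictions = [
    {1: ("Alice", "cat"), 2: ("Bob", "dog")},
    {1: ("Bob", "fish"), 2: ("Alice", "cat")},
    {1: ("Bob", "cat"), 2: ("Carol", "dog")},   # wrong: houses swapped
    {1: ("Alice", "fish"), 2: ("Carol", "dog")},
]

score = puzzle_accuracy(predictions, references)
print(f"{score:.3f} on the 0-1 scale = {score:.1%}")
```

Under this assumption, a score of 0.973 would correspond to 97.3% of puzzles solved, which matches how the leaderboard shows the same result as a percentage.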


Current leaders

Qwen3 VL 235B A22B Thinking from Alibaba Cloud / Qwen Team currently leads the ZebraLogic leaderboard with a score of 0.973 across 8 evaluated AI models.

1. Qwen3 VL 235B A22B Thinking (Alibaba Cloud / Qwen Team): 97.3%
3. Qwen3-235B-A22B-Instruct-2507 (Alibaba Cloud / Qwen Team): 95.0%

Source paper

Title: ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning
Authors: Bill Yuchen Lin, Ronan Le Bras, Kyle Richardson, Ashish Sabharwal, and 3 others
Published: February 2025 (arXiv:2502.01100)
Abstract

We investigate the logical reasoning capabilities of large language models (LLMs) and their scalability in complex non-monotonic reasoning. To this end, we introduce ZebraLogic, a comprehensive evaluation framework for assessing LLM reasoning performance on logic grid puzzles derived from constraint satisfaction problems (CSPs). ZebraLogic enables the generation of puzzles with controllable and quantifiable complexity, facilitating a systematic study of the scaling limits of models such as Llama, o1 models, and DeepSeek-R1. By encompassing a broad range of search space complexities and diverse logical constraints, ZebraLogic provides a structured environment to evaluate reasoning under increasing difficulty. Our results reveal a significant decline in accuracy as problem complexity grows -- a phenomenon we term the curse of complexity. This limitation persists even with larger models and increased inference-time computation, suggesting inherent constraints in current LLM reasoning capabilities. Additionally, we explore strategies to enhance logical reasoning, including Best-of-N sampling, backtracking mechanisms, and self-verification prompts. Our findings offer critical insights into the scalability of LLM reasoning, highlight fundamental limitations, and outline potential directions for improvement.
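
The "search space complexities" mentioned in the abstract can be given a rough numerical feel. The sketch below assumes the search space of a puzzle with N houses and M attributes is counted as (N!)^M, that is, each attribute is an independent permutation of its values across the houses. Treat that formula and the function name `search_space_size` as assumptions for illustration, and check the paper for its exact definition.

```python
from math import factorial


def search_space_size(num_houses, num_attributes):
    """Count candidate grids, assuming one permutation of values per attribute."""
    if num_houses < 1 or num_attributes < 1:
        raise ValueError("a puzzle needs at least one house and one attribute")
    return factorial(num_houses) ** num_attributes


# Grid sizes chosen for illustration only.
for houses, attributes in [(2, 2), (3, 3), (4, 4), (6, 6)]:
    size = search_space_size(houses, attributes)
    print(f"{houses} houses x {attributes} attributes: {size:,} candidate grids")
```

Under this counting, going from a 3x3 grid to a 4x4 grid multiplies the number of candidate grids by more than a thousand, which gives some intuition for why accuracy can fall off sharply as puzzles grow.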

FAQ

Common questions about the ZebraLogic benchmark and leaderboard.

What is the ZebraLogic benchmark?

ZebraLogic is an evaluation framework for assessing large language models' logical reasoning capabilities through logic grid puzzles derived from constraint satisfaction problems (CSPs). The benchmark consists of 1,000 programmatically generated puzzles with controllable and quantifiable complexity, revealing a 'curse of complexity' where model accuracy declines significantly as problem complexity grows.

What is the ZebraLogic leaderboard?

The ZebraLogic leaderboard ranks 8 AI models based on their performance on this benchmark. Currently, Qwen3 VL 235B A22B Thinking by Alibaba Cloud / Qwen Team leads with a score of 0.973. The average score across all models is 0.903.

What is the highest ZebraLogic score?

The highest ZebraLogic score is 0.973, achieved by Qwen3 VL 235B A22B Thinking from Alibaba Cloud / Qwen Team.

How many models are evaluated on ZebraLogic?

Eight models have been evaluated on the ZebraLogic benchmark. All 8 results are self-reported; none are verified.

Where can I find the ZebraLogic paper?

The ZebraLogic paper is available at https://arxiv.org/abs/2502.01100. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does ZebraLogic cover?

ZebraLogic is categorized under reasoning. The benchmark evaluates text models.

What is the best open-source model on ZebraLogic?

Qwen3 VL 235B A22B Thinking by Alibaba Cloud / Qwen Team is the top-ranked open-source model on ZebraLogic, with a score of 0.973 (rank #1).

How recent are the ZebraLogic leaderboard results?

The ZebraLogic leaderboard was last updated in July 2026 and currently includes 8 evaluated models.