ZebraLogic

ZebraLogic is an evaluation framework for assessing large language models' logical reasoning capabilities through logic grid puzzles derived from constraint satisfaction problems (CSPs). The benchmark consists of 1,000 programmatically generated puzzles with controllable and quantifiable complexity, revealing a 'curse of complexity' where model accuracy declines significantly as problem complexity grows.
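
To make the puzzle family concrete, here is a minimal sketch of a zebra-style logic grid puzzle solved as a CSP by brute force. The 3-house, 2-attribute puzzle and its three clues are invented for illustration and are not drawn from the ZebraLogic dataset.

```python
from itertools import permutations

# A miniature zebra-style logic grid puzzle (3 houses, 2 attributes).
# Each attribute is an assignment of its values to houses 0..2, i.e. a
# permutation; the clues are constraints that prune candidate grids.
NAMES = ["Alice", "Bob", "Carol"]
DRINKS = ["coffee", "milk", "tea"]

def solutions():
    for names in permutations(NAMES):
        for drinks in permutations(DRINKS):
            if names.index("Alice") != 0:        # Clue 1: Alice lives in the first house.
                continue
            if drinks.index("milk") != 1:        # Clue 2: milk is drunk in the middle house.
                continue
            if names.index("Carol") != drinks.index("tea"):  # Clue 3: Carol drinks tea.
                continue
            yield list(zip((1, 2, 3), names, drinks))

for grid in solutions():
    print(grid)  # [(1, 'Alice', 'coffee'), (2, 'Bob', 'milk'), (3, 'Carol', 'tea')]
```

Real ZebraLogic puzzles scale this grid up to more houses and attributes; exhaustive enumeration like the above is only viable at toy sizes, which is exactly what makes the benchmark's larger instances hard.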

Paper: https://arxiv.org/abs/2502.01100

Progress Over Time

[Interactive timeline showing model performance evolution on ZebraLogic; legend: state-of-the-art frontier, open, and proprietary models]

ZebraLogic Leaderboard

8 models

Rank  Organization               Params  Context  Cost (input / output)  License
1     Alibaba Cloud / Qwen Team  236B    262K     $0.45 / $3.49          —
2     —                          560B    128K     $0.30 / $1.20          —
3     Alibaba Cloud / Qwen Team  235B    262K     $0.15 / $0.80          —
4     —                          560B    128K     $0.30 / $1.20          —
5     Moonshot AI                1.0T    200K     $0.50 / $0.50          —
5     —                          1.0T    —        —                      —
7     —                          456B    1.0M     $0.55 / $2.20          —
8     —                          456B    —        —                      —

FAQ

Common questions about ZebraLogic

What is ZebraLogic?
ZebraLogic is an evaluation framework for assessing large language models' logical reasoning capabilities through logic grid puzzles derived from constraint satisfaction problems (CSPs). The benchmark consists of 1,000 programmatically generated puzzles with controllable and quantifiable complexity, revealing a 'curse of complexity' where model accuracy declines significantly as problem complexity grows.
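
One way to see why complexity is quantifiable here: before any clue is applied, an N-house, M-attribute grid admits (N!)^M candidate assignments, one permutation per attribute. The sketch below computes this unconstrained search-space size; the function name is illustrative, not from the ZebraLogic codebase.

```python
from math import factorial

def search_space_size(houses: int, attributes: int) -> int:
    # Each of the `attributes` categories can be assigned to the
    # `houses` positions in houses! ways, independently of the others,
    # so the unconstrained search space has (houses!)**attributes grids.
    return factorial(houses) ** attributes

for n in range(2, 7):
    print(f"{n}x{n} grid: {search_space_size(n, n):,} candidate grids")
# 2x2: 4
# 3x3: 216
# 4x4: 331,776
# 5x5: 24,883,200,000
# 6x6: 139,314,069,504,000,000
```

This exponential growth is the backdrop for the 'curse of complexity': accuracy drops as the search space, and hence the required reasoning depth, grows.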

Where can I find the ZebraLogic paper?
The ZebraLogic paper is available at https://arxiv.org/abs/2502.01100. It details the benchmark methodology, dataset creation, and evaluation criteria.

How does the ZebraLogic leaderboard work?
The leaderboard ranks 8 AI models by their performance on the benchmark. Currently, Qwen3 VL 235B A22B Thinking by Alibaba Cloud / Qwen Team leads with a score of 0.973, and the average score across all models is 0.903.

What is the highest ZebraLogic score?
The highest score is 0.973, achieved by Qwen3 VL 235B A22B Thinking from Alibaba Cloud / Qwen Team.

How many models have been evaluated?
Eight models have been evaluated on ZebraLogic, with 0 verified results and 8 self-reported results.

What type of benchmark is ZebraLogic?
ZebraLogic is categorized under reasoning and evaluates text models.