ZebraLogic
ZebraLogic is an evaluation framework for assessing large language models' logical reasoning capabilities through logic grid puzzles derived from constraint satisfaction problems (CSPs). The benchmark consists of 1,000 programmatically generated puzzles with controllable and quantifiable complexity, revealing a 'curse of complexity' where model accuracy declines significantly as problem complexity grows.
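Each puzzle assigns attribute values (names, drinks, etc.) to house positions under a set of natural-language clues, and a well-formed instance has exactly one satisfying assignment. A minimal sketch of this CSP formulation in Python follows; the 3-house puzzle and its clues are invented for illustration and are not drawn from the benchmark's 1,000 instances:

```python
# Brute-force search over a tiny ZebraLogic-style logic grid puzzle,
# treated as a constraint satisfaction problem. The puzzle here is a
# hypothetical example, not an actual benchmark instance.
from itertools import permutations

names = ["Alice", "Bob", "Carol"]
drinks = ["tea", "coffee", "milk"]

solutions = []
for name_order in permutations(names):
    for drink_order in permutations(drinks):
        # Map each attribute value to its house position (1-indexed).
        pos = {v: i + 1 for i, v in enumerate(name_order)}
        pos.update({v: i + 1 for i, v in enumerate(drink_order)})
        # Clues (constraints):
        #   1. Alice lives in house 1.
        #   2. The tea drinker lives immediately to the right of Bob.
        #   3. Coffee is drunk in house 2.
        if (pos["Alice"] == 1
                and pos["tea"] == pos["Bob"] + 1
                and pos["coffee"] == 2):
            solutions.append((name_order, drink_order))

# A well-formed puzzle leaves exactly one assignment standing.
print(len(solutions), solutions)
```

The benchmark's larger grids (more houses and attributes) make this search space grow combinatorially, which is the complexity axis the "curse of complexity" finding is measured along.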
ZebraLogic Leaderboard
8 models
| Rank | Organization | Parameters | Context | Cost (input / output) |
|---|---|---|---|---|
| 1 | Alibaba Cloud / Qwen Team | 236B | 262K | $0.45 / $3.49 |
| 2 | Meituan | 560B | 128K | $0.30 / $1.20 |
| 3 | Alibaba Cloud / Qwen Team | 235B | 262K | $0.15 / $0.80 |
| 4 | Meituan | 560B | 128K | $0.30 / $1.20 |
| 5 | Moonshot AI | 1.0T | 200K | $0.50 / $0.50 |
| 5 | Moonshot AI | 1.0T | — | — |
| 7 | MiniMax | 456B | 1.0M | $0.55 / $2.20 |
| 8 | MiniMax | 456B | — | — |
FAQ
Common questions about ZebraLogic
What is ZebraLogic?
ZebraLogic is an evaluation framework for assessing large language models' logical reasoning capabilities through logic grid puzzles derived from constraint satisfaction problems (CSPs). The benchmark consists of 1,000 programmatically generated puzzles with controllable and quantifiable complexity, revealing a "curse of complexity" where model accuracy declines significantly as problem complexity grows.

Where can I read the ZebraLogic paper?
The ZebraLogic paper is available at https://arxiv.org/abs/2502.01100. It provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.

Which model leads the ZebraLogic leaderboard?
The ZebraLogic leaderboard ranks 8 AI models based on their performance on this benchmark. Currently, Qwen3 VL 235B A22B Thinking by Alibaba Cloud / Qwen Team leads with a score of 0.973. The average score across all models is 0.903.

What is the highest ZebraLogic score?
The highest ZebraLogic score is 0.973, achieved by Qwen3 VL 235B A22B Thinking from Alibaba Cloud / Qwen Team.

How many models have been evaluated on ZebraLogic?
8 models have been evaluated on the ZebraLogic benchmark, with 0 verified results and 8 self-reported results.

What category does ZebraLogic fall under?
ZebraLogic is categorized under reasoning. The benchmark evaluates text models.