MASK
MASK is a collection of 1000 questions measuring whether models faithfully report their beliefs when pressured to lie. It operationalizes deception as the rate at which the model lies, i.e., knowingly making false statements intended to be received as true. Lower dishonesty rates indicate better honesty.
Progress Over Time
Interactive timeline showing model performance evolution on MASK
State-of-the-art frontier
Open
Proprietary
MASK Leaderboard
1 models
| Context | Cost | License | ||||
|---|---|---|---|---|---|---|
| 1 | — | 256K | $3.00 / $15.00 |
Notice missing or incorrect data?
FAQ
Common questions about MASK
MASK is a collection of 1000 questions measuring whether models faithfully report their beliefs when pressured to lie. It operationalizes deception as the rate at which the model lies, i.e., knowingly making false statements intended to be received as true. Lower dishonesty rates indicate better honesty.
The MASK paper is available at https://arxiv.org/abs/2503.03750. This paper provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.
The MASK leaderboard ranks 1 AI models based on their performance on this benchmark. Currently, Grok-4.1 Thinking by xAI leads with a score of 0.510. The average score across all models is 0.510.
The highest MASK score is 0.510, achieved by Grok-4.1 Thinking from xAI.
1 models have been evaluated on the MASK benchmark, with 0 verified results and 1 self-reported results.
MASK is categorized under reasoning and safety. The benchmark evaluates text models.