XSTest
XSTest is a test suite designed to identify exaggerated safety behaviours in large language models. It comprises 450 prompts: 250 safe prompts across ten prompt types that well-calibrated models should not refuse to comply with, and 200 unsafe prompts as contrasts that models should refuse. The benchmark systematically evaluates whether models refuse to respond to clearly safe prompts due to overly cautious safety mechanisms.
Gemini 1.5 Pro from Google currently leads the XSTest leaderboard with a score of 0.988 across 3 evaluated AI models.
What XSTest measures
XSTest is a text benchmark that evaluates large language models on safety tasks. LLM Stats tracks 3 models on this benchmark, with a maximum possible score of 1. Current average across reported models is 1.0, with the leader reaching 1.0.
Compare leaders on the best AI for safety leaderboards.
Publication
- Paper
- XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models
- Authors
- Paul Röttger, Hannah Rose Kirk, Bertie Vidgen, Giuseppe Attanasio, and 2 others
- Published
- arXiv
- 2308.01263
Abstract
Without proper safeguards, large language models will readily follow malicious instructions and generate toxic content. This risk motivates safety efforts such as red-teaming and large-scale feedback learning, which aim to make models both helpful and harmless. However, there is a tension between these two objectives, since harmlessness requires models to refuse to comply with unsafe prompts, and thus not be helpful. Recent anecdotal evidence suggests that some models may have struck a poor balance, so that even clearly safe prompts are refused if they use similar language to unsafe prompts or mention sensitive topics. In this paper, we introduce a new test suite called XSTest to identify such eXaggerated Safety behaviours in a systematic way. XSTest comprises 250 safe prompts across ten prompt types that well-calibrated models should not refuse to comply with, and 200 unsafe prompts as contrasts that models, for most applications, should refuse. We describe XSTest's creation and composition, and then use the test suite to highlight systematic failure modes in state-of-the-art language models as well as more general challenges in building safer language models.
Gemini 1.5 Pro leads with 98.8%, followed by
Gemini 1.5 Flash at 97.0% and
Gemini 1.5 Flash 8B at 92.6%.
Progress Over Time
Interactive timeline showing model performance evolution on XSTest
XSTest Leaderboard
| Context | Cost | License | ||||
|---|---|---|---|---|---|---|
| 1 | Google | — | — | — | ||
| 2 | Google | — | — | — | ||
| 3 | Google | 8B | — | — |
FAQ
Common questions about XSTest.
More evaluations to explore
Related benchmarks in the same category
CyberGym is a benchmark for evaluating AI agents on cybersecurity tasks, testing their ability to identify vulnerabilities, perform security analysis, and complete security-related challenges in a controlled environment.
AttaQ is a unique dataset containing adversarial examples in the form of questions designed to provoke harmful or inappropriate responses from large language models. The benchmark evaluates safety vulnerabilities by using specialized clustering techniques that analyze both the semantic similarity of input attacks and the harmfulness of model responses, facilitating targeted improvements to model safety mechanisms.
Cybersecurity Capture the Flag (CTF) benchmark for evaluating LLMs in offensive security challenges. Contains diverse cybersecurity tasks including cryptography, web exploitation, binary analysis, and forensics to assess AI capabilities in cybersecurity problem-solving.
FigQA is a multiple-choice benchmark on interpreting scientific figures from biology papers. It evaluates dual-use biological knowledge and multimodal reasoning relevant to bioweapons development.
CyBench is a suite of Capture-the-Flag (CTF) challenges measuring agentic cyber attack capabilities. It evaluates dual-use cybersecurity knowledge and measures the 'unguided success rate', where agents complete tasks end-to-end without guidance on appropriate subtasks.
Polling-based Object Probing Evaluation (POPE) is a benchmark for evaluating object hallucination in Large Vision-Language Models (LVLMs). POPE addresses the problem where LVLMs generate objects inconsistent with target images by using a polling-based query method that asks yes/no questions about object presence in images, providing more stable and flexible evaluation of object hallucination.