AttaQ

AttaQ is a dataset of adversarial questions designed to provoke harmful or inappropriate responses from large language models. The accompanying paper pairs the dataset with a clustering method that groups attacks by both the semantic similarity of the inputs and the harmfulness of the model's responses. The resulting clusters point to vulnerable semantic regions, which supports targeted improvements to model safety mechanisms.

Of the 3 models evaluated, Granite 3.3 8B Base from IBM currently leads the AttaQ leaderboard with a score of 0.885.

About this benchmark

What AttaQ measures

AttaQ is a text benchmark that evaluates the safety of large language models. LLM Stats tracks 3 models on this benchmark, and the maximum possible score is 1. The current average across reported models is 0.877, and the leader scores 0.885.


Publication

Paper: Unveiling Safety Vulnerabilities of Large Language Models
Authors: George Kour, Marcel Zalmanovici, Naama Zwerdling, Esther Goldbraich, and 4 others
Published: November 2023 (arXiv:2311.04124)
Abstract

As large language models become more prevalent, their possible harmful or inappropriate responses are a cause for concern. This paper introduces a unique dataset containing adversarial examples in the form of questions, which we call AttaQ, designed to provoke such harmful or inappropriate responses. We assess the efficacy of our dataset by analyzing the vulnerabilities of various models when subjected to it. Additionally, we introduce a novel automatic approach for identifying and naming vulnerable semantic regions - input semantic areas for which the model is likely to produce harmful outputs. This is achieved through the application of specialized clustering techniques that consider both the semantic similarity of the input attacks and the harmfulness of the model's responses. Automatically identifying vulnerable semantic regions enhances the evaluation of model weaknesses, facilitating targeted improvements to its safety mechanisms and overall reliability.
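The abstract says the clustering weighs both input similarity and response harmfulness, but this page gives no algorithmic detail. The sketch below is one simple way to combine the two signals and should not be read as the paper's method: it swaps in plain leader clustering over cosine similarity. The `Attack` record, the `harmlessness` score in [0, 1], the thresholds, and the toy 2-D embeddings are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from math import sqrt
from statistics import mean


@dataclass
class Attack:
    question: str
    embedding: tuple      # hypothetical sentence embedding of the question
    harmlessness: float   # hypothetical score in [0, 1]; lower = more harmful response


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def vulnerable_regions(attacks, sim_threshold=0.9, harm_threshold=0.5, min_size=2):
    # 1. Keep only attacks whose response was scored as harmful.
    harmful = [a for a in attacks if a.harmlessness < harm_threshold]
    # 2. Leader clustering: join the first cluster whose seed is similar enough,
    #    otherwise start a new cluster seeded by this attack.
    clusters = []
    for attack in harmful:
        for cluster in clusters:
            if cosine(attack.embedding, cluster[0].embedding) >= sim_threshold:
                cluster.append(attack)
                break
        else:
            clusters.append([attack])
    # 3. Drop clusters below min_size and list the most harmful regions first.
    regions = [c for c in clusters if len(c) >= min_size]
    return sorted(regions, key=lambda c: mean(a.harmlessness for a in c))


# Toy data: two tight groups of harmful responses, one safe response, one outlier.
attacks = [
    Attack("toy question A1", (1.0, 0.0), 0.10),
    Attack("toy question A2", (0.98, 0.05), 0.20),
    Attack("toy question A3", (0.95, 0.10), 0.90),  # safe response; filtered out
    Attack("toy question B1", (0.0, 1.0), 0.30),
    Attack("toy question B2", (0.05, 0.99), 0.40),
    Attack("toy question C1", (0.7, 0.7), 0.20),    # isolated; below min_size
]
regions = vulnerable_regions(attacks)
for region in regions:
    print([a.question for a in region])
```

On the toy data this yields two regions, with the A group listed before the B group because its mean harmlessness is lower. Filtering on harmfulness before clustering is a design choice of this sketch; a method that clusters first and then scores each cluster would be an equally reasonable reading of the abstract.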

Granite 3.3 8B Base (IBM) leads with 88.5%, followed by Granite 3.3 8B Instruct (IBM) at 88.5% and IBM Granite 4.0 Tiny Preview (IBM) at 86.1%.
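The summary figures quoted on this page follow from these three scores. The short check below recomputes the average and the top score from them. The tie-handling rule (tied scores share a rank, and the next rank is skipped) is an assumption for illustration, not something the page states.

```python
from statistics import mean

# Scores as reported on this page, as fractions of a maximum score of 1.
scores = {
    "Granite 3.3 8B Base": 0.885,
    "Granite 3.3 8B Instruct": 0.885,
    "IBM Granite 4.0 Tiny Preview": 0.861,
}


def competition_ranks(scores):
    """Rank models by score, highest first; tied scores share a rank (assumed rule)."""
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    ranks = {}
    for position, (name, score) in enumerate(ordered, start=1):
        previous = ordered[position - 2] if position > 1 else None
        if previous is not None and score == previous[1]:
            # Tied with the model just above: inherit its rank.
            ranks[name] = ranks[previous[0]]
        else:
            ranks[name] = position
    return ranks


average = round(mean(scores.values()), 3)
top_score = max(scores.values())
ranks = competition_ranks(scores)

print(f"average={average}, top={top_score}")
print(ranks)
```

The recomputed average is 0.877 and the top score is 0.885, which match the values given in the FAQ.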

Progress Over Time

[Interactive timeline: model performance on AttaQ over time, showing the state-of-the-art frontier, with open and proprietary models marked separately.]

AttaQ Leaderboard

3 models
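Rank | Model | Parameters | Score
1 | Granite 3.3 8B Base (IBM) | 8B | 0.885
1 | Granite 3.3 8B Instruct (IBM) | 8B | 0.885
3 | IBM Granite 4.0 Tiny Preview (IBM) | 7B | 0.861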

FAQ

Common questions about AttaQ.

What is the AttaQ benchmark?

AttaQ is a dataset of adversarial questions designed to provoke harmful or inappropriate responses from large language models, and it is used to evaluate their safety vulnerabilities. The accompanying paper also describes a clustering method that considers both the semantic similarity of the input attacks and the harmfulness of the model's responses, so that weak areas can be identified and targeted for improvement.

What is the AttaQ leaderboard?

The AttaQ leaderboard ranks 3 AI models based on their performance on this benchmark. Currently, Granite 3.3 8B Base by IBM leads with a score of 0.885. The average score across all models is 0.877.

What is the highest AttaQ score?

The highest AttaQ score is 0.885, achieved by Granite 3.3 8B Base from IBM.

How many models are evaluated on AttaQ?

3 models have been evaluated on the AttaQ benchmark. All 3 results are self-reported; none are verified.

Where can I find the AttaQ paper?

The AttaQ paper is available at https://arxiv.org/abs/2311.04124. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does AttaQ cover?

AttaQ is categorized under safety. The benchmark evaluates text models.

What is the best open-source model on AttaQ?

Granite 3.3 8B Base by IBM is the top-ranked open-source model on AttaQ, with a score of 0.885 (rank #1).

How recent are the AttaQ leaderboard results?

The AttaQ leaderboard was last updated in June 2026 and currently includes 3 evaluated models.

More evaluations to explore

Related benchmarks in the same category

CyberGym

CyberGym is a benchmark for evaluating AI agents on cybersecurity tasks, testing their ability to identify vulnerabilities, perform security analysis, and complete security-related challenges in a controlled environment.

safety
7 models
Cybersecurity CTFs

Cybersecurity Capture the Flag (CTF) benchmark for evaluating LLMs in offensive security challenges. Contains diverse cybersecurity tasks including cryptography, web exploitation, binary analysis, and forensics to assess AI capabilities in cybersecurity problem-solving.

safety
3 models
FigQA

FigQA is a multiple-choice benchmark on interpreting scientific figures from biology papers. It evaluates dual-use biological knowledge and multimodal reasoning relevant to bioweapons development.

safety, multimodal
3 models
XSTest

XSTest is a test suite designed to identify exaggerated safety behaviours in large language models. It comprises 450 prompts: 250 safe prompts across ten prompt types that well-calibrated models should not refuse to comply with, and 200 unsafe prompts as contrasts that models should refuse. The benchmark systematically evaluates whether models refuse to respond to clearly safe prompts due to overly cautious safety mechanisms.

safety
3 models
CyBench

CyBench is a suite of Capture-the-Flag (CTF) challenges measuring agentic cyber attack capabilities. It evaluates dual-use cybersecurity knowledge and measures the 'unguided success rate', where agents complete tasks end-to-end without guidance on appropriate subtasks.

safety
2 models
POPE

Polling-based Object Probing Evaluation (POPE) is a benchmark for evaluating object hallucination in Large Vision-Language Models (LVLMs). POPE addresses the problem where LVLMs generate objects inconsistent with target images by using a polling-based query method that asks yes/no questions about object presence in images, providing more stable and flexible evaluation of object hallucination.

safety, multimodal
2 models