AIR-Bench

AIR-Bench 2024 is a safety benchmark grounded in risk categories derived from government regulations and company policies. It evaluates policy-grounded refusal across a broad regulatory and policy-derived harm taxonomy, using category-specific LLM-judge prompts that reward safe engagement rather than only penalizing unsafe responses.

MAI-Thinking-1 from Microsoft currently leads the AIR-Bench leaderboard with a score of 0.880. It is the only model evaluated so far.

About this benchmark

What AIR-Bench measures

AIR-Bench is a text benchmark that evaluates large language models on safety tasks. LLM Stats tracks 1 model on this benchmark, with a maximum possible score of 1. Because only one model is reported, the current average and the leading score are both 0.880.
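A score capped at 1 is what you would get by averaging per-response judge scores. The sketch below is a minimal illustration of that aggregation, not the benchmark's own code. It assumes each judged response receives a score between 0 and 1 (the values 0, 0.5, and 1 in the sample data are an assumption), and that scores are averaged within each risk category and over all prompts. The function name `aggregate_scores` and the category labels are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def aggregate_scores(judged):
    """Average judge scores per category and overall.

    judged: iterable of (category, score) pairs with score in [0, 1].
    The scoring scale is an assumption made for illustration.
    """
    by_category = defaultdict(list)
    for category, score in judged:
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score out of range: {score}")
        by_category[category].append(score)

    # Mean within each category.
    per_category = {c: mean(s) for c, s in by_category.items()}
    # Overall score as the mean over every judged prompt (micro-average).
    all_scores = [s for scores in by_category.values() for s in scores]
    overall = mean(all_scores)
    return per_category, overall


# Hypothetical judged responses: (risk category, judge score).
judged = [
    ("privacy", 1.0),
    ("privacy", 0.5),
    ("self-harm", 1.0),
    ("self-harm", 0.0),
]
per_category, overall = aggregate_scores(judged)
print(per_category, overall)
```

Averaging over all prompts weights large categories more heavily; averaging the per-category means instead would weight every category equally. Which of the two a given leaderboard reports is not stated here.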

Publication

Paper
AIR-Bench 2024: A Safety Benchmark Based on Risk Categories from Regulations and Policies
Authors
Yi Zeng, Yu Yang, Andy Zhou, Jeffrey Ziwei Tan, and 8 others

Abstract

Foundation models (FMs) provide societal benefits but also amplify risks. Governments, companies, and researchers have proposed regulatory frameworks, acceptable use policies, and safety benchmarks in response. However, existing public benchmarks often define safety categories based on previous literature, intuitions, or common sense, leading to disjointed sets of categories for risks specified in recent regulations and policies, which makes it challenging to evaluate and compare FMs across these benchmarks. To bridge this gap, we introduce AIR-Bench 2024, the first AI safety benchmark aligned with emerging government regulations and company policies, following the regulation-based safety categories grounded in our AI risks study, AIR 2024. AIR 2024 decomposes 8 government regulations and 16 company policies into a four-tiered safety taxonomy with 314 granular risk categories in the lowest tier. AIR-Bench 2024 contains 5,694 diverse prompts spanning these categories, with manual curation and human auditing to ensure quality. We evaluate leading language models on AIR-Bench 2024, uncovering insights into their alignment with specified safety concerns. By bridging the gap between public benchmarks and practical AI risks, AIR-Bench 2024 provides a foundation for assessing model safety across jurisdictions, fostering the development of safer and more responsible AI systems.
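To make the four-tiered taxonomy concrete, here is a small sketch that models it as nested dictionaries and counts the lowest-tier categories. The tier and category names are placeholders rather than the real AIR 2024 labels, and `count_leaves` is a hypothetical helper. Only the two totals at the end (314 categories, 5,694 prompts) come from the abstract.

```python
def count_leaves(node):
    """Count lowest-tier categories in a nested taxonomy.

    Dicts represent tiers 1 to 3; lists hold the tier-4 category names.
    """
    if isinstance(node, dict):
        return sum(count_leaves(child) for child in node.values())
    return len(node)


# Placeholder fragment: tier 1 -> tier 2 -> tier 3 -> list of tier-4 categories.
taxonomy = {
    "Tier-1 A": {
        "Tier-2 A1": {
            "Tier-3 A1a": ["leaf 1", "leaf 2"],
            "Tier-3 A1b": ["leaf 3"],
        },
    },
    "Tier-1 B": {
        "Tier-2 B1": {
            "Tier-3 B1a": ["leaf 4", "leaf 5", "leaf 6"],
        },
    },
}
leaf_count = count_leaves(taxonomy)

# Totals quoted in the abstract.
TOTAL_PROMPTS = 5694
LEAF_CATEGORIES = 314
avg_prompts_per_category = TOTAL_PROMPTS / LEAF_CATEGORIES
print(leaf_count, round(avg_prompts_per_category, 2))
```

With the abstract's totals, each lowest-tier category has about 18 prompts on average. The actual distribution across categories is not given here.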

MAI-Thinking-1 from Microsoft leads with 88.0%.

Progress Over Time

[Figure: interactive timeline of model performance on AIR-Bench, showing the state-of-the-art frontier with open and proprietary models marked separately.]

AIR-Bench Leaderboard

1 model
Context, Cost, License
1
Microsoft
1.0T

FAQ

Common questions about AIR-Bench.

What is the AIR-Bench benchmark?

AIR-Bench 2024 is a safety benchmark grounded in risk categories derived from government regulations and company policies. It evaluates policy-grounded refusal across a broad regulatory and policy-derived harm taxonomy, using category-specific LLM-judge prompts that reward safe engagement rather than only penalizing unsafe responses.

What is the AIR-Bench leaderboard?

The AIR-Bench leaderboard ranks AI models by their performance on this benchmark and currently lists 1 model. MAI-Thinking-1 by Microsoft leads with a score of 0.880. With only one model listed, the average score is also 0.880.

What is the highest AIR-Bench score?

The highest AIR-Bench score is 0.880, achieved by MAI-Thinking-1 from Microsoft.

How many models are evaluated on AIR-Bench?

1 model has been evaluated on the AIR-Bench benchmark. Its result is self-reported; there are no verified results yet.

Where can I find the AIR-Bench paper?

The AIR-Bench paper is available at https://arxiv.org/abs/2407.17436. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does AIR-Bench cover?

AIR-Bench is categorized under safety. The benchmark evaluates text models.

How recent are the AIR-Bench leaderboard results?

The AIR-Bench leaderboard was last updated in June 2026 and currently includes 1 evaluated model.

More evaluations to explore

Related benchmarks in the same category

CyberGym

CyberGym is a benchmark for evaluating AI agents on cybersecurity tasks, testing their ability to identify vulnerabilities, perform security analysis, and complete security-related challenges in a controlled environment.

safety
7 models
AttaQ

AttaQ is a unique dataset containing adversarial examples in the form of questions designed to provoke harmful or inappropriate responses from large language models. The benchmark evaluates safety vulnerabilities by using specialized clustering techniques that analyze both the semantic similarity of input attacks and the harmfulness of model responses, facilitating targeted improvements to model safety mechanisms.

safety
3 models
Cybersecurity CTFs

Cybersecurity Capture the Flag (CTF) benchmark for evaluating LLMs in offensive security challenges. Contains diverse cybersecurity tasks including cryptography, web exploitation, binary analysis, and forensics to assess AI capabilities in cybersecurity problem-solving.

safety
3 models
FigQA

FigQA is a multiple-choice benchmark on interpreting scientific figures from biology papers. It evaluates dual-use biological knowledge and multimodal reasoning relevant to bioweapons development.

safety, multimodal
3 models
XSTest

XSTest is a test suite designed to identify exaggerated safety behaviours in large language models. It comprises 450 prompts: 250 safe prompts across ten prompt types that well-calibrated models should not refuse to comply with, and 200 unsafe prompts as contrasts that models should refuse. The benchmark systematically evaluates whether models refuse to respond to clearly safe prompts due to overly cautious safety mechanisms.

safety
3 models
CyBench

CyBench is a suite of Capture-the-Flag (CTF) challenges measuring agentic cyber attack capabilities. It evaluates dual-use cybersecurity knowledge and measures the 'unguided success rate', where agents complete tasks end-to-end without guidance on appropriate subtasks.

safety
2 models