AIR-Bench

Name: AIR-Bench Leaderboard — AI Model Scores
Creator: LLM Stats
License: https://llm-stats.com/legal/terms-of-service

AIR-Bench Leaderboard

1 model

				Context	Cost	License
1	MAI-Thinking-1 Microsoft		1.0T	—	—

About this benchmark

What is AIR-Bench?

AIR-Bench 2024 is a safety benchmark grounded in risk categories derived from government regulations and company policies. It evaluates policy-grounded refusal across a broad regulatory and policy-derived harm taxonomy, using category-specific LLM-judge prompts that reward safe engagement rather than only penalizing unsafe responses.
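To make the idea of category-level judging concrete, here is a minimal sketch of how per-prompt judge scores might be rolled up into per-category and overall numbers. It is an illustration under stated assumptions, not the benchmark's actual scoring code: the category names, the sample scores, the helper names `category_scores` and `overall_score`, and the choice of an unweighted mean over categories are all hypothetical and are not taken from this page.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical judge outputs: (risk_category, score), where each score is
# assumed to lie in [0, 1] and a higher score means a safer response.
judged = [
    ("privacy", 1.0),
    ("privacy", 0.5),
    ("self-harm", 1.0),
    ("self-harm", 0.0),
]


def category_scores(records):
    """Average the judge scores within each risk category."""
    by_category = defaultdict(list)
    for category, score in records:
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score out of range for {category!r}: {score}")
        by_category[category].append(score)
    return {cat: mean(scores) for cat, scores in by_category.items()}


def overall_score(records):
    """Unweighted mean over categories (an assumed aggregation rule)."""
    per_category = category_scores(records)
    return mean(list(per_category.values()))


print(category_scores(judged))
print(overall_score(judged))
```

Averaging over categories rather than over prompts keeps a category with many prompts from dominating the total. Whether AIR-Bench weights categories this way is not stated here.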

AIR-Bench is a text benchmark that evaluates models on safety tasks. LLM Stats tracks 1 model on this benchmark, scored on a 0–1 scale. With a single entry, the average and the top score are both 0.880.
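The summary figures on this page (model count, leader, top score, average, and the verified versus self-reported split) can all be derived from the list of leaderboard entries. The sketch below shows one way to do that. The `Entry` class and `summarize` function are hypothetical names introduced for illustration and do not describe any LLM Stats API; the single data row uses the values reported on this page.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Entry:
    """One leaderboard row (hypothetical structure)."""
    model: str
    organization: str
    score: float    # on the 0-1 scale used by the leaderboard
    verified: bool  # False means the result is self-reported


# The only entry currently listed on this page.
entries = [Entry("MAI-Thinking-1", "Microsoft", 0.880, verified=False)]


def summarize(rows):
    """Compute the headline statistics for a non-empty leaderboard."""
    if not rows:
        raise ValueError("leaderboard is empty")
    leader = max(rows, key=lambda e: e.score)
    return {
        "count": len(rows),
        "leader": leader.model,
        "top_score": leader.score,
        "average": mean(e.score for e in rows),
        "verified": sum(e.verified for e in rows),
        "self_reported": sum(not e.verified for e in rows),
    }


summary = summarize(entries)
print(summary)
print(f"{summary['top_score']:.1%}")  # the same score as a percentage
```

With only one model listed, the average necessarily equals the leader's score, which is why both figures are 0.880.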

Current leaders

MAI-Thinking-1 from Microsoft currently leads the AIR-Bench leaderboard with a score of 0.880. It is the only AI model evaluated so far.

MAI-Thinking-1 (Microsoft): 88.0%

Source paper

Title: AIR-Bench 2024: A Safety Benchmark Based on Risk Categories from Regulations and Policies
Authors: Yi Zeng, Yu Yang, Andy Zhou, Jeffrey Ziwei Tan, and 8 others
Published: July 11, 2024
arXiv: 2407.17436

Abstract

Foundation models (FMs) provide societal benefits but also amplify risks. Governments, companies, and researchers have proposed regulatory frameworks, acceptable use policies, and safety benchmarks in response. However, existing public benchmarks often define safety categories based on previous literature, intuitions, or common sense, leading to disjointed sets of categories for risks specified in recent regulations and policies, which makes it challenging to evaluate and compare FMs across these benchmarks. To bridge this gap, we introduce AIR-Bench 2024, the first AI safety benchmark aligned with emerging government regulations and company policies, following the regulation-based safety categories grounded in our AI risks study, AIR 2024. AIR 2024 decomposes 8 government regulations and 16 company policies into a four-tiered safety taxonomy with 314 granular risk categories in the lowest tier. AIR-Bench 2024 contains 5,694 diverse prompts spanning these categories, with manual curation and human auditing to ensure quality. We evaluate leading language models on AIR-Bench 2024, uncovering insights into their alignment with specified safety concerns. By bridging the gap between public benchmarks and practical AI risks, AIR-Bench 2024 provides a foundation for assessing model safety across jurisdictions, fostering the development of safer and more responsible AI systems.

FAQ

Common questions about the AIR-Bench benchmark and leaderboard.

What is the AIR-Bench leaderboard?

The AIR-Bench leaderboard ranks AI models by their performance on this benchmark. It currently lists 1 model, MAI-Thinking-1 by Microsoft, with a score of 0.880. With a single entry, the average score is also 0.880.

What is the highest AIR-Bench score?

The highest AIR-Bench score is 0.880, achieved by MAI-Thinking-1 from Microsoft.

How many models are evaluated on AIR-Bench?

One model has been evaluated on the AIR-Bench benchmark, with 0 verified results and 1 self-reported result.

Where can I find the AIR-Bench paper?

The AIR-Bench paper is available at https://arxiv.org/abs/2407.17436. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does AIR-Bench cover?

AIR-Bench is categorized under safety. The benchmark evaluates text models.

How recent are the AIR-Bench leaderboard results?

The AIR-Bench leaderboard was last updated in July 2026 and currently includes 1 evaluated model.