AdvancedIF

Name: AdvancedIF Leaderboard — AI Model Scores
Creator: LLM Stats
License: https://llm-stats.com/legal/terms-of-service

Progress Over Time

[Interactive timeline of model performance on AdvancedIF. Series: state-of-the-art frontier, open models, proprietary models.]

AdvancedIF Leaderboard

2 models

				Context	Cost	License
1	MAI-Thinking-1 Microsoft		1.0T	—	—
2	MAI-Code-1-Flash Microsoft		—	—	—

About this benchmark

What is AdvancedIF?

AdvancedIF is a rubric-based benchmark that measures complex, multi-turn, and system-prompted instruction-following ability. Responses are scored by a calibrated LLM judge against per-instruction rubrics.
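As a rough illustration of rubric-based scoring, the sketch below assumes a response counts as correct only when a judge marks every one of its rubric criteria as satisfied, and that the benchmark score is the fraction of such responses. Both the aggregation rule and the names (Example, toy_judge, benchmark_score) are assumptions made for illustration. The keyword check merely stands in for the calibrated LLM judge and is not how AdvancedIF is actually scored.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Example:
    """One model response paired with its per-instruction rubric criteria."""
    response: str
    rubrics: List[str]


# Hypothetical judge signature: (response, criterion) -> satisfied?
Judge = Callable[[str, str], bool]


def rubric_pass(example: Example, judge: Judge) -> bool:
    """Assumed rule: the response passes only if every criterion is satisfied."""
    return all(judge(example.response, criterion) for criterion in example.rubrics)


def benchmark_score(examples: List[Example], judge: Judge) -> float:
    """Fraction of responses that pass all of their rubric criteria (0 to 1)."""
    if not examples:
        raise ValueError("need at least one example")
    return sum(rubric_pass(e, judge) for e in examples) / len(examples)


def toy_judge(response: str, criterion: str) -> bool:
    """Toy stand-in for an LLM judge: a criterion is 'met' if it appears verbatim."""
    return criterion.lower() in response.lower()


examples = [
    Example("Bonjour, and welcome.", ["bonjour"]),          # meets its only criterion
    Example("A short answer.", ["short", "bullet list"]),   # misses one criterion
]
print(benchmark_score(examples, toy_judge))
```

In practice the judge would be a model call that returns a verdict per criterion; keeping it behind a plain callable makes the aggregation logic easy to test on its own.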

AdvancedIF is a text benchmark evaluating models on reasoning, instruction following, and general tasks. LLM Stats tracks 2 models on this benchmark, scored on a 0–1 scale. The current average is 0.782, with the leader at 0.850.

Compare leaders on the related leaderboards: best AI for reasoning, best AI for instruction following, and best AI for general tasks.

Current leaders

MAI-Thinking-1 from Microsoft currently leads the AdvancedIF leaderboard with a score of 0.850, ahead of the one other evaluated model.

MAI-Thinking-1 (Microsoft): 85.0%

MAI-Code-1-Flash (Microsoft): 71.4%

Source paper

Title: AdvancedIF: Rubric-Based Benchmarking and Reinforcement Learning for Advancing LLM Instruction Following
Authors: Yun He, Wenzhe Li, Hejia Zhang, Songlin Li, and 21 others
Published: November 13, 2025
arXiv: 2511.10507

Abstract

Recent progress in large language models (LLMs) has led to impressive performance on a range of tasks, yet advanced instruction following (IF), especially for complex, multi-turn, and system-prompted instructions, remains a significant challenge. Rigorous evaluation and effective training for such capabilities are hindered by the lack of high-quality, human-annotated benchmarks and reliable, interpretable reward signals. In this work, we introduce AdvancedIF (we will release this benchmark soon), a comprehensive benchmark featuring over 1,600 prompts and expert-curated rubrics that assess LLMs' ability to follow complex, multi-turn, and system-level instructions. We further propose RIFL (Rubric-based Instruction-Following Learning), a novel post-training pipeline that leverages rubric generation, a finetuned rubric verifier, and reward shaping to enable effective reinforcement learning for instruction following. Extensive experiments demonstrate that RIFL substantially improves the instruction-following abilities of LLMs, achieving a 6.7% absolute gain on AdvancedIF and strong results on public benchmarks. Our ablation studies confirm the effectiveness of each component in RIFL. This work establishes rubrics as a powerful tool for both training and evaluating advanced IF in LLMs, paving the way for more capable and reliable AI systems.

FAQ

Common questions about the AdvancedIF benchmark and leaderboard.

What is the AdvancedIF benchmark?

AdvancedIF is a rubric-based benchmark that measures complex, multi-turn, and system-prompted instruction-following ability. Responses are scored by a calibrated LLM judge against per-instruction rubrics.

What is the AdvancedIF leaderboard?

The AdvancedIF leaderboard ranks 2 AI models based on their performance on this benchmark. Currently, MAI-Thinking-1 by Microsoft leads with a score of 0.850. The average score across all models is 0.782.
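The leader and average quoted above can be reproduced from the two per-model scores listed under "Current leaders" (85.0% and 71.4%, written here on the 0–1 scale). This is only a small arithmetic check; the dictionary and variable names are illustrative and not part of any LLM Stats API.

```python
# Scores as listed on this page, converted from percentages to the 0-1 scale.
scores = {
    "MAI-Thinking-1": 0.850,
    "MAI-Code-1-Flash": 0.714,
}

# The leader is the model with the highest score.
leader, top_score = max(scores.items(), key=lambda item: item[1])

# The leaderboard average is the plain mean over all listed models.
average = sum(scores.values()) / len(scores)

print(f"leader: {leader} ({top_score:.3f})")
print(f"average: {average:.3f}")
```

With only two entries, the average is simply the midpoint between the two scores, (0.850 + 0.714) / 2 = 0.782.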

What is the highest AdvancedIF score?

The highest AdvancedIF score is 0.850, achieved by MAI-Thinking-1 from Microsoft.

How many models are evaluated on AdvancedIF?

Two models have been evaluated on the AdvancedIF benchmark. Both results are self-reported; none has been verified.

Where can I find the AdvancedIF paper?

The AdvancedIF paper is available at https://arxiv.org/abs/2511.10507. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does AdvancedIF cover?

AdvancedIF is categorized under reasoning, instruction following, and general. The benchmark evaluates text models.

How recent are the AdvancedIF leaderboard results?

The AdvancedIF leaderboard was last updated in July 2026 and currently includes 2 evaluated models.