IFEval

Instruction-Following Evaluation (IFEval) is a benchmark for large language models that focuses on verifiable instructions. It covers 25 instruction types and around 500 prompts, each containing one or more verifiable constraints.

Qwen3.5-27B from Alibaba Cloud / Qwen Team currently leads the IFEval leaderboard with a score of 0.950 across 64 evaluated AI models.

About this benchmark

What IFEval measures

IFEval is a text benchmark that evaluates large language models in the general, instruction-following, and structured-output categories. LLM Stats tracks 64 models on this benchmark, with a maximum possible score of 1. The current average across reported models is 0.844, and the leader scores 0.950.


Publication

Paper: Instruction-Following Evaluation for Large Language Models
Authors: Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, and 4 others
Published: November 2023 (arXiv:2311.07911)

Abstract

One core capability of Large Language Models (LLMs) is to follow natural language instructions. However, the evaluation of such abilities is not standardized: Human evaluations are expensive, slow, and not objectively reproducible, while LLM-based auto-evaluation is potentially biased or limited by the ability of the evaluator LLM. To overcome these issues, we introduce Instruction-Following Eval (IFEval) for large language models. IFEval is a straightforward and easy-to-reproduce evaluation benchmark. It focuses on a set of "verifiable instructions" such as "write in more than 400 words" and "mention the keyword of AI at least 3 times". We identified 25 types of those verifiable instructions and constructed around 500 prompts, with each prompt containing one or more verifiable instructions. We show evaluation results of two widely available LLMs on the market. Our code and data can be found at https://github.com/google-research/google-research/tree/master/instruction_following_eval
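The "verifiable instructions" named in the abstract can be checked with plain code rather than a judge model. The following is a minimal sketch of two such checks, assuming whitespace-delimited word counting and case-insensitive whole-word keyword matching. The function names are hypothetical, and this is not the reference implementation from the linked repository.

```python
import re


def has_more_than_n_words(response: str, n: int) -> bool:
    """Check an instruction such as "write in more than 400 words"."""
    # Assumption: words are whitespace-delimited tokens.
    return len(response.split()) > n


def mentions_keyword_at_least(response: str, keyword: str, k: int) -> bool:
    """Check an instruction such as "mention the keyword of AI at least 3 times"."""
    # Assumption: case-insensitive, whole-word matches only.
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return len(re.findall(pattern, response, flags=re.IGNORECASE)) >= k


def follows_all(response: str, checks) -> bool:
    """A prompt may carry several instructions; every check must pass."""
    return all(check(response) for check in checks)
```

A prompt that combines both constraints could then be scored by passing `follows_all` a list such as `[lambda r: has_more_than_n_words(r, 400), lambda r: mentions_keyword_at_least(r, "AI", 3)]`. Because each check is deterministic, the result is reproducible without human or LLM grading.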

Qwen3.5-27B (Alibaba Cloud / Qwen Team) leads with 95.0%, followed by Qwen3.7 Max and Qwen3.6 Plus, both also from Alibaba Cloud / Qwen Team, at 94.3% each.

Progress Over Time

[Interactive timeline: model performance on IFEval over time, marking the state-of-the-art frontier and distinguishing open from proprietary models.]

IFEval Leaderboard

64 models (showing 1-50 of 64, page 1 of 2)

Rank  Organization               Params  Context  Cost (USD per 1M tokens, input / output)
1     Alibaba Cloud / Qwen Team  27B     262K     $0.30 / $2.40
2     Alibaba Cloud / Qwen Team  -       1.0M     $1.25 / $3.75
2     Alibaba Cloud / Qwen Team  -       1.0M     $0.50 / $3.00
4     OpenAI                     -       -        -
5     Alibaba Cloud / Qwen Team  122B    262K     $0.40 / $3.20
6     -                          -       -        -
7     Alibaba Cloud / Qwen Team  397B    262K     $0.60 / $3.60
8     -                          70B     -        -
8     Amazon                     -       -        -
10    Alibaba Cloud / Qwen Team  35B     262K     $0.25 / $2.00
11    Alibaba Cloud / Qwen Team  9B      -        -
12    -                          27B     -        -
13    -                          9B      -        -
14    -                          4B      -        -
15    -                          1.0T    -        -
15    Moonshot AI                1.0T    -        -
15    Alibaba Cloud / Qwen Team  4B      -        -
18    Amazon                     -       -        -
19    -                          560B    128K     $0.30 / $1.20
20    -                          253B    -        -
21    Alibaba Cloud / Qwen Team  80B     -        -
21    -                          12B     -        -
23    Alibaba Cloud / Qwen Team  235B    -        -
24    -                          405B    -        -
25    Alibaba Cloud / Qwen Team  236B    -        -
25    OpenAI                     -       -        -
27    Alibaba Cloud / Qwen Team  33B     -        -
27    Alibaba Cloud / Qwen Team  235B    -        -
27    Alibaba Cloud / Qwen Team  236B    262K     $0.30 / $1.50
30    Alibaba Cloud / Qwen Team  80B     -        -
31    -                          70B     -        -
32    OpenAI                     -       1.0M     $2.00 / $8.00
33    Moonshot AI                -       -        -
33    -                          -       -        -
35    DeepSeek                   671B    -        -
36    Alibaba Cloud / Qwen Team  31B     -        -
37    -                          14B     -        -
38    Sarvam AI                  105B    -        -
39    Alibaba Cloud / Qwen Team  33B     -        -
40    -                          -       1.0M     $0.40 / $1.60
40    Alibaba Cloud / Qwen Team  73B     -        -
42    Alibaba Cloud / Qwen Team  33B     -        -
43    Alibaba Cloud / Qwen Team  9B      262K     $0.08 / $0.50
44    -                          14B     -        -
45    Alibaba Cloud / Qwen Team  9B      262K     $0.18 / $2.09
46    -                          24B     -        -
47    Alibaba Cloud / Qwen Team  4B      262K     $0.10 / $1.00
48    Alibaba Cloud / Qwen Team  4B      262K     $0.10 / $0.60
49    Alibaba Cloud / Qwen Team  31B     -        -
50    OpenAI                     -       128K     $2.50 / $10.00

FAQ

Common questions about IFEval.

What is the IFEval benchmark?

Instruction-Following Evaluation (IFEval) is a benchmark for large language models that focuses on verifiable instructions. It covers 25 instruction types and around 500 prompts, each containing one or more verifiable constraints.

What is the IFEval leaderboard?

The IFEval leaderboard ranks 64 AI models based on their performance on this benchmark. Currently, Qwen3.5-27B by Alibaba Cloud / Qwen Team leads with a score of 0.950. The average score across all models is 0.844.

What is the highest IFEval score?

The highest IFEval score is 0.950, achieved by Qwen3.5-27B from Alibaba Cloud / Qwen Team.

How many models are evaluated on IFEval?

64 models have been evaluated on the IFEval benchmark, with 0 verified results and 64 self-reported results.

Where can I find the IFEval paper?

The IFEval paper is available at https://arxiv.org/abs/2311.07911. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does IFEval cover?

IFEval is categorized under general, instruction following, and structured output. The benchmark evaluates text models.

What is the best open-source model on IFEval?

Qwen3.5-27B by Alibaba Cloud / Qwen Team is the top-ranked open-source model on IFEval, with a score of 0.950 (rank #1).

Which model offers the best value on IFEval?

Among models scoring within 10% of the leader, Qwen3.5-35B-A3B from Alibaba Cloud / Qwen Team is the cheapest, at $0.25 per million input tokens with a score of 0.919.
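The selection rule in this answer can be sketched as a filter followed by a minimum. The example below assumes "within 10% of the leader" means a score of at least 90% of the top score, which is an interpretation rather than something the page states. It uses the two models whose name, score, and input price appear on this page, plus one made-up row; the `Entry` type and `best_value` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    name: str
    score: float        # IFEval score on a 0-1 scale
    input_price: float  # USD per million input tokens


def best_value(entries, tolerance=0.10):
    """Return the cheapest entry whose score is within `tolerance` of the leader."""
    top = max(e.score for e in entries)
    # Assumption: "within 10%" is relative, i.e. score >= 0.9 * top score.
    eligible = [e for e in entries if e.score >= top * (1 - tolerance)]
    return min(eligible, key=lambda e: e.input_price)


entries = [
    Entry("Qwen3.5-27B", 0.950, 0.30),
    Entry("Qwen3.5-35B-A3B", 0.919, 0.25),
    Entry("hypothetical-low-scorer", 0.800, 0.08),  # made-up row to exercise the filter
]
print(best_value(entries).name)
```

With these rows the cutoff is 0.855, so the made-up low scorer is excluded despite its lower price, and the cheaper of the two remaining models is returned.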

How recent are the IFEval leaderboard results?

The IFEval leaderboard was last updated in June 2026 and currently includes 64 evaluated models.

More evaluations to explore

Related benchmarks in the same category

GPQA

A challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. Questions are Google-proof and extremely difficult, with PhD experts reaching 65% accuracy.

general
216 models
MMLU-Pro

A more robust and challenging multi-task language understanding benchmark that extends MMLU by expanding multiple-choice options from 4 to 10, eliminating trivial questions, and focusing on reasoning-intensive tasks. Features over 12,000 curated questions across 14 domains and causes a 16-33% accuracy drop compared to original MMLU.

general
120 models
MMLU

Massive Multitask Language Understanding benchmark testing knowledge across 57 diverse subjects including STEM, humanities, social sciences, and professional domains

general
99 models
LiveCodeBench

LiveCodeBench is a holistic and contamination-free evaluation benchmark for large language models for code. It continuously collects new problems from programming contests (LeetCode, AtCoder, CodeForces) and evaluates four different scenarios: code generation, self-repair, code execution, and test output prediction. Problems are annotated with release dates to enable evaluation on unseen problems released after a model's training cutoff.

general
71 models
MMMU

MMMU (Massive Multi-discipline Multimodal Understanding) is a benchmark designed to evaluate multimodal models on college-level subject knowledge and deliberate reasoning. Contains 11.5K meticulously collected multimodal questions from college exams, quizzes, and textbooks, covering six core disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering across 30 subjects and 183 subfields.

general, multimodal
62 models
MMMU-Pro

A more robust multi-discipline multimodal understanding benchmark that enhances MMMU through a three-step process: filtering text-only answerable questions, augmenting candidate options, and introducing vision-only input settings. Achieves significantly lower model performance (16.8-26.9%) compared to original MMMU, providing more rigorous evaluation that closely mimics real-world scenarios.

general, multimodal
49 models