IFEval

Instruction-Following Evaluation (IFEval) is a benchmark for large language models that focuses on verifiable instructions. It covers 25 instruction types across roughly 500 prompts, each containing one or more verifiable constraints.
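"Verifiable" here means each constraint can be checked deterministically by a program rather than judged by another model. A minimal sketch of such checkers, assuming a few representative constraint types (the function names and rules below are illustrative, not the benchmark's exact implementations):

```python
import re

# Illustrative checkers in the spirit of IFEval's verifiable instructions;
# the real benchmark defines 25 instruction types with precise rules.

def check_word_count_at_least(response: str, n: int) -> bool:
    """Pass if the response contains at least n words."""
    return len(response.split()) >= n

def check_no_commas(response: str) -> bool:
    """Pass if the response contains no commas."""
    return "," not in response

def check_ends_with(response: str, phrase: str) -> bool:
    """Pass if the response ends with the exact phrase."""
    return response.rstrip().endswith(phrase)

def check_bullet_count(response: str, n: int) -> bool:
    """Pass if the response has exactly n markdown bullet points."""
    return len(re.findall(r"^\s*[-*] ", response, flags=re.MULTILINE)) == n

# A prompt may attach several constraints; all must hold for it to pass.
response = "- first point\n- second point\nThat is all."
constraints = [
    (check_bullet_count, (response, 2)),
    (check_no_commas, (response,)),
]
print(all(fn(*args) for fn, args in constraints))  # True
```

Because every check is pure string logic, results are reproducible and require no judge model.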

Paper

https://arxiv.org/abs/2311.07911

Progress Over Time

Interactive timeline showing model performance evolution on IFEval


IFEval Leaderboard

63 models, ranked by IFEval score. [Interactive table listing each model's rank, organization, parameter count, context window, input/output cost per million tokens, and license.]

FAQ

Common questions about IFEval

What is IFEval?
Instruction-Following Evaluation (IFEval) is a benchmark for large language models that focuses on verifiable instructions: 25 instruction types across roughly 500 prompts, each containing one or more verifiable constraints.

Where can I read the IFEval paper?
The IFEval paper is available at https://arxiv.org/abs/2311.07911. It details the benchmark methodology, dataset creation, and evaluation criteria.

How does the IFEval leaderboard rank models?
The leaderboard ranks 63 AI models by their IFEval scores. Qwen3.5-27B by Alibaba Cloud / Qwen Team currently leads with a score of 0.950; the average score across all models is 0.843.

What is the highest IFEval score?
The highest IFEval score is 0.950, achieved by Qwen3.5-27B from Alibaba Cloud / Qwen Team.
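A single leaderboard number aggregates many per-constraint checks. A sketch of the two common ways such results are rolled up, prompt-level accuracy (every constraint in a prompt must pass) and instruction-level accuracy (each constraint counted individually), following the definitions in the IFEval paper; the per-prompt pass/fail data below is invented for illustration:

```python
# Hypothetical per-prompt results: one boolean per verifiable constraint.
results = [
    [True, True],         # prompt 1: both constraints satisfied
    [True, False, True],  # prompt 2: one constraint failed
    [False],              # prompt 3: its single constraint failed
]

# Prompt-level accuracy: fraction of prompts where ALL constraints pass.
prompt_level = sum(all(r) for r in results) / len(results)

# Instruction-level accuracy: fraction of individual constraints that pass.
total = sum(len(r) for r in results)
instruction_level = sum(sum(r) for r in results) / total

print(f"prompt-level: {prompt_level:.2f}")            # 0.33 (1 of 3 prompts)
print(f"instruction-level: {instruction_level:.2f}")  # 0.67 (4 of 6 constraints)
```

Prompt-level accuracy is the stricter of the two, since a single failed constraint fails the whole prompt.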

How many models have been evaluated on IFEval?
63 models have been evaluated on the IFEval benchmark; all 63 results are self-reported, and none have been independently verified.

What categories does IFEval cover?
IFEval falls under the general, instruction-following, and structured-output categories, and it evaluates text models.