IFEval
The Instruction-Following Evaluation (IFEval) benchmark tests large language models on verifiable instructions: 25 instruction types spread across roughly 500 prompts, each prompt containing one or more constraints that can be checked programmatically.
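IFEval's defining property is that each constraint can be verified by code rather than judged by another model. As an illustrative sketch (these checker names and patterns are hypothetical, not taken from the official evaluation harness), verifiers for constraints like "answer with at least N words" or "include a keyword" might look like:

```python
import re

# Hypothetical verifiers for a few IFEval-style constraints.
# Each takes the model's response plus constraint parameters and
# returns True iff the response satisfies the constraint.

def check_min_words(response: str, n: int) -> bool:
    """'Answer with at least {n} words.'"""
    return len(response.split()) >= n

def check_keyword(response: str, keyword: str) -> bool:
    """'Include the keyword "{keyword}" in your response.'"""
    return keyword.lower() in response.lower()

def check_num_bullets(response: str, n: int) -> bool:
    """'Your answer must contain exactly {n} bullet points.'"""
    bullets = re.findall(r"^\s*[*-] ", response, flags=re.MULTILINE)
    return len(bullets) == n

response = "- First point about IFEval\n- Second point about scoring"
results = [
    check_min_words(response, 5),
    check_keyword(response, "ifeval"),
    check_num_bullets(response, 2),
]
print(all(results))  # True: all three constraints are satisfied
```

Because every check is a deterministic function of the response text, scoring is fully reproducible, which is what distinguishes IFEval from judge-based instruction-following benchmarks.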
Progress Over Time

[Interactive timeline showing model performance evolution on IFEval, with the state-of-the-art frontier traced across open and proprietary models.]
IFEval Leaderboard
63 models
| Rank | Organization | Params | Context | Cost per 1M tokens (in / out) | License | Score |
|---|---|---|---|---|---|---|
| 1 | Alibaba Cloud / Qwen Team | 27B | 262K | $0.30 / $2.40 | ||
| 2 | Alibaba Cloud / Qwen Team | — | — | — | ||
| 3 | OpenAI | — | 200K | $1.10 / $4.40 | ||
| 4 | Alibaba Cloud / Qwen Team | 122B | 262K | $0.40 / $3.20 | ||
| 5 | Anthropic | — | 200K | $3.00 / $15.00 | ||
| 6 | Alibaba Cloud / Qwen Team | 397B | 262K | $0.60 / $3.60 | ||
| 7 | Amazon | — | 300K | $0.80 / $3.20 | ||
| 7 | — | 70B | 128K | $0.20 / $0.20 | ||
| 9 | Alibaba Cloud / Qwen Team | 35B | 262K | $0.25 / $2.00 | ||
| 10 | Alibaba Cloud / Qwen Team | 9B | — | — | ||
| 11 | Google | 27B | 131K | $0.10 / $0.20 | ||
| 12 | NVIDIA | 9B | — | — | ||
| 13 | Google | 4B | 131K | $0.02 / $0.04 | ||
| 14 | Moonshot AI | 1.0T | — | — | ||
| 14 | Moonshot AI | 1.0T | 200K | $0.50 / $0.50 | ||
| 14 | Alibaba Cloud / Qwen Team | 4B | — | — | ||
| 17 | Amazon | — | 300K | $0.06 / $0.24 | ||
| 18 | Meituan | 560B | 128K | $0.30 / $1.20 | ||
| 19 | — | 253B | — | — | ||
| 20 | Alibaba Cloud / Qwen Team | 80B | 66K | $0.15 / $1.50 | ||
| 20 | Google | 12B | 131K | $0.05 / $0.10 | ||
| 22 | Alibaba Cloud / Qwen Team | 235B | 262K | $0.15 / $0.80 | ||
| 23 | — | 405B | 128K | $0.89 / $0.89 | ||
| 24 | OpenAI | — | 128K | $75.00 / $150.00 | ||
| 24 | Alibaba Cloud / Qwen Team | 236B | 262K | $0.45 / $3.49 | ||
| 26 | Alibaba Cloud / Qwen Team | 33B | — | — | ||
| 26 | Alibaba Cloud / Qwen Team | 235B | 262K | $0.30 / $3.00 | ||
| 26 | Alibaba Cloud / Qwen Team | 236B | 262K | $0.30 / $1.50 | ||
| 29 | Alibaba Cloud / Qwen Team | 80B | 66K | $0.15 / $1.50 | ||
| 30 | — | 70B | 128K | $0.20 / $0.20 | ||
| 31 | OpenAI | — | 1.0M | $2.00 / $8.00 | ||
| 32 | Moonshot AI | — | — | — | ||
| 32 | Amazon | — | 128K | $0.03 / $0.14 | ||
| 34 | DeepSeek | 671B | 131K | $0.27 / $1.10 | ||
| 35 | Alibaba Cloud / Qwen Team | 31B | 262K | $0.20 / $0.70 | ||
| 36 | Microsoft | 14B | — | — | ||
| 37 | Sarvam AI | 105B | — | — | ||
| 38 | Alibaba Cloud / Qwen Team | 33B | — | — | ||
| 39 | OpenAI | — | 1.0M | $0.40 / $1.60 | ||
| 39 | Alibaba Cloud / Qwen Team | 73B | 131K | $0.35 / $0.40 | ||
| 41 | Alibaba Cloud / Qwen Team | 33B | — | — | ||
| 42 | Alibaba Cloud / Qwen Team | 9B | 262K | $0.08 / $0.50 | ||
| 43 | Microsoft | 14B | — | — | ||
| 44 | Alibaba Cloud / Qwen Team | 9B | 262K | $0.18 / $2.09 | ||
| 45 | Mistral AI | 24B | 32K | $0.07 / $0.14 | ||
| 46 | Alibaba Cloud / Qwen Team | 4B | 262K | $0.10 / $1.00 | ||
| 47 | Alibaba Cloud / Qwen Team | 4B | 262K | $0.10 / $0.60 | ||
| 48 | Alibaba Cloud / Qwen Team | 31B | 262K | $0.20 / $1.00 | ||
| 49 | OpenAI | — | 128K | $2.50 / $10.00 | ||
| 50 | — | 8B | 131K | $0.03 / $0.03 | ||
Showing models 1–50 of 63.
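The Cost column pairs two per-million-token prices; the "$in / $out" (input / output) reading is an assumption here, matching the usual convention on provider pricing pages. Under that assumption, estimating the cost of a request is simple arithmetic:

```python
# Estimate request cost from the table's "$X / $Y" pricing,
# assuming prices are USD per 1M tokens (input / output).

def request_cost(in_tokens: int, out_tokens: int,
                 in_price: float, out_price: float) -> float:
    return in_tokens / 1e6 * in_price + out_tokens / 1e6 * out_price

# Example: the rank-1 entry lists $0.30 / $2.40.
cost = request_cost(in_tokens=2_000, out_tokens=500,
                    in_price=0.30, out_price=2.40)
print(f"${cost:.4f}")  # $0.0018
```

Note how output tokens dominate the bill for most entries: output prices in the table run several times higher than input prices.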
FAQ
Common questions about IFEval
The IFEval paper is available at https://arxiv.org/abs/2311.07911. This paper provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.
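The paper reports accuracy at two granularities: prompt level (a prompt counts only if every instruction in it is followed) and instruction level (each instruction counts independently). A minimal sketch of that aggregation, assuming per-instruction pass/fail booleans have already been produced by the constraint checkers:

```python
# Sketch of IFEval's two aggregation levels, assuming each prompt
# has already been scored as a list of per-instruction booleans.

def prompt_level_accuracy(results: list[list[bool]]) -> float:
    # A prompt counts only if *every* instruction in it is followed.
    return sum(all(r) for r in results) / len(results)

def instruction_level_accuracy(results: list[list[bool]]) -> float:
    # Each instruction is counted independently.
    flat = [ok for r in results for ok in r]
    return sum(flat) / len(flat)

# Example: three prompts carrying one to three instructions each.
scored = [[True, True], [True, False, True], [True]]
print(prompt_level_accuracy(scored))       # 2/3 ≈ 0.667
print(instruction_level_accuracy(scored))  # 5/6 ≈ 0.833
```

Prompt-level accuracy is the stricter of the two, since a single missed constraint fails the whole prompt; which variant a leaderboard reports affects comparability across sources.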
The IFEval leaderboard ranks 63 AI models based on their performance on this benchmark. Currently, Qwen3.5-27B by Alibaba Cloud / Qwen Team leads with a score of 0.950. The average score across all models is 0.843.
The highest IFEval score is 0.950, achieved by Qwen3.5-27B from Alibaba Cloud / Qwen Team.
63 models have been evaluated on the IFEval benchmark, with 0 verified results and 63 self-reported results.
IFEval is categorized under general, instruction following, and structured output. The benchmark evaluates text models.