IFEval
The Instruction-Following Evaluation (IFEval) benchmark tests large language models on verifiable instructions. It defines 25 instruction types and around 500 prompts, each containing one or more verifiable constraints.
Qwen3.5-27B from Alibaba Cloud / Qwen Team currently leads the IFEval leaderboard with a score of 0.950 across 64 evaluated AI models.
What IFEval measures
IFEval is a text benchmark that evaluates large language models on general, instruction-following, and structured-output tasks. LLM Stats tracks 64 models on this benchmark, which has a maximum possible score of 1. The current average across reported models is about 0.8, and the leader scores 0.950.
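As a rough illustration of how a score between 0 and 1 can arise from per-instruction pass/fail checks, the sketch below aggregates boolean results in two ways. The IFEval paper reports both prompt-level and instruction-level accuracy; this page does not say which variant its scores represent, so the function names and the sample data here are assumptions made for illustration only.

```python
from typing import List


def instruction_level_accuracy(results: List[List[bool]]) -> float:
    """Fraction of individual instructions that were followed."""
    total = sum(len(prompt) for prompt in results)
    followed = sum(sum(prompt) for prompt in results)
    return followed / total if total else 0.0


def prompt_level_accuracy(results: List[List[bool]]) -> float:
    """Fraction of prompts whose instructions were all followed."""
    if not results:
        return 0.0
    return sum(all(prompt) for prompt in results) / len(results)


# Hypothetical results: one inner list per prompt, one bool per instruction.
results = [[True, True], [True, False], [True]]
print(instruction_level_accuracy(results))  # 4 of 5 instructions followed
print(prompt_level_accuracy(results))       # 2 of 3 prompts fully followed
```

Instruction-level accuracy gives partial credit for prompts with several constraints, while prompt-level accuracy is stricter, so the two can differ noticeably on the same outputs.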
Publication
- Paper: Instruction-Following Evaluation for Large Language Models
- Authors: Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, and 4 others
- Published: arXiv 2311.07911
Abstract
One core capability of Large Language Models (LLMs) is to follow natural language instructions. However, the evaluation of such abilities is not standardized: Human evaluations are expensive, slow, and not objectively reproducible, while LLM-based auto-evaluation is potentially biased or limited by the ability of the evaluator LLM. To overcome these issues, we introduce Instruction-Following Eval (IFEval) for large language models. IFEval is a straightforward and easy-to-reproduce evaluation benchmark. It focuses on a set of "verifiable instructions" such as "write in more than 400 words" and "mention the keyword of AI at least 3 times". We identified 25 types of those verifiable instructions and constructed around 500 prompts, with each prompt containing one or more verifiable instructions. We show evaluation results of two widely available LLMs on the market. Our code and data can be found at https://github.com/google-research/google-research/tree/master/instruction_following_eval
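The two example instructions quoted in the abstract can be checked with plain string processing, which is what makes them "verifiable". The following is a minimal sketch of such checks, not the official IFEval checker; the function names, the whitespace-based word count, and the whole-word, case-insensitive keyword match are all assumptions chosen for illustration.

```python
import re


def check_min_words(response: str, min_words: int) -> bool:
    """True if the response has strictly more than `min_words` words.

    Words are split on whitespace (an assumption of this sketch).
    """
    return len(response.split()) > min_words


def check_keyword_frequency(response: str, keyword: str, min_count: int) -> bool:
    """True if `keyword` appears at least `min_count` times.

    Matching is case-insensitive and restricted to whole words.
    """
    pattern = r"\b" + re.escape(keyword) + r"\b"
    matches = re.findall(pattern, response, flags=re.IGNORECASE)
    return len(matches) >= min_count


response = "AI is everywhere. Many teams use AI daily, and AI tools keep improving."
print(check_keyword_frequency(response, "AI", 3))  # keyword constraint
print(check_min_words(response, 400))              # length constraint
```

Because each check is a deterministic function of the response text, results are reproducible without a human rater or an evaluator LLM.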
Qwen3.5-27B leads with 95.0%, followed by Qwen3.7 Max at 94.3% and Qwen3.6 Plus at 94.3%.
IFEval Leaderboard
| Rank | Organization | Parameters | Context | Cost | License | |
|---|---|---|---|---|---|---|
| 1 | Alibaba Cloud / Qwen Team | 27B | 262K | $0.30 / $2.40 | ||
| 2 | Alibaba Cloud / Qwen Team | — | 1.0M | $1.25 / $3.75 | ||
| 2 | Alibaba Cloud / Qwen Team | — | 1.0M | $0.50 / $3.00 | ||
| 4 | OpenAI | — | — | — | ||
| 5 | Alibaba Cloud / Qwen Team | 122B | 262K | $0.40 / $3.20 | ||
| 6 | Anthropic | — | — | — | ||
| 7 | Alibaba Cloud / Qwen Team | 397B | 262K | $0.60 / $3.60 | ||
| 8 | | 70B | — | — | | |
| 8 | Amazon | — | — | — | ||
| 10 | Alibaba Cloud / Qwen Team | 35B | 262K | $0.25 / $2.00 | ||
| 11 | Alibaba Cloud / Qwen Team | 9B | — | — | ||
| 12 | Google | 27B | — | — | ||
| 13 | NVIDIA | 9B | — | — | ||
| 14 | Google | 4B | — | — | ||
| 15 | Moonshot AI | 1.0T | — | — | ||
| 15 | Moonshot AI | 1.0T | — | — | ||
| 15 | Alibaba Cloud / Qwen Team | 4B | — | — | ||
| 18 | Amazon | — | — | — | ||
| 19 | Meituan | 560B | 128K | $0.30 / $1.20 | ||
| 20 | | 253B | — | — | | |
| 21 | Alibaba Cloud / Qwen Team | 80B | — | — | ||
| 21 | Google | 12B | — | — | ||
| 23 | Alibaba Cloud / Qwen Team | 235B | — | — | ||
| 24 | | 405B | — | — | | |
| 25 | Alibaba Cloud / Qwen Team | 236B | — | — | ||
| 25 | OpenAI | — | — | — | ||
| 27 | Alibaba Cloud / Qwen Team | 33B | — | — | ||
| 27 | Alibaba Cloud / Qwen Team | 235B | — | — | ||
| 27 | Alibaba Cloud / Qwen Team | 236B | 262K | $0.30 / $1.50 | ||
| 30 | Alibaba Cloud / Qwen Team | 80B | — | — | ||
| 31 | | 70B | — | — | | |
| 32 | OpenAI | — | 1.0M | $2.00 / $8.00 | ||
| 33 | Moonshot AI | — | — | — | ||
| 33 | Amazon | — | — | — | ||
| 35 | DeepSeek | 671B | — | — | ||
| 36 | Alibaba Cloud / Qwen Team | 31B | — | — | ||
| 37 | Microsoft | 14B | — | — | ||
| 38 | Sarvam AI | 105B | — | — | ||
| 39 | Alibaba Cloud / Qwen Team | 33B | — | — | ||
| 40 | OpenAI | — | 1.0M | $0.40 / $1.60 | ||
| 40 | Alibaba Cloud / Qwen Team | 73B | — | — | ||
| 42 | Alibaba Cloud / Qwen Team | 33B | — | — | ||
| 43 | Alibaba Cloud / Qwen Team | 9B | 262K | $0.08 / $0.50 | ||
| 44 | Microsoft | 14B | — | — | ||
| 45 | Alibaba Cloud / Qwen Team | 9B | 262K | $0.18 / $2.09 | ||
| 46 | Mistral AI | 24B | — | — | ||
| 47 | Alibaba Cloud / Qwen Team | 4B | 262K | $0.10 / $1.00 | ||
| 48 | Alibaba Cloud / Qwen Team | 4B | 262K | $0.10 / $0.60 | ||
| 49 | Alibaba Cloud / Qwen Team | 31B | — | — | ||
| 50 | OpenAI | — | 128K | $2.50 / $10.00 | | |
More evaluations to explore
Related benchmarks in the same category
- GPQA: A challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. Questions are Google-proof and extremely difficult, with PhD experts reaching 65% accuracy.
- MMLU-Pro: A more robust and challenging multi-task language understanding benchmark that extends MMLU by expanding multiple-choice options from 4 to 10, eliminating trivial questions, and focusing on reasoning-intensive tasks. It features over 12,000 curated questions across 14 domains and causes a 16-33% accuracy drop compared to the original MMLU.
- MMLU: The Massive Multitask Language Understanding benchmark, which tests knowledge across 57 diverse subjects, including STEM, humanities, social sciences, and professional domains.
- LiveCodeBench: A holistic and contamination-free evaluation benchmark for large language models for code. It continuously collects new problems from programming contests (LeetCode, AtCoder, CodeForces) and evaluates four different scenarios: code generation, self-repair, code execution, and test output prediction. Problems are annotated with release dates to enable evaluation on unseen problems released after a model's training cutoff.
- MMMU (Massive Multi-discipline Multimodal Understanding): A benchmark designed to evaluate multimodal models on college-level subject knowledge and deliberate reasoning. It contains 11.5K meticulously collected multimodal questions from college exams, quizzes, and textbooks, covering six core disciplines (Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering) across 30 subjects and 183 subfields.
- MMMU-Pro: A more robust multi-discipline multimodal understanding benchmark that enhances MMMU through a three-step process: filtering out questions answerable from text alone, augmenting candidate options, and introducing a vision-only input setting. Models score significantly lower on it (16.8-26.9%) than on the original MMMU, giving a more rigorous evaluation that closely mimics real-world scenarios.