IFEval

Name: IFEval Leaderboard — AI Model Scores
Creator: LLM Stats
License: https://llm-stats.com/legal/terms-of-service

Instruction-Following Evaluation (IFEval) is a benchmark for large language models that focuses on verifiable instructions: 25 instruction types and around 500 prompts, each containing one or more verifiable constraints.

Qwen3.5-27B from Alibaba Cloud / Qwen Team currently leads the IFEval leaderboard with a score of 0.950 across 64 evaluated AI models.

About this benchmark

What IFEval measures

IFEval is a text benchmark that evaluates large language models on general, instruction following, and structured output tasks. LLM Stats tracks 64 models on this benchmark, with a maximum possible score of 1. The current average across reported models is 0.844, and the leader reaches 0.950.

Publication

Paper: Instruction-Following Evaluation for Large Language Models
Authors: Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, and 4 others
Published: November 14, 2023
arXiv: 2311.07911

Abstract

One core capability of Large Language Models (LLMs) is to follow natural language instructions. However, the evaluation of such abilities is not standardized: Human evaluations are expensive, slow, and not objectively reproducible, while LLM-based auto-evaluation is potentially biased or limited by the ability of the evaluator LLM. To overcome these issues, we introduce Instruction-Following Eval (IFEval) for large language models. IFEval is a straightforward and easy-to-reproduce evaluation benchmark. It focuses on a set of "verifiable instructions" such as "write in more than 400 words" and "mention the keyword of AI at least 3 times". We identified 25 types of those verifiable instructions and constructed around 500 prompts, with each prompt containing one or more verifiable instructions. We show evaluation results of two widely available LLMs on the market. Our code and data can be found at https://github.com/google-research/google-research/tree/master/instruction_following_eval
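The two example instructions quoted in the abstract can be checked with ordinary string processing, which is what makes them "verifiable". The following is a minimal sketch and not the paper's official checker: the function names, the whitespace-based word count, and the whole-word, case-insensitive keyword match are all assumptions made for illustration.

```python
import re


def has_more_than_n_words(response: str, n: int) -> bool:
    """True if the response has strictly more than n whitespace-separated words."""
    return len(response.split()) > n


def mentions_keyword_at_least(response: str, keyword: str, times: int) -> bool:
    """True if keyword occurs as a whole word at least `times` times (case-insensitive)."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return len(re.findall(pattern, response, flags=re.IGNORECASE)) >= times


def follows_all(response: str, checks) -> bool:
    """A prompt may carry several instructions; here all of them must pass (assumed policy)."""
    return all(check(response) for check in checks)


# Hypothetical response and thresholds, scaled down from the abstract's examples.
response = "AI helps people. Many teams use AI daily, and AI tools keep improving."
checks = [
    lambda r: has_more_than_n_words(r, 10),
    lambda r: mentions_keyword_at_least(r, "AI", 3),
]
print(follows_all(response, checks))
```

Because each check is a deterministic function of the response text, the same response always receives the same verdict, with no human or LLM judge involved.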

Qwen3.5-27B leads with 95.0%, followed by Qwen3.7 Max at 94.3% and Qwen3.6 Plus at 94.3%.

Progress Over Time

[Interactive chart: timeline of model performance on IFEval, showing the state-of-the-art frontier with open and proprietary models distinguished.]

IFEval Leaderboard

64 models

Rank	Model (Organization)	Parameters	Context	Cost (input / output, USD per 1M tokens)
1	Qwen3.5-27B Alibaba Cloud / Qwen Team	27B	262K	$0.30 / $2.40
2	Qwen3.7 Max Alibaba Cloud / Qwen Team	—	1.0M	$1.25 / $3.75
2	Qwen3.6 Plus Alibaba Cloud / Qwen Team	—	1.0M	$0.50 / $3.00
4	o3-mini OpenAI	—	—	—
5	Qwen3.5-122B-A10B Alibaba Cloud / Qwen Team	122B	262K	$0.40 / $3.20
6	Claude 3.7 Sonnet Anthropic	—	—	—
7	Qwen3.5-397B-A17B Alibaba Cloud / Qwen Team	397B	262K	$0.60 / $3.60
8	Llama 3.3 70B Instruct Meta	70B	—	—
8	Nova Pro Amazon	—	—	—
10	Qwen3.5-35B-A3B Alibaba Cloud / Qwen Team	35B	262K	$0.25 / $2.00
11	Qwen3.5-9B Alibaba Cloud / Qwen Team	9B	—	—
12	Gemma 3 27B Google	27B	—	—
13	Nemotron Nano 9B v2 NVIDIA	9B	—	—
14	Gemma 3 4B Google	4B	—	—
15	Kimi K2-Instruct-0905 Moonshot AI	1.0T	—	—
15	Kimi K2 Instruct Moonshot AI	1.0T	—	—
15	Qwen3.5-4B Alibaba Cloud / Qwen Team	4B	—	—
18	Nova Lite Amazon	—	—	—
19	LongCat-Flash-Chat Meituan	560B	128K	$0.30 / $1.20
20	Llama 3.1 Nemotron Ultra 253B v1 NVIDIA	253B	—	—
21	Qwen3-Next-80B-A3B-Thinking Alibaba Cloud / Qwen Team	80B	—	—
21	Gemma 3 12B Google	12B	—	—
23	Qwen3-235B-A22B-Instruct-2507 Alibaba Cloud / Qwen Team	235B	—	—
24	Llama 3.1 405B Instruct Meta	405B	—	—
25	Qwen3 VL 235B A22B Thinking Alibaba Cloud / Qwen Team	236B	—	—
25	GPT-4.5 OpenAI	—	—	—
27	Qwen3 VL 32B Thinking Alibaba Cloud / Qwen Team	33B	—	—
27	Qwen3-235B-A22B-Thinking-2507 Alibaba Cloud / Qwen Team	235B	—	—
27	Qwen3 VL 235B A22B Instruct Alibaba Cloud / Qwen Team	236B	262K	$0.30 / $1.50
30	Qwen3-Next-80B-A3B-Instruct Alibaba Cloud / Qwen Team	80B	—	—
31	Llama 3.1 70B Instruct Meta	70B	—	—
32	GPT-4.1 OpenAI	—	1.0M	$2.00 / $8.00
33	Kimi-k1.5 Moonshot AI	—	—	—
33	Nova Micro Amazon	—	—	—
35	DeepSeek-V3 DeepSeek	671B	—	—
36	Qwen3 VL 30B A3B Instruct Alibaba Cloud / Qwen Team	31B	—	—
37	Phi 4 Reasoning Plus Microsoft	14B	—	—
38	Sarvam-105B Sarvam AI	105B	—	—
39	Qwen3 VL 32B Instruct Alibaba Cloud / Qwen Team	33B	—	—
40	GPT-4.1 mini OpenAI	—	1.0M	$0.40 / $1.60
40	Qwen2.5 72B Instruct Alibaba Cloud / Qwen Team	73B	—	—
42	QwQ-32B Alibaba Cloud / Qwen Team	33B	—	—
43	Qwen3 VL 8B Instruct Alibaba Cloud / Qwen Team	9B	262K	$0.08 / $0.50
44	Phi 4 Reasoning Microsoft	14B	—	—
45	Qwen3 VL 8B Thinking Alibaba Cloud / Qwen Team	9B	262K	$0.18 / $2.09
46	Mistral Small 3 24B Instruct Mistral AI	24B	—	—
47	Qwen3 VL 4B Thinking Alibaba Cloud / Qwen Team	4B	262K	$0.10 / $1.00
48	Qwen3 VL 4B Instruct Alibaba Cloud / Qwen Team	4B	262K	$0.10 / $0.60
49	Qwen3 VL 30B A3B Thinking Alibaba Cloud / Qwen Team	31B	—	—
50	GPT-4o OpenAI	—	128K	$2.50 / $10.00

Showing models 1–50 of 64.
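The rank column in the table repeats a rank for tied models and then skips ahead (for example 1, 2, 2, 4), which is consistent with standard competition ranking. The sketch below shows one way to produce such ranks. It is an assumption about how the ranks could be computed, not LLM Stats' actual code, and it uses only the four scores stated elsewhere on this page, so the ranks it assigns apply to that subset rather than to the full leaderboard.

```python
def competition_ranks(scores: dict[str, float]) -> dict[str, int]:
    """Assign 1-based ranks; tied scores share a rank and the following rank is skipped."""
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    ranks: dict[str, int] = {}
    previous_score = None
    previous_rank = 0
    for position, (model, score) in enumerate(ordered, start=1):
        # Reuse the previous rank on a tie, otherwise take the list position.
        rank = previous_rank if score == previous_score else position
        ranks[model] = rank
        previous_score, previous_rank = score, rank
    return ranks


# Scores quoted on this page; the subset is illustrative only.
scores = {
    "Qwen3.5-27B": 0.950,
    "Qwen3.7 Max": 0.943,
    "Qwen3.6 Plus": 0.943,
    "Qwen3.5-35B-A3B": 0.919,
}
for model, rank in competition_ranks(scores).items():
    print(rank, model)
```

Ties here are decided on the displayed three-decimal scores; a site that ranks on unrounded values could order tied entries differently.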


FAQ

Common questions about IFEval.

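What is the IFEval benchmark?

IFEval (Instruction-Following Evaluation) is a benchmark for large language models that focuses on verifiable instructions. It covers 25 types of instructions and around 500 prompts, each containing one or more verifiable constraints.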

What is the IFEval leaderboard?

The IFEval leaderboard ranks 64 AI models based on their performance on this benchmark. Currently, Qwen3.5-27B by Alibaba Cloud / Qwen Team leads with a score of 0.950. The average score across all models is 0.844.

What is the highest IFEval score?

The highest IFEval score is 0.950, achieved by Qwen3.5-27B from Alibaba Cloud / Qwen Team.

How many models are evaluated on IFEval?

64 models have been evaluated on the IFEval benchmark, with 0 verified results and 64 self-reported results.

Where can I find the IFEval paper?

The IFEval paper is available at https://arxiv.org/abs/2311.07911. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does IFEval cover?

IFEval is categorized under general, instruction following, and structured output. The benchmark evaluates text models.

What is the best open-source model on IFEval?

Qwen3.5-27B by Alibaba Cloud / Qwen Team is the top-ranked open-source model on IFEval, with a score of 0.950 (rank #1).

Which model offers the best value on IFEval?

Among models scoring within 10% of the leader, Qwen3.5-35B-A3B from Alibaba Cloud / Qwen Team is the cheapest, at $0.25 per million input tokens with a score of 0.919.
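The "best value" selection described above can be sketched as a filter followed by a minimum. This is a hedged illustration, not the site's implementation: the `Entry` type and `best_value` function are hypothetical names, "within 10% of the leader" is read as a score of at least 90% of the top score, and the data is limited to the four models whose scores and input prices both appear on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    model: str
    score: float        # IFEval score on a 0-1 scale
    input_price: float  # USD per million input tokens


def best_value(entries: list[Entry], tolerance: float = 0.10) -> Entry:
    """Return the cheapest entry whose score is within `tolerance` of the top score."""
    leader_score = max(entry.score for entry in entries)
    threshold = leader_score * (1 - tolerance)  # assumed reading of "within 10%"
    eligible = [entry for entry in entries if entry.score >= threshold]
    return min(eligible, key=lambda entry: entry.input_price)


# Scores and input prices as quoted on this page (illustrative subset).
entries = [
    Entry("Qwen3.5-27B", 0.950, 0.30),
    Entry("Qwen3.7 Max", 0.943, 1.25),
    Entry("Qwen3.6 Plus", 0.943, 0.50),
    Entry("Qwen3.5-35B-A3B", 0.919, 0.25),
]
print(best_value(entries).model)
```

Only the input price is compared here, matching the wording of the answer above; a selection that also weighs output price could pick a different model.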

How recent are the IFEval leaderboard results?

The IFEval leaderboard was last updated in June 2026 and currently includes 64 evaluated models.

More evaluations to explore

Related benchmarks in the same category

GPQA

A challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. Questions are Google-proof and extremely difficult, with PhD experts reaching 65% accuracy.

general

216 models

MMLU-Pro

A more robust and challenging multi-task language understanding benchmark that extends MMLU by expanding multiple-choice options from 4 to 10, eliminating trivial questions, and focusing on reasoning-intensive tasks. Features over 12,000 curated questions across 14 domains and causes a 16-33% accuracy drop compared to original MMLU.

general

120 models

MMLU

Massive Multitask Language Understanding benchmark testing knowledge across 57 diverse subjects, including STEM, humanities, social sciences, and professional domains.

general

99 models

LiveCodeBench

LiveCodeBench is a holistic and contamination-free evaluation benchmark for large language models for code. It continuously collects new problems from programming contests (LeetCode, AtCoder, CodeForces) and evaluates four different scenarios: code generation, self-repair, code execution, and test output prediction. Problems are annotated with release dates to enable evaluation on unseen problems released after a model's training cutoff.

general

71 models

MMMU

MMMU (Massive Multi-discipline Multimodal Understanding) is a benchmark designed to evaluate multimodal models on college-level subject knowledge and deliberate reasoning. Contains 11.5K meticulously collected multimodal questions from college exams, quizzes, and textbooks, covering six core disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering across 30 subjects and 183 subfields.

general, multimodal

62 models

MMMU-Pro

A more robust multi-discipline multimodal understanding benchmark that enhances MMMU through a three-step process: filtering text-only answerable questions, augmenting candidate options, and introducing vision-only input settings. Achieves significantly lower model performance (16.8-26.9%) compared to original MMMU, providing more rigorous evaluation that closely mimics real-world scenarios.

general, multimodal

49 models