IF


Progress Over Time

[Interactive timeline showing model performance evolution on IF. Legend: state-of-the-art frontier, open models, proprietary models.]

IF Leaderboard

2 models

1. Mistral Small 3.2 24B Instruct (Mistral AI): 24B parameters
2. MiniMax M2 (MiniMax): 230B parameters, 1.0M context, cost $0.30 / $1.20
About this benchmark

What is IF?

IF refers to the Instruction-Following Evaluation (IFEval) benchmark for large language models. It focuses on verifiable instructions: it defines 25 instruction types and contains around 500 prompts, each with one or more verifiable constraints.
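To make "verifiable" concrete, here is a minimal sketch of how two of the example instructions quoted in the paper's abstract ("write in more than 400 words" and "mention the keyword of AI at least 3 times") could be checked programmatically. The function names, the whitespace-based word count, and the case-sensitive whole-word keyword match are assumptions made for illustration; this is not the official IFEval code.

```python
import re


def check_min_words(response: str, min_words: int) -> bool:
    """Return True if the response has more than `min_words` words.

    Words are counted by splitting on whitespace (an assumption).
    """
    return len(response.split()) > min_words


def check_keyword_frequency(response: str, keyword: str, min_count: int) -> bool:
    """Return True if `keyword` occurs at least `min_count` times.

    Matching is case-sensitive and on whole words only (an assumption).
    """
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return len(re.findall(pattern, response)) >= min_count


def follows_all(response: str, checks) -> bool:
    """A prompt counts as followed only if every attached check passes."""
    return all(check(response) for check in checks)


# Hypothetical response and constraints, scaled down for readability.
response = "AI is useful. AI is everywhere. I study AI daily."
checks = [
    lambda r: check_min_words(r, 5),
    lambda r: check_keyword_frequency(r, "AI", 3),
]
print(follows_all(response, checks))
```

Because each check is a deterministic function of the response text, results can be reproduced without human raters or an evaluator LLM, which is the motivation the abstract gives for the benchmark.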

IF is a text benchmark that evaluates models on structured output and general tasks. LLM Stats tracks 2 models on this benchmark, scored on a 0–1 scale. The current average is 0.784, with the leader at 0.848.

Compare leaders on the best AI for structured output and best AI for general leaderboards.

Current leaders

Mistral Small 3.2 24B Instruct from Mistral AI currently leads the IF leaderboard with a score of 0.848 across 2 evaluated AI models.

2. MiniMax M2 (MiniMax): 72.0%

Source paper

Title: Instruction-Following Evaluation for Large Language Models
Authors: Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, and 4 others
Published: November 2023 (arXiv:2311.07911)
Abstract

One core capability of Large Language Models (LLMs) is to follow natural language instructions. However, the evaluation of such abilities is not standardized: Human evaluations are expensive, slow, and not objectively reproducible, while LLM-based auto-evaluation is potentially biased or limited by the ability of the evaluator LLM. To overcome these issues, we introduce Instruction-Following Eval (IFEval) for large language models. IFEval is a straightforward and easy-to-reproduce evaluation benchmark. It focuses on a set of "verifiable instructions" such as "write in more than 400 words" and "mention the keyword of AI at least 3 times". We identified 25 types of those verifiable instructions and constructed around 500 prompts, with each prompt containing one or more verifiable instructions. We show evaluation results of two widely available LLMs on the market. Our code and data can be found at https://github.com/google-research/google-research/tree/master/instruction_following_eval

FAQ

Common questions about the IF benchmark and leaderboard.

What is the IF benchmark?

IF is the Instruction-Following Evaluation (IFEval) benchmark for large language models. It focuses on verifiable instructions, with 25 instruction types and around 500 prompts that each contain one or more verifiable constraints.

What is the IF leaderboard?

The IF leaderboard ranks 2 AI models based on their performance on this benchmark. Currently, Mistral Small 3.2 24B Instruct by Mistral AI leads with a score of 0.848. The average score across all models is 0.784.
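As a quick check of the figures above, the following sketch recomputes the leader and the average from the two listed scores. It assumes the MiniMax M2 result shown on this page as 72.0% corresponds to 0.720 on the leaderboard's 0–1 scale; the dictionary and variable names are illustrative only.

```python
# Scores as listed on this page; 72.0% is assumed to map to 0.720 on the 0-1 scale.
scores = {
    "Mistral Small 3.2 24B Instruct": 0.848,
    "MiniMax M2": 0.720,
}

# Arithmetic mean across all evaluated models.
average = sum(scores.values()) / len(scores)

# The leader is the model with the highest score.
leader = max(scores, key=scores.get)

print(f"{leader}: {scores[leader]:.3f}, average: {average:.3f}")
# prints "Mistral Small 3.2 24B Instruct: 0.848, average: 0.784"
```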

What is the highest IF score?

The highest IF score is 0.848, achieved by Mistral Small 3.2 24B Instruct from Mistral AI.

How many models are evaluated on IF?

Two models have been evaluated on the IF benchmark. Both results are self-reported, and none are verified.

Where can I find the IF paper?

The IF paper is available at https://arxiv.org/abs/2311.07911. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does IF cover?

IF is categorized under structured output and general. The benchmark evaluates text models.

What is the best open-source model on IF?

Mistral Small 3.2 24B Instruct by Mistral AI is the top-ranked open-source model on IF, with a score of 0.848 (rank #1).

How recent are the IF leaderboard results?

The IF leaderboard was last updated in July 2026 and currently includes 2 evaluated models.