IF

Instruction-Following Evaluation (IFEval) is a benchmark for large language models that focuses on verifiable instructions. It comprises 25 types of instructions across around 500 prompts, each containing one or more verifiable constraints.
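Because IFEval's constraints are verifiable, adherence can be checked programmatically rather than judged by a model. The sketch below illustrates the idea with a few hypothetical constraint checkers; the instruction types and function names are illustrative assumptions, not the benchmark's actual implementation.

```python
# Illustrative sketch of IFEval-style verifiable-constraint checking.
# The constraint types shown here are assumptions for demonstration,
# not the benchmark's real checker code.

def check_word_count(response: str, max_words: int) -> bool:
    """Verify a 'respond in at most N words' constraint."""
    return len(response.split()) <= max_words

def check_no_commas(response: str) -> bool:
    """Verify a 'do not use any commas' constraint."""
    return "," not in response

def check_ends_with(response: str, phrase: str) -> bool:
    """Verify an 'end your answer with the exact phrase ...' constraint."""
    return response.rstrip().endswith(phrase)

if __name__ == "__main__":
    response = "Paris is the capital of France. Hope this helps!"
    print(check_word_count(response, 20))                 # True
    print(check_no_commas(response))                      # True
    print(check_ends_with(response, "Hope this helps!"))  # True
```

A prompt may carry several such constraints at once, in which case a response passes only if every checker returns True.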

Paper: https://arxiv.org/abs/2311.07911

Progress Over Time

[Interactive timeline showing model performance evolution on IF, tracing the state-of-the-art frontier and distinguishing open from proprietary models]

IF Leaderboard

2 models

Rank  Model                           Params  Context  Cost (in / out)  License
1     Mistral Small 3.2 24B Instruct  24B     —        —                —
2     MiniMax                         230B    1.0M     $0.30 / $1.20    —

FAQ

Common questions about IF

What is IF?
Instruction-Following Evaluation (IFEval) is a benchmark for large language models that focuses on verifiable instructions. It comprises 25 types of instructions across around 500 prompts, each containing one or more verifiable constraints.

Where can I find the IF paper?
The IF paper is available at https://arxiv.org/abs/2311.07911. It provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.

How are models ranked on the IF leaderboard?
The IF leaderboard ranks 2 AI models by their performance on this benchmark. Currently, Mistral Small 3.2 24B Instruct by Mistral AI leads with a score of 0.848. The average score across all models is 0.784.

What is the highest IF score?
The highest IF score is 0.848, achieved by Mistral Small 3.2 24B Instruct from Mistral AI.

How many models have been evaluated on IF?
2 models have been evaluated on the IF benchmark: 0 verified results and 2 self-reported results.

What categories does IF fall under?
IF is categorized under structured output and general. The benchmark evaluates text models.