IF
Instruction-Following Evaluation (IFEval) is a benchmark for large language models that focuses on verifiable instructions: it covers 25 instruction types across roughly 500 prompts, each containing one or more verifiable constraints.
Progress Over Time
[Interactive timeline of model performance on IF over time, tracing the state-of-the-art frontier and distinguishing open from proprietary models.]
IF Leaderboard
2 models
| Rank | Model | Params | Context | Cost (in / out per 1M tokens) | License |
|---|---|---|---|---|---|
| 1 | Mistral Small 3.2 24B Instruct (Mistral AI) | 24B | — | — | — |
| 2 | MiniMax | 230B | 1.0M | $0.30 / $1.20 | — |
FAQ
Common questions about IF
IFEval (Instruction-Following Evaluation) tests large language models on verifiable instructions: 25 instruction types spread across roughly 500 prompts, each carrying one or more constraints that can be checked programmatically, such as "answer with at least 300 words" or "wrap the output in JSON format".
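Because each constraint is machine-checkable, responses can be scored with plain string and format tests rather than an LLM judge. The sketch below illustrates the idea; the constraint names and checker functions are illustrative assumptions, not IFEval's released implementation.

```python
import json

# Minimal sketch of machine-checkable ("verifiable") instructions.
# The constraint names and checkers are illustrative assumptions,
# not IFEval's actual released code.

def check_min_words(response: str, n: int) -> bool:
    """Instruction: 'Answer with at least n words.'"""
    return len(response.split()) >= n

def check_keyword(response: str, keyword: str) -> bool:
    """Instruction: 'Include the keyword <keyword> in your response.'"""
    return keyword.lower() in response.lower()

def check_json(response: str) -> bool:
    """Instruction: 'Wrap the entire output in JSON format.'"""
    try:
        json.loads(response)
        return True
    except json.JSONDecodeError:
        return False

CHECKERS = {
    "min_words": check_min_words,
    "keyword": check_keyword,
    "json": check_json,
}

def passes_all(response: str, constraints: list[tuple]) -> bool:
    """A prompt may carry several constraints; strict scoring
    requires every one of them to pass."""
    return all(CHECKERS[name](response, *args) for name, *args in constraints)

# Example: a prompt demanding at least 5 words and the keyword "benchmark".
print(passes_all("IFEval is a verifiable benchmark for LLMs",
                 [("min_words", 5), ("keyword", "benchmark")]))  # True
```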
The IF paper is available at https://arxiv.org/abs/2311.07911. This paper provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.
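For hands-on use, the prompts are also mirrored on the Hugging Face Hub. A minimal loading sketch follows; the dataset id `google/IFEval` and the column names are assumptions to verify against whichever mirror you use.

```python
# Minimal sketch: load the IFEval prompts from the Hugging Face Hub.
# The dataset id "google/IFEval" and the column names below are
# assumptions; check them against the mirror you actually use.
from datasets import load_dataset

ds = load_dataset("google/IFEval", split="train")
print(len(ds))          # roughly 500 prompts
print(ds[0]["prompt"])  # prompt text containing the verifiable constraints
```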
The IF leaderboard ranks 2 AI models based on their performance on this benchmark. Currently, Mistral Small 3.2 24B Instruct by Mistral AI leads with a score of 0.848. The average score across all models is 0.784.
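(With only two models on the board, the 0.848 top score and the 0.784 average imply a second-place score of 2 × 0.784 − 0.848 = 0.720.)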
The highest IF score is 0.848, achieved by Mistral Small 3.2 24B Instruct from Mistral AI.
2 models have been evaluated on the IF benchmark, with 0 verified results and 2 self-reported results.
IF falls under the structured output and general categories, and it evaluates text models.