IFBench
Instruction Following Benchmark evaluating models' ability to follow complex instructions
Progress Over Time
[Interactive timeline showing model performance evolution on IFBench; series: state-of-the-art frontier, open models, proprietary models]
IFBench Leaderboard
16 models
| Rank | Model | Organization | Params | Context | Cost (input / output) | License |
|---|---|---|---|---|---|---|
| 1 | Hermes 3 70B | Nous Research | 70B | — | — | — |
| 2 | — | Alibaba Cloud / Qwen Team | 397B | 262K | $0.60 / $3.60 | — |
| 2 | — | Alibaba Cloud / Qwen Team | 27B | — | — | — |
| 4 | — | Alibaba Cloud / Qwen Team | 122B | 262K | $0.40 / $3.20 | — |
| 5 | Qwen3.6 Plus | Alibaba Cloud / Qwen Team | — | — | — | — |
| 6 | — | — | 120B | 262K | $0.10 / $0.50 | — |
| 7 | — | Inception | — | 128K | $0.25 / $0.75 | — |
| 8 | — | Alibaba Cloud / Qwen Team | 35B | 262K | $0.25 / $2.00 | — |
| 9 | — | MiniMax | 230B | 1.0M | $0.30 / $1.20 | — |
| 10 | — | OpenAI | 117B | 131K | $0.10 / $0.50 | — |
| 11 | — | LG AI Research | 236B | 33K | $0.60 / $1.00 | — |
| 12 | — | Alibaba Cloud / Qwen Team | 9B | — | — | — |
| 13 | — | Alibaba Cloud / Qwen Team | 4B | — | — | — |
| 14 | — | Mistral AI | 119B | 256K | $0.15 / $0.60 | — |
| 15 | — | Alibaba Cloud / Qwen Team | 2B | — | — | — |
| 16 | — | Alibaba Cloud / Qwen Team | 800M | — | — | — |
FAQ
Common questions about IFBench
The IFBench leaderboard ranks 16 AI models based on their performance on this benchmark. Currently, Hermes 3 70B by Nous Research leads with a score of 0.812. The average score across all models is 0.649.
The highest IFBench score is 0.812, achieved by Hermes 3 70B from Nous Research.
16 models have been evaluated on the IFBench benchmark; all 16 results are self-reported, and none have been independently verified.
IFBench falls under the general and instruction-following categories and evaluates text models.
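Instruction-following benchmarks of this kind are typically scored by checking each model response against programmatically verifiable constraints (word limits, required keywords, formatting rules) and reporting the fraction of prompts where every constraint holds. The sketch below illustrates that idea only; it is not IFBench's actual harness, and the constraint names (`max_words`, `must_include`, `ends_with`) are hypothetical:

```python
def check_constraints(response: str, constraints: dict) -> bool:
    """Return True if a response satisfies every constraint.
    Illustrative sketch only -- not IFBench's real evaluation code."""
    if "max_words" in constraints:
        if len(response.split()) > constraints["max_words"]:
            return False
    if "must_include" in constraints:
        # Case-insensitive keyword check.
        if any(kw.lower() not in response.lower()
               for kw in constraints["must_include"]):
            return False
    if "ends_with" in constraints:
        if not response.rstrip().endswith(constraints["ends_with"]):
            return False
    return True


def benchmark_score(responses, constraint_sets):
    """Fraction of responses whose constraints are all satisfied."""
    passed = sum(check_constraints(r, c)
                 for r, c in zip(responses, constraint_sets))
    return passed / len(responses)
```

A model's leaderboard score would then be the mean pass rate over the benchmark's prompt set, which is why scores fall between 0 and 1.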