BFCL-V4
Berkeley Function Calling Leaderboard V4 (BFCL-V4) evaluates LLMs on their ability to accurately call functions and APIs, including simple, multiple, parallel, and nested function calls across diverse programming scenarios.
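To make the call categories concrete, here is a minimal, illustrative sketch of what grading a "simple" versus a "parallel" function call could look like. This is not BFCL's actual harness or schema (the benchmark's real grading logic is not shown on this page); the function names and the order-insensitive multiset comparison are assumptions for illustration only.

```python
# Illustrative sketch only -- not BFCL's actual evaluation code.

def match_call(predicted: dict, expected: dict) -> bool:
    """True if a predicted call names the right function with the right arguments."""
    return (predicted.get("name") == expected["name"]
            and predicted.get("arguments") == expected["arguments"])

def match_parallel(predicted: list, expected: list) -> bool:
    """Parallel calls are independent, so compare them order-insensitively."""
    if len(predicted) != len(expected):
        return False
    remaining = list(expected)
    for call in predicted:
        for exp in remaining:
            if match_call(call, exp):
                remaining.remove(exp)
                break
        else:
            return False
    return True

# Simple: a single call to one function (hypothetical get_weather tool).
simple_expected = {"name": "get_weather", "arguments": {"city": "Berkeley"}}

# Parallel: several independent calls emitted for one user turn.
parallel_expected = [
    {"name": "get_weather", "arguments": {"city": "Berkeley"}},
    {"name": "get_weather", "arguments": {"city": "Oakland"}},
]
```

A "multiple" task would additionally require choosing the right function out of several candidates, and a "nested" task would feed one call's result into another's arguments.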
Progress Over Time
[Interactive timeline: model performance on BFCL-V4 over time, showing the state-of-the-art frontier and distinguishing open from proprietary models.]
BFCL-V4 Leaderboard
8 models
| Rank | Organization | Params | Context | Cost (input / output) | License |
|---|---|---|---|---|---|
| 1 | Alibaba Cloud / Qwen Team | 397B | 262K | $0.60 / $3.60 | — |
| 2 | Alibaba Cloud / Qwen Team | 122B | 262K | $0.40 / $3.20 | — |
| 3 | Alibaba Cloud / Qwen Team | 27B | — | — | — |
| 4 | Alibaba Cloud / Qwen Team | 35B | 262K | $0.25 / $2.00 | — |
| 5 | Alibaba Cloud / Qwen Team | 9B | — | — | — |
| 6 | Alibaba Cloud / Qwen Team | 4B | — | — | — |
| 7 | Alibaba Cloud / Qwen Team | 2B | — | — | — |
| 8 | Alibaba Cloud / Qwen Team | 800M | — | — | — |
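Assuming the Cost column lists input and output prices per million tokens (a common convention, but an assumption here since the page does not say), a per-request cost can be estimated as follows; the token counts in the example are hypothetical.

```python
# Assumption: Cost column = "$ per 1M input tokens / $ per 1M output tokens".

def estimate_cost(input_tokens: int, output_tokens: int,
                  price_in: float, price_out: float) -> float:
    """Dollar cost of one request at per-million-token prices."""
    return input_tokens / 1e6 * price_in + output_tokens / 1e6 * price_out

# Example using the rank-1 row ($0.60 input / $3.60 output)
# with a hypothetical 200K-token prompt and 10K-token response:
cost = estimate_cost(200_000, 10_000, 0.60, 3.60)  # 0.12 + 0.036 = 0.156
```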
FAQ
Common questions about BFCL-V4
What does BFCL-V4 evaluate?
Berkeley Function Calling Leaderboard V4 (BFCL-V4) evaluates LLMs on their ability to accurately call functions and APIs, including simple, multiple, parallel, and nested function calls across diverse programming scenarios.

Which model leads the BFCL-V4 leaderboard?
The leaderboard ranks 8 AI models by their performance on this benchmark. Currently, Qwen3.5-397B-A17B by Alibaba Cloud / Qwen Team leads with a score of 0.729. The average score across all 8 models is 0.583.

What is the highest BFCL-V4 score?
The highest BFCL-V4 score is 0.729, achieved by Qwen3.5-397B-A17B from Alibaba Cloud / Qwen Team.

How many models have been evaluated?
8 models have been evaluated on the BFCL-V4 benchmark; all 8 results are self-reported, and none are independently verified.

How is BFCL-V4 categorized?
BFCL-V4 falls under the tool calling and agents categories, and it evaluates text models.