BFCL-V4

Berkeley Function Calling Leaderboard V4 (BFCL-V4) evaluates LLMs on their ability to accurately call functions and APIs, including simple, multiple, parallel, and nested function calls across diverse programming scenarios.
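The call categories above can be made concrete with a small sketch. BFCL evaluates whether a model's emitted tool calls match an expected set (its actual grader is AST-based); the `check_call` helper, the call dictionaries, and the `get_weather` function below are illustrative assumptions, not BFCL's real API or data format.

```python
# Hypothetical matcher for a function-calling test case.
# A "parallel" case expects several independent calls in one model turn.

def check_call(candidate: dict, expected: dict) -> bool:
    """Return True if the candidate call has the expected function name
    and every expected argument value (extra arguments are tolerated)."""
    if candidate.get("name") != expected["name"]:
        return False
    cand_args = candidate.get("arguments", {})
    return all(cand_args.get(k) == v for k, v in expected["arguments"].items())

# Expected calls for a parallel test case (illustrative).
expected_calls = [
    {"name": "get_weather", "arguments": {"city": "Berkeley"}},
    {"name": "get_weather", "arguments": {"city": "Tokyo"}},
]

# A hypothetical model output: both calls present, one with an extra arg.
model_output = [
    {"name": "get_weather", "arguments": {"city": "Berkeley", "units": "C"}},
    {"name": "get_weather", "arguments": {"city": "Tokyo"}},
]

# The case passes only if every expected call is matched by some emitted call.
all_matched = all(
    any(check_call(c, e) for c in model_output) for e in expected_calls
)
print(all_matched)  # True for this example
```

Simple and multiple-function cases reduce to a one-element `expected_calls` list; nested cases additionally require one call's result to feed another's arguments.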

Progress Over Time

[Interactive timeline showing model performance evolution on BFCL-V4, with open and proprietary models plotted against a state-of-the-art frontier.]

BFCL-V4 Leaderboard

8 models

| Rank | Model | Organization | Params | Context | Cost (input / output) | License |
|---|---|---|---|---|---|---|
| 1 | Qwen3.5-397B-A17B | Alibaba Cloud / Qwen Team | 397B | 262K | $0.60 / $3.60 | |
| 2 | | Alibaba Cloud / Qwen Team | 122B | 262K | $0.40 / $3.20 | |
| 3 | | Alibaba Cloud / Qwen Team | 27B | | | |
| 4 | | Alibaba Cloud / Qwen Team | 35B | 262K | $0.25 / $2.00 | |
| 5 | | Alibaba Cloud / Qwen Team | 9B | | | |
| 6 | | Alibaba Cloud / Qwen Team | 4B | | | |
| 7 | | Alibaba Cloud / Qwen Team | 2B | | | |
| 8 | | Alibaba Cloud / Qwen Team | 800M | | | |

FAQ

Common questions about BFCL-V4

Berkeley Function Calling Leaderboard V4 (BFCL-V4) evaluates LLMs on their ability to accurately call functions and APIs, including simple, multiple, parallel, and nested function calls across diverse programming scenarios.

The BFCL-V4 leaderboard ranks 8 AI models by their performance on this benchmark. Currently, Qwen3.5-397B-A17B by Alibaba Cloud / Qwen Team leads with a score of 0.729. The average score across all models is 0.583.

The highest BFCL-V4 score is 0.729, achieved by Qwen3.5-397B-A17B from Alibaba Cloud / Qwen Team.

8 models have been evaluated on the BFCL-V4 benchmark; all 8 results are self-reported, and none have been independently verified.

BFCL-V4 is categorized under tool calling and agents. The benchmark evaluates text models.