BFCL-V4
BFCL-V4 Leaderboard
| Rank | Organization | Parameters | Context | Cost | License |
|---|---|---|---|---|---|
| 1 | Alibaba Cloud / Qwen Team | — | 1.0M | $1.25 / $3.75 | |
| 2 | Alibaba Cloud / Qwen Team | — | 1.0M | $0.32 / $1.28 | |
| 2 | Alibaba Cloud / Qwen Team | 397B | — | — | |
| 4 | Alibaba Cloud / Qwen Team | 122B | — | — | |
| 5 | Alibaba Cloud / Qwen Team | 27B | 262K | $0.30 / $2.40 | |
| 6 | Alibaba Cloud / Qwen Team | 35B | — | — | |
| 7 | Alibaba Cloud / Qwen Team | 9B | — | — | |
| 8 | Amazon | — | — | — | |
| 9 | Amazon | — | 1.0M | $0.30 / $2.50 | |
| 10 | Amazon | — | — | — | |
| 11 | Alibaba Cloud / Qwen Team | 4B | — | — | |
| 12 | Alibaba Cloud / Qwen Team | 2B | — | — | |
| 13 | Alibaba Cloud / Qwen Team | 800M | — | — | |
What is BFCL-V4?
Berkeley Function Calling Leaderboard V4 (BFCL-V4) evaluates LLMs on their ability to accurately call functions and APIs, including simple, multiple, parallel, and nested function calls across diverse programming scenarios.
BFCL-V4 is a text benchmark that evaluates models on agent and tool-calling tasks. LLM Stats tracks 13 models on this benchmark, scored on a 0–1 scale. The current average is about 0.6, and the leader scores 0.750.
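To make the idea of scoring function calls on a 0–1 scale concrete, here is a minimal Python sketch of how a predicted call could be compared with an expected one. It is an illustration only, not the BFCL-V4 evaluation harness: the exact-match rule, the call format (`name` plus `arguments`), and the `get_weather` function are all assumptions made for this example.

```python
import json
from collections import Counter


def normalize(call):
    """Turn a call dict into a hashable key that ignores argument order."""
    return (call["name"], json.dumps(call["arguments"], sort_keys=True))


def calls_match(predicted, expected):
    """Return True when both lists hold the same calls, in any order.

    Assumption: a call is correct only if its name and all arguments
    match exactly. A real benchmark may accept several valid values.
    """
    return Counter(map(normalize, predicted)) == Counter(map(normalize, expected))


def accuracy(results):
    """Fraction of test cases that passed, on a 0-1 scale."""
    return sum(results) / len(results) if results else 0.0


# Simple call: one hypothetical function, one invocation.
simple_expected = [{"name": "get_weather", "arguments": {"city": "Paris", "unit": "C"}}]
simple_predicted = [{"name": "get_weather", "arguments": {"unit": "C", "city": "Paris"}}]

# Parallel calls: the same hypothetical function invoked twice in one turn.
parallel_expected = [
    {"name": "get_weather", "arguments": {"city": "Paris", "unit": "C"}},
    {"name": "get_weather", "arguments": {"city": "Tokyo", "unit": "C"}},
]
parallel_predicted = list(reversed(parallel_expected))

# A wrong argument value counts as a failure under the exact-match assumption.
wrong_predicted = [{"name": "get_weather", "arguments": {"city": "Rome", "unit": "C"}}]

results = [
    calls_match(simple_predicted, simple_expected),
    calls_match(parallel_predicted, parallel_expected),
    calls_match(wrong_predicted, simple_expected),
    calls_match(simple_expected, simple_expected),
]
print(accuracy(results))
```

Comparing calls as a multiset keeps parallel calls order-independent while still catching a missing or duplicated call. A stricter or looser matching rule would change the resulting score.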
Current leaders
Qwen3.7 Max from Alibaba Cloud / Qwen Team currently leads the BFCL-V4 leaderboard with a score of 0.750 across 13 evaluated AI models.