BFCL-v3
BFCL-v3 Leaderboard
| Rank | Organization | Parameters | Context | Cost | License |
|---|---|---|---|---|---|
| 1 | Zhipu AI | 355B | — | — | |
| 2 | Zhipu AI | 106B | — | — | |
| 3 | Meituan | 560B | — | — | |
| 4 | Alibaba Cloud / Qwen Team | 80B | — | — | |
| 4 | Microsoft | 1.0T | — | — | |
| 6 | Alibaba Cloud / Qwen Team | 236B | — | — | |
| 6 | Alibaba Cloud / Qwen Team | 235B | — | — | |
| 8 | Alibaba Cloud / Qwen Team | 33B | — | — | |
| 9 | Alibaba Cloud / Qwen Team | 235B | — | — | |
| 10 | Alibaba Cloud / Qwen Team | 80B | — | — | |
| 11 | Alibaba Cloud / Qwen Team | 33B | — | — | |
| 12 | Alibaba Cloud / Qwen Team | 480B | — | — | |
| 13 | Alibaba Cloud / Qwen Team | 31B | — | — | |
| 14 | Alibaba Cloud / Qwen Team | 236B | — | — | |
| 15 | Alibaba Cloud / Qwen Team | 4B | 262K | $0.10 / $1.00 | |
| 16 | Alibaba Cloud / Qwen Team | 31B | — | — | |
| 16 | Alibaba Cloud / Qwen Team | 9B | — | — | |
| 18 | Alibaba Cloud / Qwen Team | 4B | 262K | $0.10 / $0.60 | |
| 19 | Alibaba Cloud / Qwen Team | 9B | 262K | $0.18 / $2.09 | |
What is BFCL-v3?
Berkeley Function Calling Leaderboard v3 (BFCL-v3) is an advanced benchmark that evaluates large language models' function calling capabilities through multi-turn and multi-step interactions. It introduces extended conversational exchanges where models must retain contextual information across turns and execute multiple internal function calls for complex user requests. The benchmark includes 1000 test cases across domains like vehicle control, trading bots, travel booking, and file system management, using state-based evaluation to verify both system state changes and execution path correctness.
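The idea of state-based evaluation described above can be illustrated with a small sketch. This is not the benchmark's actual harness: `ToyFileSystem`, `run_calls`, `is_subsequence`, and `evaluate` are hypothetical names invented for this example, and treating "execution path correctness" as "the reference calls appear in order within the model's calls" is an assumption made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyFileSystem:
    """Hypothetical stateful backend, standing in for a real API (not BFCL code)."""
    files: dict[str, str] = field(default_factory=dict)

    def touch(self, name: str) -> None:
        self.files.setdefault(name, "")

    def write(self, name: str, text: str) -> None:
        if name not in self.files:
            raise FileNotFoundError(name)
        self.files[name] = text

    def mv(self, src: str, dst: str) -> None:
        self.files[dst] = self.files.pop(src)


Call = tuple[str, tuple]  # (function name, positional arguments)


def run_calls(calls: list[Call]) -> ToyFileSystem:
    """Replay a sequence of function calls against a fresh backend."""
    fs = ToyFileSystem()
    for name, args in calls:
        getattr(fs, name)(*args)
    return fs


def is_subsequence(required: list[str], actual: list[str]) -> bool:
    """True if `required` appears in `actual` in order (gaps allowed)."""
    remaining = iter(actual)
    return all(name in remaining for name in required)


def evaluate(model_calls: list[Call], reference_calls: list[Call]) -> dict[str, bool]:
    """Assumed scoring rule: compare final state and the order of calls made."""
    state_match = run_calls(model_calls).files == run_calls(reference_calls).files
    path_match = is_subsequence(
        [name for name, _ in reference_calls],
        [name for name, _ in model_calls],
    )
    return {"state_match": state_match, "path_match": path_match}


reference = [("touch", ("a.txt",)), ("write", ("a.txt", "hi")), ("mv", ("a.txt", "b.txt"))]
# Same end state, with one extra intermediate write.
model_good = [("touch", ("a.txt",)), ("write", ("a.txt", "draft")),
              ("write", ("a.txt", "hi")), ("mv", ("a.txt", "b.txt"))]
# Same end state, but the move was skipped by creating b.txt directly.
model_bad = [("touch", ("b.txt",)), ("write", ("b.txt", "hi"))]

print(evaluate(model_good, reference))
print(evaluate(model_bad, reference))
```

In this toy setup both model traces end in the same file system state, but only `model_good` contains the reference calls in order, which shows why a state check alone may not be enough to tell the two apart.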
BFCL-v3 is a text benchmark listed under the reasoning, structured output, finance, general, agents, and tool calling categories. LLM Stats tracks 19 models on this benchmark, scored on a 0–1 scale. The current average is about 0.7, and the leading score is 0.778.
Current leaders
GLM-4.5 from Zhipu AI currently leads the BFCL-v3 leaderboard with a score of 0.778 across 19 evaluated AI models.
FAQ
Common questions about the BFCL-v3 benchmark and leaderboard.