BFCL-V4

Progress Over Time

[Interactive timeline: model performance on BFCL-V4 over time, showing the state-of-the-art frontier for open and proprietary models.]

BFCL-V4 Leaderboard

13 models
Rank  Model              Organization               Parameters  Context  Cost per 1M tokens (input / output)
1     Qwen3.7 Max        Alibaba Cloud / Qwen Team  -           1.0M     $1.25 / $3.75
2     Qwen3.7-Plus       Alibaba Cloud / Qwen Team  -           1.0M     $0.32 / $1.28
2     Qwen3.5-397B-A17B  Alibaba Cloud / Qwen Team  397B        -        -
4     -                  Alibaba Cloud / Qwen Team  122B        -        -
5     Qwen3.5-27B        Alibaba Cloud / Qwen Team  27B         262K     $0.30 / $2.40
6     -                  Alibaba Cloud / Qwen Team  35B         -        -
7     -                  Alibaba Cloud / Qwen Team  9B          -        -
8     -                  -                          -           -        -
9     -                  -                          -           1.0M     $0.30 / $2.50
10    -                  -                          -           -        -
11    -                  Alibaba Cloud / Qwen Team  4B          -        -
12    -                  Alibaba Cloud / Qwen Team  2B          -        -
13    -                  Alibaba Cloud / Qwen Team  800M        -        -
About this benchmark

What is BFCL-V4?

Berkeley Function Calling Leaderboard V4 (BFCL-V4) evaluates LLMs on their ability to accurately call functions and APIs, including simple, multiple, parallel, and nested function calls across diverse programming scenarios.
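
To make the call categories concrete, here is a minimal sketch of how a simple call and a parallel call might be represented and compared. It is not BFCL's evaluation harness: the `Call` type, the `call` and `same_calls` helpers, and the `get_weather` tool are all hypothetical, and exact-match comparison that ignores order is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    """One function call: a tool name plus its keyword arguments."""
    name: str
    args: tuple  # sorted (key, value) pairs, so argument order does not matter


def call(name, **kwargs):
    # Hypothetical helper: normalise keyword arguments into a sorted tuple.
    return Call(name, tuple(sorted(kwargs.items())))


# "Simple": one call to one function (get_weather is a made-up tool).
simple = [call("get_weather", city="Paris")]

# "Parallel": several independent calls issued in the same turn.
parallel = [
    call("get_weather", city="Paris"),
    call("get_weather", city="Tokyo"),
]


def same_calls(predicted, expected):
    """Return True if both lists hold the same calls, ignoring their order."""
    return sorted(predicted, key=repr) == sorted(expected, key=repr)
```

Under this sketch, `same_calls(list(reversed(parallel)), parallel)` is true because order is ignored, while a prediction that drops one of the two calls fails the comparison.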

BFCL-V4 is a text benchmark that evaluates models on agent and tool-calling tasks. LLM Stats tracks 13 models on this benchmark, scored on a 0–1 scale. The current average is 0.611, with the leader at 0.750.


Current leaders

Qwen3.7 Max from Alibaba Cloud / Qwen Team currently leads the BFCL-V4 leaderboard with a score of 0.750 across 13 evaluated AI models.

1. Qwen3.7 Max (Alibaba Cloud / Qwen Team): 75.0%
2. Qwen3.7-Plus (Alibaba Cloud / Qwen Team): 72.9%
2. Qwen3.5-397B-A17B (Alibaba Cloud / Qwen Team): 72.9%

FAQ

Common questions about the BFCL-V4 benchmark and leaderboard.

What is the BFCL-V4 benchmark?

Berkeley Function Calling Leaderboard V4 (BFCL-V4) evaluates LLMs on their ability to accurately call functions and APIs, including simple, multiple, parallel, and nested function calls across diverse programming scenarios.

What is the BFCL-V4 leaderboard?

The BFCL-V4 leaderboard ranks 13 AI models based on their performance on this benchmark. Currently, Qwen3.7 Max by Alibaba Cloud / Qwen Team leads with a score of 0.750. The average score across all models is 0.611.

What is the highest BFCL-V4 score?

The highest BFCL-V4 score is 0.750, achieved by Qwen3.7 Max from Alibaba Cloud / Qwen Team.

How many models are evaluated on BFCL-V4?

13 models have been evaluated on the BFCL-V4 benchmark, with 0 verified results and 13 self-reported results.

What categories does BFCL-V4 cover?

BFCL-V4 is categorized under agents and tool calling. The benchmark evaluates text models.

What is the best open-source model on BFCL-V4?

Qwen3.5-397B-A17B by Alibaba Cloud / Qwen Team is the top-ranked open-source model on BFCL-V4, with a score of 0.729 (rank #2).

Which model offers the best value on BFCL-V4?

Among models scoring within 10% of the leader, Qwen3.5-27B from Alibaba Cloud / Qwen Team is the cheapest, at $0.30 per million input tokens with a score of 0.685.
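
The selection rule above can be sketched as a filter followed by a minimum. This is an illustration rather than the site's actual method: it assumes "within 10% of the leader" means a score of at least 90% of the top score, it takes the Qwen3.7 Max input price from the rank 1 leaderboard row, and the third entry is a hypothetical model added only to show the filter at work.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    model: str
    score: float        # BFCL-V4 score on the 0-1 scale
    input_price: float  # USD per million input tokens


entries = [
    Entry("Qwen3.7 Max", 0.750, 1.25),
    Entry("Qwen3.5-27B", 0.685, 0.30),
    # Hypothetical entry, not on the leaderboard: cheap but too far behind.
    Entry("hypothetical-small-model", 0.500, 0.05),
]


def best_value(entries, tolerance=0.10):
    """Cheapest model whose score is within `tolerance` (relative) of the leader."""
    leader_score = max(e.score for e in entries)
    threshold = leader_score * (1 - tolerance)  # assumed relative cutoff
    eligible = [e for e in entries if e.score >= threshold]
    return min(eligible, key=lambda e: e.input_price)


print(best_value(entries).model)
```

With a leader score of 0.750 the cutoff is 0.675, so Qwen3.5-27B (0.685) qualifies and wins on price, while the hypothetical cheaper model is excluded for scoring too low.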

How recent are the BFCL-V4 leaderboard results?

The BFCL-V4 leaderboard was last updated in July 2026 and currently includes 13 evaluated models.