BFCL-V4

Berkeley Function Calling Leaderboard V4 (BFCL-V4) evaluates LLMs on their ability to accurately call functions and APIs, including simple, multiple, parallel, and nested function calls across diverse programming scenarios.
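To make the four call shapes concrete, here is a minimal sketch that represents each one as plain Python data and labels it. Everything in it is hypothetical: the `Call` class, the `classify` helper, the tool names (`get_weather`, `geocode`), and the working definitions of each category (for example, "multiple" meaning one call chosen from several available tools). It is an illustration of the terminology, not the BFCL-V4 data format or scoring code.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Call:
    """A hypothetical function call: a tool name plus keyword arguments."""
    name: str
    args: dict[str, Any] = field(default_factory=dict)


def has_nested(call: Call) -> bool:
    # A call is nested when one of its arguments is itself a call.
    return any(isinstance(value, Call) for value in call.args.values())


def classify(calls: list[Call], n_available: int) -> str:
    """Label a model response under the assumed definitions above."""
    if any(has_nested(c) for c in calls):
        return "nested"
    if len(calls) > 1:
        return "parallel"
    # One flat call: the label depends on how many tools were offered.
    return "multiple" if n_available > 1 else "simple"


# One tool offered, one call made.
simple = classify([Call("get_weather", {"city": "Berkeley"})], n_available=1)

# Several tools offered, the model picks one.
multiple = classify([Call("get_weather", {"city": "Berkeley"})], n_available=3)

# Two independent calls issued in the same turn.
parallel = classify(
    [Call("get_weather", {"city": "Berkeley"}),
     Call("get_weather", {"city": "Oakland"})],
    n_available=1,
)

# The output of geocode feeds the argument of get_weather.
nested = classify(
    [Call("get_weather", {"location": Call("geocode", {"city": "Berkeley"})})],
    n_available=2,
)

print(simple, multiple, parallel, nested)
```

A real evaluation would also have to check argument names, types, and values against a reference answer; this sketch only shows how the shapes differ.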

Qwen3.5-397B-A17B from Alibaba Cloud / Qwen Team currently leads the BFCL-V4 leaderboard of 8 evaluated models with a score of 0.729.

Qwen3.5-397B-A17B (Alibaba Cloud / Qwen Team) leads with 72.9%, followed by Qwen3.5-122B-A10B at 72.2% and Qwen3.5-27B at 68.5%, both also from Alibaba Cloud / Qwen Team.

Progress Over Time

[Interactive timeline: model performance over time on BFCL-V4, with the state-of-the-art frontier shown for open and proprietary models]

BFCL-V4 Leaderboard

8 models

Rank  Organization               Parameters  Context  Cost           License
1     Alibaba Cloud / Qwen Team  397B        262K     $0.60 / $3.60  -
2     Alibaba Cloud / Qwen Team  122B        262K     $0.40 / $3.20  -
3     Alibaba Cloud / Qwen Team  27B         262K     $0.30 / $2.40  -
4     Alibaba Cloud / Qwen Team  35B         262K     $0.25 / $2.00  -
5     Alibaba Cloud / Qwen Team  9B          -        -              -
6     Alibaba Cloud / Qwen Team  4B          -        -              -
7     Alibaba Cloud / Qwen Team  2B          -        -              -
8     Alibaba Cloud / Qwen Team  800M        -        -              -

FAQ

Common questions about BFCL-V4.

What is the BFCL-V4 benchmark?

Berkeley Function Calling Leaderboard V4 (BFCL-V4) evaluates LLMs on their ability to accurately call functions and APIs, including simple, multiple, parallel, and nested function calls across diverse programming scenarios.

What is the BFCL-V4 leaderboard?

The BFCL-V4 leaderboard ranks 8 AI models based on their performance on this benchmark. Currently, Qwen3.5-397B-A17B by Alibaba Cloud / Qwen Team leads with a score of 0.729. The average score across all models is 0.583.
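As a quick sanity check on these figures, the reported average can be combined with the top three scores listed on this page (72.9%, 72.2%, 68.5%) to estimate the mean of the remaining five models. This is a back-of-the-envelope sketch: the individual scores of ranks 4 to 8 are not given here, and the reported mean is rounded, so the result is approximate.

```python
# Figures quoted on this page (scores on a 0-1 scale).
N_MODELS = 8
REPORTED_MEAN = 0.583
TOP_THREE = [0.729, 0.722, 0.685]

# Sum of all scores implied by the reported mean.
total = REPORTED_MEAN * N_MODELS

# Subtract the known top three and average what is left.
rest_count = N_MODELS - len(TOP_THREE)
rest_mean = (total - sum(TOP_THREE)) / rest_count

print(f"implied mean of the other {rest_count} models: {rest_mean:.3f}")
```

Under these assumptions the other five models average roughly 0.51, well below the three leaders.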

What is the highest BFCL-V4 score?

The highest BFCL-V4 score is 0.729, achieved by Qwen3.5-397B-A17B from Alibaba Cloud / Qwen Team.

How many models are evaluated on BFCL-V4?

8 models have been evaluated on the BFCL-V4 benchmark. All 8 results are self-reported; none have been verified.

What categories does BFCL-V4 cover?

BFCL-V4 is categorized under tool calling and agents. The benchmark evaluates text models.

More evaluations to explore

Related benchmarks in the same category

BrowseComp

BrowseComp is a benchmark comprising 1,266 questions that challenge AI agents to persistently navigate the internet in search of hard-to-find, entangled information. The benchmark measures agents' ability to exercise persistence in information gathering, demonstrate creativity in web navigation, and find concise, verifiable answers. Despite the difficulty of the questions, BrowseComp is simple and easy to use, as predicted answers are short and easily verifiable against reference answers.

agents
46 models
Terminal-Bench 2.0

Terminal-Bench 2.0 is an updated benchmark that tests AI agents' ability to use tools to operate a computer through a terminal. It evaluates how well models can handle real-world, end-to-end tasks autonomously, including compiling code, training models, setting up servers, system administration, security tasks, data science workflows, and cybersecurity vulnerabilities.

tool calling
40 models
Tau2 Telecom

τ²-Bench telecom domain evaluates conversational agents in a dual-control environment modeled as a Dec-POMDP, where both agent and user use tools in shared telecommunications troubleshooting scenarios that test coordination and communication capabilities.

tool calling
30 models
TAU-bench Retail

A benchmark for evaluating tool-agent-user interaction in retail environments. Tests language agents' ability to handle dynamic conversations with users while using domain-specific API tools and following policy guidelines. Evaluates agents on tasks like order cancellations, address changes, and order status checks through multi-turn conversations.

tool calling
25 models
Tau2 Retail

τ²-bench retail domain evaluates conversational AI agents in customer service scenarios within a dual-control environment where both agent and user can interact with tools. Tests tool-agent-user interaction, rule adherence, and task consistency in retail customer support contexts.

tool calling
23 models
TAU-bench Airline

Part of τ-bench (TAU-bench), a benchmark for Tool-Agent-User interaction in real-world domains. The airline domain evaluates language agents' ability to interact with users through dynamic conversations while following domain-specific rules and using API tools. Agents must handle airline-related tasks and policies reliably.

tool calling
23 models