BFCL

Progress Over Time

Interactive timeline showing model performance evolution on BFCL (legend: state-of-the-art frontier, open, proprietary).

BFCL Leaderboard

11 models

Rank  Organization               Parameters  Context  Cost           License
1     -                          405B        -        -              -
2     -                          70B         -        -              -
3     -                          8B          -        -              -
4     -                          -           1.0M     $0.33 / $2.75  -
5     Alibaba Cloud / Qwen Team  235B        -        -              -
6     Alibaba Cloud / Qwen Team  33B         128K     $0.10 / $0.44  -
7     Alibaba Cloud / Qwen Team  31B         128K     $0.10 / $0.44  -
8     Amazon                     -           -        -              -
9     Amazon                     -           -        -              -
10    Alibaba Cloud / Qwen Team  33B         -        -              -
11    -                          -           -        -              -

About this benchmark

What is BFCL?

The Berkeley Function Calling Leaderboard (BFCL) is the first comprehensive and executable function call evaluation dedicated to assessing Large Language Models' ability to invoke functions. It evaluates serial and parallel function calls across multiple programming languages (Python, Java, JavaScript, REST API) using a novel Abstract Syntax Tree (AST) evaluation method. The benchmark consists of over 2,000 question-function-answer pairs covering diverse application domains and complex use cases including multiple function calls, parallel function calls, and multi-turn interactions.
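
To make the AST-based evaluation idea concrete, here is a minimal sketch in Python. It is not BFCL's actual checker: the helper names (`parse_call`, `call_matches`, `parallel_calls_match`), the "allowed values per argument" format, and the `get_weather` function are assumptions made for illustration, and the sketch handles only keyword arguments with literal values. It parses a generated call into a syntax tree, then compares the function name and arguments structurally instead of comparing raw strings.

```python
import ast


def parse_call(source: str) -> tuple[str, dict[str, object]]:
    """Parse a single call such as "f(a=1, b='x')" into (name, keyword arguments)."""
    node = ast.parse(source.strip(), mode="eval").body
    if not isinstance(node, ast.Call):
        raise ValueError(f"not a function call: {source!r}")
    if node.args:
        raise ValueError("this sketch only handles keyword arguments")
    name = ast.unparse(node.func)  # also covers dotted names such as "math.sqrt"
    # literal_eval accepts AST nodes and rejects anything that is not a literal.
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    return name, kwargs


def call_matches(generated: str, expected_name: str, allowed: dict[str, list]) -> bool:
    """True if the call names the expected function and every argument
    takes one of its allowed values (hypothetical answer format)."""
    try:
        name, kwargs = parse_call(generated)
    except (SyntaxError, ValueError, TypeError):
        return False
    if name != expected_name or set(kwargs) != set(allowed):
        return False
    return all(kwargs[key] in allowed[key] for key in allowed)


def parallel_calls_match(
    generated: list[str], expected: list[tuple[str, dict[str, list]]]
) -> bool:
    """Order-insensitive check for parallel calls: each expected call must be
    satisfied by a distinct generated call (greedy matching, enough for a sketch)."""
    remaining = list(generated)
    for name, allowed in expected:
        hit = next((g for g in remaining if call_matches(g, name, allowed)), None)
        if hit is None:
            return False
        remaining.remove(hit)
    return not remaining  # leftover generated calls count as a mismatch
```

For example, `call_matches("get_weather(city='Paris', unit='celsius')", "get_weather", {"city": ["Paris"], "unit": ["celsius", "C"]})` accepts the call regardless of argument order or quoting style, which is the main advantage of comparing syntax trees over comparing strings.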

BFCL is a text benchmark that covers reasoning, general, and tool calling tasks. LLM Stats tracks 11 models on it, scored on a 0–1 scale. The current average is 0.720, and the leader scores 0.885.

Current leaders

Llama 3.1 405B Instruct from Meta currently leads the BFCL leaderboard with a score of 0.885, the highest among the 11 evaluated models.

FAQ

Common questions about the BFCL benchmark and leaderboard.

What is the BFCL leaderboard?

The BFCL leaderboard ranks 11 AI models based on their performance on this benchmark. Currently, Llama 3.1 405B Instruct by Meta leads with a score of 0.885. The average score across all models is 0.720.

What is the highest BFCL score?

The highest BFCL score is 0.885, achieved by Llama 3.1 405B Instruct from Meta.

How many models are evaluated on BFCL?

11 models have been evaluated on the BFCL benchmark. All 11 results are self-reported; none have been verified.

Where can I find the BFCL paper?

The BFCL paper is available at https://openreview.net/pdf?id=2GmDdhBdDk. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does BFCL cover?

BFCL is categorized under reasoning, general, and tool calling. The benchmark evaluates text models.

What is the best open-source model on BFCL?

Llama 3.1 405B Instruct by Meta is the top-ranked open-source model on BFCL, with a score of 0.885 (rank #1).

How recent are the BFCL leaderboard results?

The BFCL leaderboard was last updated in July 2026 and currently includes 11 evaluated models.