BFCL v2
Berkeley Function Calling Leaderboard (BFCL) v2 is a comprehensive benchmark for evaluating the function-calling capabilities of large language models. It features 2,251 question-function-answer pairs built from enterprise and OSS-contributed functions, addressing data contamination and bias through live, user-contributed scenarios. The benchmark measures AST accuracy, executable accuracy, irrelevance detection, and relevance detection across multiple programming languages (Python, Java, JavaScript) and includes complex real-world function-calling scenarios with multilingual prompts.
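The idea behind AST accuracy can be illustrated with a minimal sketch: parse both the model's emitted call and the ground-truth call into abstract syntax trees and compare function names and arguments structurally, so that whitespace and keyword-argument order do not affect the result. The helper names below are illustrative only and are not part of the official BFCL evaluation harness.

```python
import ast

def parse_call(source: str) -> ast.Call:
    """Parse a single function-call expression into its AST node."""
    tree = ast.parse(source, mode="eval")
    node = tree.body
    if not isinstance(node, ast.Call):
        raise ValueError("expected a single function call")
    return node

def calls_match(predicted: str, expected: str) -> bool:
    """AST-style match: same function name, same positional arguments,
    and same keyword arguments (order-insensitive)."""
    p, e = parse_call(predicted), parse_call(expected)
    same_name = ast.unparse(p.func) == ast.unparse(e.func)
    same_args = [ast.unparse(a) for a in p.args] == [ast.unparse(a) for a in e.args]
    same_kwargs = (
        {k.arg: ast.unparse(k.value) for k in p.keywords}
        == {k.arg: ast.unparse(k.value) for k in e.keywords}
    )
    return same_name and same_args and same_kwargs

# Keyword order and spacing do not affect the match:
print(calls_match(
    "get_weather(city='Berkeley', unit='celsius')",
    "get_weather(unit = 'celsius', city = 'Berkeley')",
))  # → True
```

Comparing unparsed ASTs rather than raw strings is what lets this style of check tolerate formatting differences while still catching a wrong function name or argument value.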
Progress Over Time
[Interactive timeline showing model performance evolution on BFCL v2; legend: state-of-the-art frontier, open, proprietary. Chart not reproduced in text.]
BFCL v2 Leaderboard
5 models
| Rank | Model | Params | Context | Cost (input / output) | License |
|---|---|---|---|---|---|
| 1 | Llama 3.3 70B Instruct (Meta) | 70B | 128K | $0.20 / $0.20 | — |
| 2 | — | 253B | — | — | — |
| 3 | — | 50B | — | — | — |
| 4 | — | 3B | 128K | $0.01 / $0.02 | — |
| 5 | — | 8B | — | — | — |
FAQ
Common questions about BFCL v2
The BFCL v2 paper is available at https://arxiv.org/abs/2305.15334. This paper provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.
The BFCL v2 leaderboard ranks 5 AI models based on their performance on this benchmark. Currently, Llama 3.3 70B Instruct by Meta leads with a score of 0.773. The average score across all models is 0.711.
The highest BFCL v2 score is 0.773, achieved by Llama 3.3 70B Instruct from Meta.
5 models have been evaluated on the BFCL v2 benchmark, with 0 verified results and 5 self-reported results.
BFCL v2 is categorized under general, reasoning, and tool calling. The benchmark evaluates text models with multilingual support.