BFCL v2

BFCL v2 Leaderboard

5 models

Rank  Parameters
1     70B
2     253B
3     50B
4     3B
5     8B
About this benchmark

What is BFCL v2?

Berkeley Function Calling Leaderboard (BFCL) v2 is a benchmark for evaluating the function calling capabilities of large language models. It features 2,251 question-function-answer pairs built from enterprise and OSS-contributed functions, and it addresses data contamination and bias by drawing on live, user-contributed scenarios. The benchmark evaluates AST accuracy, executable accuracy, irrelevance detection, and relevance detection across multiple programming languages (Python, Java, JavaScript), and it includes complex real-world function calling scenarios with multilingual prompts.
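To make "AST accuracy" and "irrelevance detection" more concrete, here is a minimal Python sketch of how such checks could work. It is an illustration under stated assumptions, not BFCL's actual checker: the helper names (`is_call`, `parse_call`, `ast_match`), the `get_weather` function, and the "list of acceptable values per parameter" format are all hypothetical, and the sketch only handles keyword arguments with literal values.

```python
import ast


def is_call(text: str) -> bool:
    """Return True if the text parses as a single Python function call."""
    try:
        return isinstance(ast.parse(text.strip(), mode="eval").body, ast.Call)
    except (SyntaxError, ValueError):
        return False


def parse_call(text: str):
    """Parse a call such as f(a=1, b="x") into (function_name, kwargs).

    Simplification for this sketch: only keyword arguments with literal
    values are accepted; anything else raises ValueError.
    """
    if not is_call(text):
        raise ValueError("not a function call")
    node = ast.parse(text.strip(), mode="eval").body
    if node.args:
        raise ValueError("positional arguments are not handled in this sketch")
    kwargs = {}
    for kw in node.keywords:
        if kw.arg is None:
            raise ValueError("**kwargs expansion is not handled in this sketch")
        kwargs[kw.arg] = ast.literal_eval(kw.value)  # literals only
    return ast.unparse(node.func), kwargs


def ast_match(model_output: str, expected_name: str, allowed: dict, required=()) -> bool:
    """Check structure instead of comparing raw strings.

    `allowed` maps each parameter name to a list of acceptable values
    (a hypothetical ground-truth format chosen for this example).
    """
    try:
        name, kwargs = parse_call(model_output)
    except ValueError:
        return False
    if name != expected_name:
        return False
    if any(param not in kwargs for param in required):
        return False
    # Every supplied argument must be a known parameter with an accepted value.
    return all(
        param in allowed and value in allowed[param]
        for param, value in kwargs.items()
    )


# Hypothetical ground truth for a weather lookup function.
truth = {"city": ["Berlin"], "unit": ["celsius", "C"]}

print(ast_match('get_weather(unit="C", city="Berlin")', "get_weather", truth, required=("city",)))
print(ast_match('get_weather(city="Paris")', "get_weather", truth, required=("city",)))
# Irrelevance detection: a refusal in plain prose is not a call.
print(is_call("I cannot help with that."))
```

Comparing parsed structure rather than raw strings means argument order and whitespace do not affect the result. In this sketch, an irrelevance test case would count as correct when `is_call` returns False, that is, when the model declines to emit a call because none of the provided functions fits the request.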

BFCL v2 is a text benchmark categorized under reasoning, general, and tool calling tasks. LLM Stats tracks 5 models on this benchmark, scored on a 0–1 scale. The current average is 0.711, with the leader at 0.773.

Current leaders

Llama 3.3 70B Instruct from Meta currently leads the BFCL v2 leaderboard with a score of 0.773 across 5 evaluated AI models.

Source paper

Title: Gorilla: Large Language Model Connected with Massive APIs
Authors: Shishir G. Patil, Tianjun Zhang, Xin Wang, Joseph E. Gonzalez

Abstract

Large Language Models (LLMs) have seen an impressive wave of advances recently, with models now excelling in a variety of tasks, such as mathematical reasoning and program synthesis. However, their potential to effectively use tools via API calls remains unfulfilled. This is a challenging task even for today's state-of-the-art LLMs such as GPT-4, largely due to their inability to generate accurate input arguments and their tendency to hallucinate the wrong usage of an API call. We release Gorilla, a finetuned LLaMA-based model that surpasses the performance of GPT-4 on writing API calls. When combined with a document retriever, Gorilla demonstrates a strong capability to adapt to test-time document changes, enabling flexible user updates or version changes. It also substantially mitigates the issue of hallucination, commonly encountered when prompting LLMs directly. To evaluate the model's ability, we introduce APIBench, a comprehensive dataset consisting of HuggingFace, TorchHub, and TensorHub APIs. The successful integration of the retrieval system with Gorilla demonstrates the potential for LLMs to use tools more accurately, keep up with frequently updated documentation, and consequently increase the reliability and applicability of their outputs. Gorilla's code, model, data, and demo are available at https://gorilla.cs.berkeley.edu

FAQ

Common questions about the BFCL v2 benchmark and leaderboard.

What is the BFCL v2 leaderboard?

The BFCL v2 leaderboard ranks 5 AI models based on their performance on this benchmark. Currently, Llama 3.3 70B Instruct by Meta leads with a score of 0.773. The average score across all models is 0.711.

What is the highest BFCL v2 score?

The highest BFCL v2 score is 0.773, achieved by Llama 3.3 70B Instruct from Meta.

How many models are evaluated on BFCL v2?

Five models have been evaluated on the BFCL v2 benchmark. All 5 results are self-reported; none has been independently verified.

Where can I find the BFCL v2 paper?

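The paper linked for BFCL v2 is "Gorilla: Large Language Model Connected with Massive APIs", available at https://arxiv.org/abs/2305.15334. It describes the Gorilla model and the APIBench dataset, including their methodology, dataset construction, and evaluation criteria, and comes from the same Berkeley project that maintains BFCL.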

What categories does BFCL v2 cover?

BFCL v2 is categorized under reasoning, general, and tool calling. The benchmark evaluates text models with multilingual support.

What is the best open-source model on BFCL v2?

Llama 3.3 70B Instruct by Meta is the top-ranked open-source model on BFCL v2, with a score of 0.773 (rank #1).

How recent are the BFCL v2 leaderboard results?

The BFCL v2 leaderboard was last updated in July 2026 and currently includes 5 evaluated models.