BFCL v2
BFCL v2 Leaderboard
| Rank | Size | Context | Cost | License |
|---|---|---|---|---|
| 1 | 70B | — | — | |
| 2 | 253B | — | — | |
| 3 | 50B | — | — | |
| 4 | 3B | — | — | |
| 5 | 8B | — | — | |
What is BFCL v2?
Berkeley Function Calling Leaderboard (BFCL) v2 is a comprehensive benchmark for evaluating the function-calling capabilities of large language models. It features 2,251 question-function-answer pairs with enterprise and OSS-contributed functions, and it addresses data contamination and bias through live, user-contributed scenarios. The benchmark evaluates AST accuracy, executable accuracy, irrelevance detection, and relevance detection across multiple programming languages (Python, Java, JavaScript), and it includes complex real-world function-calling scenarios with multilingual prompts.
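To make the AST accuracy and irrelevance detection categories more concrete, here is a minimal Python sketch of how such checks could work. It is not the BFCL implementation: the function names (`parse_call`, `ast_match`, `irrelevance_correct`), the keyword-only restriction, and the "allowed values per parameter" answer format are all assumptions made for illustration.

```python
import ast


def parse_call(source: str) -> tuple[str, dict[str, object]]:
    """Parse one Python-style call such as f(a=1, b="x") into (name, kwargs).

    Hypothetical helper: only keyword arguments with literal values are handled.
    """
    node = ast.parse(source.strip(), mode="eval").body
    if not isinstance(node, ast.Call):
        raise ValueError("not a function call")
    if node.args:
        raise ValueError("positional arguments are not supported in this sketch")
    name = ast.unparse(node.func)  # also covers dotted names like weather.get
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    return name, kwargs


def ast_match(model_output: str, expected_name: str, allowed: dict[str, list]) -> bool:
    """Return True if the output calls the expected function and every
    argument value is among the values accepted for that parameter."""
    try:
        name, kwargs = parse_call(model_output)
    except (SyntaxError, ValueError):
        return False  # output is not a well-formed call
    if name != expected_name:
        return False
    if set(kwargs) != set(allowed):
        return False  # missing or unexpected parameters
    return all(kwargs[param] in allowed[param] for param in allowed)


def irrelevance_correct(model_output: str) -> bool:
    """For a prompt where no function applies, count the response as correct
    only if it does not parse as a function call (an assumed criterion)."""
    try:
        parse_call(model_output)
    except (SyntaxError, ValueError):
        return True
    return False


if __name__ == "__main__":
    answer = {"city": ["Berlin"], "unit": ["celsius", "C"]}
    print(ast_match('get_weather(city="Berlin", unit="C")', "get_weather", answer))
    print(irrelevance_correct("I cannot help with that request."))
```

Comparing parsed structure rather than raw strings means that differences in whitespace, quoting style, or argument order do not count as errors, which is the main reason to check a call at the AST level.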
BFCL v2 is a text benchmark that evaluates models on reasoning, general, and tool-calling tasks. LLM Stats tracks 5 models on this benchmark, scored on a 0–1 scale. The current average is about 0.7, and the leader scores 0.773.
Current leaders
Among the 5 evaluated models, Llama 3.3 70B Instruct from Meta currently leads the BFCL v2 leaderboard with a score of 0.773.
Source paper
- Title: Gorilla: Large Language Model Connected with Massive APIs
- Authors: Shishir G. Patil, Tianjun Zhang, Xin Wang, Joseph E. Gonzalez
- Published: arXiv:2305.15334
Abstract
Large Language Models (LLMs) have seen an impressive wave of advances recently, with models now excelling in a variety of tasks, such as mathematical reasoning and program synthesis. However, their potential to effectively use tools via API calls remains unfulfilled. This is a challenging task even for today's state-of-the-art LLMs such as GPT-4, largely due to their inability to generate accurate input arguments and their tendency to hallucinate the wrong usage of an API call. We release Gorilla, a finetuned LLaMA-based model that surpasses the performance of GPT-4 on writing API calls. When combined with a document retriever, Gorilla demonstrates a strong capability to adapt to test-time document changes, enabling flexible user updates or version changes. It also substantially mitigates the issue of hallucination, commonly encountered when prompting LLMs directly. To evaluate the model's ability, we introduce APIBench, a comprehensive dataset consisting of HuggingFace, TorchHub, and TensorHub APIs. The successful integration of the retrieval system with Gorilla demonstrates the potential for LLMs to use tools more accurately, keep up with frequently updated documentation, and consequently increase the reliability and applicability of their outputs. Gorilla's code, model, data, and demo are available at https://gorilla.cs.berkeley.edu
FAQ
Common questions about the BFCL v2 benchmark and leaderboard.