BFCL
BFCL Leaderboard
| Rank | Organization | Parameters | Context | Cost | License |
|---|---|---|---|---|---|
| 1 | | 405B | — | — | |
| 2 | | 70B | — | — | |
| 3 | | 8B | — | — | |
| 4 | Amazon | — | 1.0M | $0.33 / $2.75 | |
| 5 | Alibaba Cloud / Qwen Team | 235B | — | — | |
| 6 | Alibaba Cloud / Qwen Team | 33B | 128K | $0.10 / $0.44 | |
| 7 | Alibaba Cloud / Qwen Team | 31B | 128K | $0.10 / $0.44 | |
| 8 | Amazon | — | — | — | |
| 9 | Amazon | — | — | — | |
| 10 | Alibaba Cloud / Qwen Team | 33B | — | — | |
| 11 | Amazon | — | — | — | |
What is BFCL?
The Berkeley Function Calling Leaderboard (BFCL) is the first comprehensive and executable function call evaluation dedicated to assessing Large Language Models' ability to invoke functions. It evaluates serial and parallel function calls across multiple programming languages (Python, Java, JavaScript, REST API) using a novel Abstract Syntax Tree (AST) evaluation method. The benchmark consists of over 2,000 question-function-answer pairs covering diverse application domains and complex use cases including multiple function calls, parallel function calls, and multi-turn interactions.
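To make the AST evaluation idea concrete, here is a minimal Python sketch of how a model's function call could be checked structurally rather than by string comparison. It is an illustration under stated assumptions, not BFCL's actual checker: the helper names `parse_call` and `matches`, the `get_weather` function, and the "list of acceptable values per parameter" format are all hypothetical, and the sketch only handles Python-style calls with literal keyword arguments.

```python
import ast


def parse_call(source: str) -> tuple[str, dict]:
    """Parse one Python-style call such as f(a=1) into (name, kwargs)."""
    node = ast.parse(source.strip(), mode="eval").body
    if not isinstance(node, ast.Call):
        raise ValueError("expected a single function call")
    if node.args:
        raise ValueError("this sketch only handles keyword arguments")
    kwargs = {}
    for kw in node.keywords:
        if kw.arg is None:  # a **mapping argument, not supported here
            raise ValueError("unpacked keyword arguments are not supported")
        # literal_eval accepts an AST node and rejects non-literal values
        kwargs[kw.arg] = ast.literal_eval(kw.value)
    name = ast.unparse(node.func)  # also covers dotted names like math.sqrt
    return name, kwargs


def matches(model_output: str, expected_name: str, allowed: dict) -> bool:
    """Check a call against an expected name and allowed values.

    `allowed` maps each parameter name to a list of acceptable values
    (a hypothetical format chosen for this sketch).
    """
    try:
        name, kwargs = parse_call(model_output)
    except (SyntaxError, ValueError):
        return False  # malformed or unsupported output counts as a miss
    if name != expected_name or set(kwargs) != set(allowed):
        return False
    return all(kwargs[key] in allowed[key] for key in allowed)


# Hypothetical test case: argument order differs, but the structure matches.
allowed = {"city": ["Berkeley", "Berkeley, CA"], "unit": ["celsius"]}
print(matches('get_weather(unit="celsius", city="Berkeley")', "get_weather", allowed))
print(matches('get_weather(city="Oakland", unit="celsius")', "get_weather", allowed))
```

Because the comparison works on the parsed tree, differences in whitespace, quoting style, or argument order do not affect the result, while a wrong function name, a missing parameter, or an unacceptable value does. A fuller checker would also need to handle optional parameters, positional arguments, and type coercion.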
BFCL is a text benchmark that evaluates models on reasoning, general, and tool-calling tasks. LLM Stats tracks 11 models on this benchmark, each scored on a 0–1 scale. The current average score is about 0.7, and the leader scores 0.885.
Current leaders
Llama 3.1 405B Instruct from Meta currently leads the BFCL leaderboard with a score of 0.885, the highest among the 11 evaluated AI models.