BFCL_v3_MultiTurn

The Berkeley Function Calling Leaderboard (BFCL) V3 MultiTurn benchmark evaluates large language models' ability to handle multi-turn and multi-step function calling scenarios. It introduces complex interactions that require models to manage sequential function calls, maintain conversational context across multiple turns, and decide dynamically when and how to use the available functions. BFCL V3 uses state-based evaluation: it verifies the actual state of the API systems after function execution, which gives a more realistic assessment of function calling capabilities in agentic applications.
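The idea behind state-based evaluation can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: ToyFileSystem, run_turns, and state_matches are hypothetical names, not part of the BFCL harness, and the sketch compares only the final backend state rather than reproducing the benchmark's actual checks. It shows that two different call sequences can both pass as long as they leave the backend in the same state.

```python
from dataclasses import dataclass, field


@dataclass
class ToyFileSystem:
    """Hypothetical stand-in for a stateful API backend (not a BFCL class)."""
    files: dict[str, str] = field(default_factory=dict)

    def touch(self, name: str) -> None:
        # Create an empty file if it does not exist yet.
        self.files.setdefault(name, "")

    def write(self, name: str, text: str) -> None:
        # Overwrite the content of an existing file.
        if name not in self.files:
            raise FileNotFoundError(name)
        self.files[name] = text

    def mv(self, src: str, dst: str) -> None:
        # Rename a file, keeping its content.
        self.files[dst] = self.files.pop(src)


def run_turns(turns):
    """Execute every call of every turn, in order, on a fresh backend."""
    backend = ToyFileSystem()
    for turn in turns:
        for func_name, kwargs in turn:
            getattr(backend, func_name)(**kwargs)
    return backend


def state_matches(model_turns, reference_turns) -> bool:
    """Compare the resulting backend states, not the call sequences."""
    return run_turns(model_turns) == run_turns(reference_turns)


# Two turns: create and fill a file, then rename it.
reference = [
    [("touch", {"name": "a.txt"}), ("write", {"name": "a.txt", "text": "hi"})],
    [("mv", {"src": "a.txt", "dst": "b.txt"})],
]

# Takes an extra step in turn 1 but ends in the same state.
good_model = [
    [
        ("touch", {"name": "a.txt"}),
        ("write", {"name": "a.txt", "text": "draft"}),
        ("write", {"name": "a.txt", "text": "hi"}),
    ],
    [("mv", {"src": "a.txt", "dst": "b.txt"})],
]

# Skips the rename in turn 2, so the final state differs.
bad_model = [
    [("touch", {"name": "a.txt"}), ("write", {"name": "a.txt", "text": "hi"})],
    [],
]

print(state_matches(good_model, reference))
print(state_matches(bad_model, reference))
```

Comparing states instead of call lists means a model is not penalized for taking a different but equivalent route, which is one plausible reason to prefer this style of check for agentic tasks.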

BFCL_v3_MultiTurn Leaderboard

2 models
Rank  Size  Context  Cost           License
1     230B  1.0M     $0.30 / $1.20
2     9B
FAQ

Common questions about BFCL_v3_MultiTurn

The BFCL_v3_MultiTurn paper is available at https://openreview.net/forum?id=2GmDdhBdDk. This paper provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.
The BFCL_v3_MultiTurn leaderboard currently ranks 2 models by their performance on this benchmark. MiniMax M2.5 by MiniMax leads with the highest score, 0.768. The average score across both models is 0.719.
2 models have been evaluated on the BFCL_v3_MultiTurn benchmark. Both results are self-reported; neither is verified.
BFCL_v3_MultiTurn is categorized under tool calling, general, and reasoning. The benchmark evaluates text models.