BFCL_v3_MultiTurn

BFCL_v3_MultiTurn Leaderboard

2 models
Rank  Parameters  Context  Cost (input / output)  License
1     230B        1.0M     $0.30 / $1.20
2     9B
About this benchmark

What is BFCL_v3_MultiTurn?

The Berkeley Function Calling Leaderboard (BFCL) V3 MultiTurn benchmark evaluates large language models' ability to handle multi-turn and multi-step function calling scenarios. These scenarios require models to manage sequential function calls, maintain conversational context across multiple turns, and decide dynamically when and how to use the available functions. BFCL V3 uses state-based evaluation: it verifies the actual state of the API systems after function execution, which gives a more realistic assessment of function calling capabilities in agentic applications.
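The idea of state-based evaluation can be illustrated with a toy example. The sketch below is not BFCL's evaluation code: `ToyFileSystem`, `run_calls`, and `state_matches` are hypothetical names, and comparing only the final state is a simplifying assumption. It shows why a model can pass with a different call sequence than the reference, as long as the backend ends up in the same state.

```python
from dataclasses import dataclass, field


@dataclass
class ToyFileSystem:
    """Hypothetical stateful API standing in for a backend the model can call."""
    files: dict[str, str] = field(default_factory=dict)

    def write_file(self, name: str, content: str) -> None:
        self.files[name] = content

    def delete_file(self, name: str) -> None:
        self.files.pop(name, None)


def run_calls(turns: list[list[tuple[str, dict]]]) -> ToyFileSystem:
    """Execute every call of every turn, in order, on a fresh backend."""
    backend = ToyFileSystem()
    for turn in turns:
        for func_name, kwargs in turn:
            getattr(backend, func_name)(**kwargs)
    return backend


def state_matches(model_turns, reference_turns) -> bool:
    """Pass if the model's calls leave the backend in the reference end state."""
    return run_calls(model_turns) == run_calls(reference_turns)


reference = [
    [("write_file", {"name": "notes.txt", "content": "draft"})],
    [("write_file", {"name": "notes.txt", "content": "final"})],
]
# A different call sequence that reaches the same end state.
model_ok = [
    [("write_file", {"name": "notes.txt", "content": "final"})],
    [],
]
# Leaves an extra file behind, so the end state differs.
model_bad = reference + [[("write_file", {"name": "tmp.txt", "content": ""})]]

print(state_matches(model_ok, reference))   # True
print(state_matches(model_bad, reference))  # False
```

Comparing states rather than call strings is what lets equivalent but differently ordered or differently phrased call sequences count as correct.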

BFCL_v3_MultiTurn is a text benchmark categorized under reasoning, general, and tool calling. LLM Stats tracks 2 models on this benchmark, scored on a 0–1 scale. The current average is 0.719, with the leader at 0.768.

Current leaders

MiniMax M2.5 from MiniMax currently leads the BFCL_v3_MultiTurn leaderboard with a score of 0.768 across 2 evaluated AI models.

1. MiniMax M2.5 (MiniMax): 76.8%

FAQ

Common questions about the BFCL_v3_MultiTurn benchmark and leaderboard.

What is the BFCL_v3_MultiTurn leaderboard?

The BFCL_v3_MultiTurn leaderboard ranks 2 AI models based on their performance on this benchmark. Currently, MiniMax M2.5 by MiniMax leads with a score of 0.768. The average score across all models is 0.719.
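With only 2 ranked models, the leader's score and the average together imply the other model's score. The check below assumes the average is a plain arithmetic mean of the two scores; the page does not state how it is computed, so treat the result as an estimate.

```python
leader = 0.768    # top score reported above
average = 0.719   # average reported above
n_models = 2

# With a plain mean, the scores sum to average * n_models.
implied_second = average * n_models - leader
print(round(implied_second, 3))  # 0.67
```

Because the published figures are rounded to three decimals, the implied second-place score is approximate.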

What is the highest BFCL_v3_MultiTurn score?

The highest BFCL_v3_MultiTurn score is 0.768, achieved by MiniMax M2.5 from MiniMax.

How many models are evaluated on BFCL_v3_MultiTurn?

2 models have been evaluated on the BFCL_v3_MultiTurn benchmark, with 0 verified results and 2 self-reported results.

Where can I find the BFCL_v3_MultiTurn paper?

The BFCL_v3_MultiTurn paper is available at https://openreview.net/forum?id=2GmDdhBdDk. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does BFCL_v3_MultiTurn cover?

BFCL_v3_MultiTurn is categorized under reasoning, general, and tool calling. The benchmark evaluates text models.

What is the best open-source model on BFCL_v3_MultiTurn?

MiniMax M2.5 by MiniMax is the top-ranked open-source model on BFCL_v3_MultiTurn, with a score of 0.768 (rank #1).

Which model offers the best value on BFCL_v3_MultiTurn?

Among models scoring within 10% of the leader, MiniMax M2.5 from MiniMax is the cheapest, at $0.30 per million input tokens with a score of 0.768.
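The selection rule described here can be sketched as follows. This is an illustration, not the site's actual logic: `Entry` and `best_value` are hypothetical names, "within 10%" is assumed to mean relative to the leader's score, and the second entry is made-up data.

```python
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    model: str
    score: float                   # 0-1 scale
    input_price: Optional[float]   # USD per million input tokens; None if unlisted


def best_value(entries: list[Entry], tolerance: float = 0.10) -> Optional[Entry]:
    """Cheapest priced entry whose score is within `tolerance` of the leader."""
    if not entries:
        return None
    top = max(e.score for e in entries)
    # Assumption: the band is relative, i.e. score >= 90% of the top score.
    eligible = [
        e for e in entries
        if e.input_price is not None and e.score >= top * (1 - tolerance)
    ]
    return min(eligible, key=lambda e: e.input_price, default=None)


entries = [
    Entry("MiniMax M2.5", 0.768, 0.30),          # figures from this page
    Entry("hypothetical-model-b", 0.650, 0.10),  # made-up, illustration only
]
print(best_value(entries).model)  # MiniMax M2.5
```

The cheaper hypothetical entry is excluded because 0.650 falls below the cutoff of 0.768 × 0.9 = 0.6912, so price only breaks ties among models that are close to the leader.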

How recent are the BFCL_v3_MultiTurn leaderboard results?

The BFCL_v3_MultiTurn leaderboard was last updated in July 2026 and currently includes 2 evaluated models.