BankerToolBench

BankerToolBench is a public benchmark that evaluates models on banking and finance tool-use tasks. Models are scored against dataset rubrics, measuring their ability to correctly invoke tools and complete multi-step financial workflows.

MiniMax M3 from MiniMax currently leads the BankerToolBench leaderboard with a score of 0.761. It is the only model evaluated so far.

About this benchmark

What BankerToolBench measures

BankerToolBench is a text benchmark that evaluates large language models on finance, tool-calling, and agent tasks. LLM Stats tracks 1 model on this benchmark, with a maximum possible score of 1. Because only one model is reported, the current average and the leading score are the same: 0.761.

Compare leaders on the best AI for finance, best AI for tool calling and best AI for agents leaderboards.

MiniMax M3 (MiniMax) leads with 76.1%.

Progress Over Time

[Interactive timeline: model performance on BankerToolBench over time. Legend: state-of-the-art frontier, open models, proprietary models.]

BankerToolBench Leaderboard

1 model

Rank  Model                 Context  Cost            License
1     MiniMax M3 (MiniMax)  1.0M     $0.60 / $2.40

FAQ

Common questions about BankerToolBench.

What is the BankerToolBench benchmark?

BankerToolBench is a public benchmark that evaluates models on banking and finance tool-use tasks. Models are scored against dataset rubrics, measuring their ability to correctly invoke tools and complete multi-step financial workflows.

What is the BankerToolBench leaderboard?

The BankerToolBench leaderboard ranks AI models by their performance on this benchmark and currently lists 1 model. MiniMax M3 by MiniMax leads with a score of 0.761. With a single entry, the average score across all models is also 0.761.

What is the highest BankerToolBench score?

The highest BankerToolBench score is 0.761, achieved by MiniMax M3 from MiniMax.

How many models are evaluated on BankerToolBench?

1 model has been evaluated on the BankerToolBench benchmark, with 0 verified results and 1 self-reported result.

What categories does BankerToolBench cover?

BankerToolBench is categorized under finance, tool calling, and agents. The benchmark evaluates text models.

What is the best open-source model on BankerToolBench?

MiniMax M3 by MiniMax is the top-ranked open-source model on BankerToolBench, with a score of 0.761 (rank #1).

Which model offers the best value on BankerToolBench?

Among models scoring within 10% of the leader, MiniMax M3 from MiniMax is the cheapest, at $0.60 per million input tokens with a score of 0.761.
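
The selection rule described above can be sketched in a few lines of Python. This is an illustration, not the site's actual implementation: the names Entry and best_value are hypothetical, and "within 10% of the leader" is assumed to mean a score of at least 90% of the top score (a relative cutoff, not an absolute one).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Entry:
    """One leaderboard row (hypothetical structure for illustration)."""
    name: str
    score: float        # benchmark score in [0, 1]
    input_price: float  # USD per million input tokens


def best_value(entries: List[Entry], tolerance: float = 0.10) -> Optional[Entry]:
    """Return the cheapest entry whose score is within `tolerance` of the leader.

    Assumption: "within 10%" is read as score >= leader_score * (1 - tolerance).
    """
    if not entries:
        return None
    top_score = max(e.score for e in entries)
    cutoff = top_score * (1 - tolerance)
    eligible = [e for e in entries if e.score >= cutoff]
    # The leader always satisfies the cutoff, so `eligible` is never empty here.
    return min(eligible, key=lambda e: e.input_price)


# The single row currently on the leaderboard, using the figures quoted above.
leaderboard = [Entry("MiniMax M3", 0.761, 0.60)]
print(best_value(leaderboard).name)
```

With only one model on the leaderboard, the rule trivially selects that model; the cutoff only starts to matter once more entries are added.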

How recent are the BankerToolBench leaderboard results?

The BankerToolBench leaderboard was last updated in June 2026 and currently includes 1 evaluated model.

More evaluations to explore

Related benchmarks in the same category

MMLU-Pro

A more robust and challenging multi-task language understanding benchmark that extends MMLU by expanding multiple-choice options from 4 to 10, eliminating trivial questions, and focusing on reasoning-intensive tasks. Features over 12,000 curated questions across 14 domains and causes a 16-33% accuracy drop compared to original MMLU.

finance
124 models
MMLU

The Massive Multitask Language Understanding benchmark tests knowledge across 57 diverse subjects, including STEM, humanities, social sciences, and professional domains.

finance
99 models
BrowseComp

BrowseComp is a benchmark comprising 1,266 questions that challenge AI agents to persistently navigate the internet in search of hard-to-find, entangled information. The benchmark measures agents' ability to persist in information gathering, navigate the web creatively, and find concise, verifiable answers. Despite the difficulty of the questions, BrowseComp is simple to use, because predicted answers are short and easily checked against reference answers.

agents
48 models
Terminal-Bench 2.0

Terminal-Bench 2.0 is an updated benchmark for testing AI agents' tool use ability to operate a computer via terminal. It evaluates how well models can handle real-world, end-to-end tasks autonomously, including compiling code, training models, setting up servers, system administration, security tasks, data science workflows, and cybersecurity vulnerabilities.

tool calling
44 models
Tau2 Telecom

τ²-Bench telecom domain evaluates conversational agents in a dual-control environment modeled as a Dec-POMDP, where both agent and user use tools in shared telecommunications troubleshooting scenarios that test coordination and communication capabilities.

tool calling
34 models
SuperGPQA

SuperGPQA is a comprehensive benchmark that evaluates large language models across 285 graduate-level academic disciplines. The benchmark contains 25,957 questions covering 13 broad disciplinary areas including Engineering, Medicine, Science, and Law, with specialized fields in light industry, agriculture, and service-oriented domains. It employs a Human-LLM collaborative filtering mechanism with over 80 expert annotators to create challenging questions that assess graduate-level knowledge and reasoning capabilities.

finance
31 models