API-Bank

Progress Over Time

[Interactive timeline: model performance on API-Bank over time, with legend entries for the state-of-the-art frontier, open models, and proprietary models.]

API-Bank Leaderboard

3 models

Rank  Parameters
1     405B
2     70B
3     8B

About this benchmark

What is API-Bank?

API-Bank is a comprehensive benchmark for tool-augmented LLMs that evaluates API planning, retrieval, and calling capabilities. It contains 314 tool-use dialogues with 753 API calls across 73 API tools, and is designed to assess how effectively LLMs can use external tools and which obstacles they must overcome to do so.
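To make the idea of scoring API calls concrete, here is a minimal sketch of how predicted calls could be compared against annotated reference calls. It assumes exact-match scoring on the API name and arguments, which is an illustrative simplification and not the benchmark's actual evaluation code. The function names, the dictionary layout, and the tool names `search_tools` and `get_weather` are all hypothetical.

```python
def call_matches(predicted: dict, reference: dict) -> bool:
    """Return True when the API name and all arguments match exactly (assumed criterion)."""
    return (
        predicted.get("name") == reference.get("name")
        and predicted.get("args") == reference.get("args")
    )


def api_call_accuracy(predicted: list, reference: list) -> float:
    """Fraction of reference calls reproduced exactly, compared position by position.

    Missing predictions count as wrong because the denominator is the number
    of reference calls. Extra predictions beyond the reference are ignored.
    """
    if not reference:
        raise ValueError("reference must contain at least one API call")
    correct = sum(call_matches(p, r) for p, r in zip(predicted, reference))
    return correct / len(reference)


# Hypothetical annotated dialogue with two reference calls.
reference_calls = [
    {"name": "search_tools", "args": {"keywords": "weather"}},
    {"name": "get_weather", "args": {"city": "Paris"}},
]

# Hypothetical model output: the second call uses the wrong argument value.
predicted_calls = [
    {"name": "search_tools", "args": {"keywords": "weather"}},
    {"name": "get_weather", "args": {"city": "London"}},
]

accuracy = api_call_accuracy(predicted_calls, reference_calls)
print(accuracy)  # 0.5
```

Exact matching is the strictest possible choice. A real evaluation harness might execute the calls and compare results instead, so treat this only as a way to picture what a per-call score on a 0–1 scale means.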

API-Bank is a text benchmark that evaluates models on reasoning and tool-calling tasks. LLM Stats tracks 3 models on this benchmark, scored on a 0–1 scale. The current average is 0.882, and the leader scores 0.920.

Current leaders

Of the 3 evaluated AI models, Llama 3.1 405B Instruct from Meta currently leads the API-Bank leaderboard with a score of 0.920.

Source paper

Title
API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs
Authors
Minghao Li, Yingxiu Zhao, Bowen Yu, Feifan Song, and 5 others
Published
Abstract

Recent research has demonstrated that Large Language Models (LLMs) can enhance their capabilities by utilizing external tools. However, three pivotal questions remain unanswered: (1) How effective are current LLMs in utilizing tools? (2) How can we enhance LLMs' ability to utilize tools? (3) What obstacles need to be overcome to leverage tools? To address these questions, we introduce API-Bank, a groundbreaking benchmark, specifically designed for tool-augmented LLMs. For the first question, we develop a runnable evaluation system consisting of 73 API tools. We annotate 314 tool-use dialogues with 753 API calls to assess the existing LLMs' capabilities in planning, retrieving, and calling APIs. For the second question, we construct a comprehensive training set containing 1,888 tool-use dialogues from 2,138 APIs spanning 1,000 distinct domains. Using this dataset, we train Lynx, a tool-augmented LLM initialized from Alpaca. Experimental results demonstrate that GPT-3.5 exhibits improved tool utilization compared to GPT-3, while GPT-4 excels in planning. However, there is still significant potential for further improvement. Moreover, Lynx surpasses Alpaca's tool utilization performance by more than 26 pts and approaches the effectiveness of GPT-3.5. Through error analysis, we highlight the key challenges for future research in this field to answer the third question.

FAQ

Common questions about the API-Bank benchmark and leaderboard.

What is the API-Bank benchmark?

API-Bank is a comprehensive benchmark for tool-augmented LLMs that evaluates API planning, retrieval, and calling capabilities. It contains 314 tool-use dialogues with 753 API calls across 73 API tools, and is designed to assess how effectively LLMs can use external tools and which obstacles they must overcome to do so.

What is the API-Bank leaderboard?

The API-Bank leaderboard ranks 3 AI models based on their performance on this benchmark. Currently, Llama 3.1 405B Instruct by Meta leads with a score of 0.920. The average score across all models is 0.882.
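The leader's score and the three-model average together imply the mean score of the other two models. The short calculation below is a sketch that uses only the figures quoted above. Because the published average is rounded to three decimals, the result is approximate. The variable names are illustrative.

```python
# Figures quoted on the leaderboard.
leader_score = 0.920
average_score = 0.882
n_models = 3

# Sum of all scores, minus the leader, leaves the combined score of the rest.
remaining_total = average_score * n_models - leader_score

# Mean score of the two models ranked below the leader.
remaining_mean = remaining_total / (n_models - 1)

print(round(remaining_mean, 3))  # 0.863
```

So the two lower-ranked models average roughly 0.863, about 0.06 below the leader.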

What is the highest API-Bank score?

The highest API-Bank score is 0.920, achieved by Llama 3.1 405B Instruct from Meta.

How many models are evaluated on API-Bank?

Three models have been evaluated on the API-Bank benchmark. All 3 results are self-reported, and none are verified.

Where can I find the API-Bank paper?

The API-Bank paper is available at https://arxiv.org/abs/2304.08244. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does API-Bank cover?

API-Bank is categorized under reasoning and tool calling. The benchmark evaluates text models.

What is the best open-source model on API-Bank?

Llama 3.1 405B Instruct by Meta is the top-ranked open-source model on API-Bank, with a score of 0.920 (rank #1).

How recent are the API-Bank leaderboard results?

The API-Bank leaderboard was last updated in July 2026 and currently includes 3 evaluated models.