API-Bank
A comprehensive benchmark for tool-augmented LLMs that evaluates API planning, retrieval, and calling capabilities. It contains 314 tool-use dialogues with 753 API calls across 73 APIs, and is designed to assess how effectively LLMs can plan for, retrieve, and call external tools.
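To make the three evaluated abilities concrete, below is a minimal sketch of how a "call" check might score a model's proposed API call against a gold annotation. The dict schema, the AddAlarm example, and the exact-match rule are illustrative assumptions, not API-Bank's actual data format or official harness.

```python
# Illustrative sketch of an API-Bank-style "call" check. The response schema,
# the AddAlarm example, and the exact-match rule are assumptions made for
# illustration; the official harness and data format differ in detail.

def call_matches(pred: dict, gold: dict) -> bool:
    """True iff the predicted call names the same API and passes the same
    arguments as the gold annotation (order-insensitive dict comparison)."""
    return (
        pred.get("api_name") == gold.get("api_name")
        and pred.get("arguments", {}) == gold.get("arguments", {})
    )

# Hypothetical dialogue turn: the user asked for an 8 a.m. alarm, and the
# gold annotation records the API call the model should produce.
gold = {"api_name": "AddAlarm", "arguments": {"time": "2024-01-01 08:00:00"}}
pred = {"api_name": "AddAlarm", "arguments": {"time": "2024-01-01 08:00:00"}}

print(call_matches(pred, gold))  # True: this call would count as correct
```

Aggregated over the benchmark's 753 annotated calls, a check along these lines yields the kind of per-call accuracy that a leaderboard score summarizes.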
Progress Over Time
[Interactive timeline: model performance evolution on API-Bank, plotting the state-of-the-art frontier and distinguishing open from proprietary models.]
API-Bank Leaderboard
3 models
| Rank | Model | Context | Cost (input / output) | License |
|---|---|---|---|---|
| 1 | Llama 3.1 405B Instruct | 128K | $0.89 / $0.89 | Open |
| 2 | Llama 3.1 70B Instruct | 128K | $0.20 / $0.20 | Open |
| 3 | Llama 3.1 8B Instruct | 131K | $0.03 / $0.03 | Open |
FAQ
Common questions about API-Bank
What is API-Bank?
API-Bank is a comprehensive benchmark for tool-augmented LLMs that evaluates API planning, retrieval, and calling capabilities. It contains 314 tool-use dialogues with 753 API calls across 73 APIs, designed to assess how effectively LLMs can use external tools.

Where can I find the API-Bank paper?
The API-Bank paper is available at https://arxiv.org/abs/2304.08244. It details the benchmark's methodology, dataset construction, and evaluation criteria.

How are models ranked on the API-Bank leaderboard?
The leaderboard ranks 3 models by their performance on the benchmark. Llama 3.1 405B Instruct by Meta currently leads with a score of 0.920, and the average score across all models is 0.882.

What is the highest API-Bank score?
The highest score is 0.920, achieved by Llama 3.1 405B Instruct from Meta.

How many models have been evaluated on API-Bank?
3 models have been evaluated on the benchmark: 0 with verified results and 3 with self-reported results.

What categories does API-Bank fall under?
API-Bank is categorized under tool calling and reasoning, and it evaluates text models.