ACEBench

ACEBench is a comprehensive benchmark for evaluating the tool-usage capabilities of Large Language Models across three primary evaluation types: Normal (basic tool usage), Special (tool usage under ambiguous or incomplete instructions), and Agent (multi-agent interactions simulating real-world dialogues). The benchmark covers 4,538 APIs across 8 major domains and 68 sub-domains, including technology, finance, entertainment, society, health, culture, and environment, and supports both English and Chinese.
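To make the three evaluation types concrete, here is a minimal sketch of what a single Normal-category test case might look like as data. Every field name and value below is an illustrative assumption, not the actual ACEBench schema; see the repository for the real format.

```python
# A hypothetical "Normal" test case: one question, the tools the model may
# call, and the call it is expected to produce. Field names are assumed
# for illustration only and do not reflect the actual ACEBench schema.
import json

sample_case = {
    "category": "Normal",   # the benchmark's other types: Special, Agent
    "language": "en",       # ACEBench covers English and Chinese
    "question": "What will the weather be in Beijing tomorrow?",
    "tools": [
        {
            "name": "get_weather",
            "description": "Query the weather forecast for a city.",
            "parameters": {
                "city": {"type": "string"},
                "date": {"type": "string"},
            },
        }
    ],
    # Ground-truth call used to score the model's output.
    "expected_call": {
        "name": "get_weather",
        "arguments": {"city": "Beijing", "date": "2025-01-22"},
    },
}

print(json.dumps(sample_case, indent=2, ensure_ascii=False))
```

In this framing, a Special case would withhold or blur a required argument (for example, omitting the date) to test how the model handles incomplete instructions, and an Agent case would embed the same tools in a simulated multi-agent dialogue.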

Paper: https://arxiv.org/abs/2501.12851 · Implementation: https://github.com/ACEBench/ACEBench

Progress Over Time

[Interactive timeline showing model performance evolution on ACEBench, with a state-of-the-art frontier line and markers distinguishing open and proprietary models]

ACEBench Leaderboard

2 models

Rank | Model | Organization | Params | Context | Cost (input / output) | License
1 | Kimi K2 Instruct | Moonshot AI | 1.0T | 200K | $0.50 / $0.50 | n/a
1 | n/a | n/a | 1.0T | n/a | n/a | n/a

FAQ

Common questions about ACEBench

What is ACEBench?
ACEBench is a benchmark for evaluating Large Language Models' tool usage across three evaluation types (Normal, Special, and Agent), covering 4,538 APIs in 8 major domains and 68 sub-domains in both English and Chinese. See the overview at the top of this page for details.

Where can I find the ACEBench paper?
The ACEBench paper is available at https://arxiv.org/abs/2501.12851. It provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.

Where can I find the ACEBench dataset?
The ACEBench dataset is available at https://github.com/ACEBench/ACEBench.
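For readers who want to poke at the data locally, a minimal sketch follows. Cloning the repository uses standard git; the data path and JSON format below are assumptions made for illustration, not the repository's documented layout.

```python
# Hypothetical example of loading ACEBench cases from a local checkout.
# First run: git clone https://github.com/ACEBench/ACEBench
# The path "data/normal_en.json" is an assumed placeholder; check the
# repository README for the actual file layout and format.
import json
from pathlib import Path

data_file = Path("ACEBench") / "data" / "normal_en.json"
with data_file.open(encoding="utf-8") as f:
    cases = json.load(f)

print(f"Loaded {len(cases)} cases; first case keys: {sorted(cases[0])}")
```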
How are models ranked on the ACEBench leaderboard?
The ACEBench leaderboard ranks 2 AI models by their performance on the benchmark. Kimi K2 Instruct by Moonshot AI currently leads with a score of 0.765; the average score across all models is also 0.765.
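A quick check shows these summary statistics are mutually consistent. This is a minimal sketch; the second entry's score is an inference (a two-model average of 0.765 can equal the top score of 0.765 only if both models score 0.765), and its name, which the page does not show, is a placeholder.

```python
# Leaderboard scores. "second entry" is a placeholder for the unnamed
# second model; its 0.765 score is inferred from the stated average.
scores = {"Kimi K2 Instruct": 0.765, "second entry": 0.765}

average = sum(scores.values()) / len(scores)
leader = max(scores, key=scores.get)
print(f"Leader: {leader} ({scores[leader]:.3f}); average across "
      f"{len(scores)} models: {average:.3f}")
```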
What is the highest ACEBench score?
The highest ACEBench score is 0.765, achieved by Kimi K2 Instruct from Moonshot AI.

How many models have been evaluated on ACEBench?
2 models have been evaluated on the ACEBench benchmark, with 0 verified results and 2 self-reported results.

What categories does ACEBench cover?
ACEBench is categorized under tool calling, finance, general, healthcare, and reasoning. The benchmark evaluates text models.