ACEBench
ACEBench is a comprehensive benchmark for evaluating the tool-use capabilities of Large Language Models across three primary evaluation types: Normal (basic tool-use scenarios), Special (tool use under ambiguous or incomplete instructions), and Agent (multi-agent interactions simulating real-world dialogues). The benchmark covers 4,538 APIs spanning 8 major domains and 68 sub-domains, including technology, finance, entertainment, society, health, culture, and environment, in both English and Chinese.
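As a rough illustration of what a Normal-type evaluation might involve (this is not ACEBench's actual data schema; the field names, API, and scoring function below are invented for the sketch), a test item can pair an API description with an expected tool call, scored by exact matching:

```python
# Hypothetical ACEBench-style item; field names are illustrative, not the real schema.
item = {
    "language": "en",
    "category": "Normal",
    "apis": [{
        "name": "get_stock_price",               # invented example API
        "parameters": {"symbol": {"type": "string"}},
    }],
    "question": "What is the current price of AAPL?",
    "expected_call": {"name": "get_stock_price", "arguments": {"symbol": "AAPL"}},
}

def score_call(predicted: dict, expected: dict) -> bool:
    """Exact-match scoring sketch: tool name and arguments must both match."""
    return (predicted.get("name") == expected["name"]
            and predicted.get("arguments") == expected["arguments"])

prediction = {"name": "get_stock_price", "arguments": {"symbol": "AAPL"}}
print(score_call(prediction, item["expected_call"]))  # True
```

Real harnesses typically go beyond exact matching (e.g., type-checking arguments or executing the call), but the shape of the task is the same: given APIs and a question, did the model produce the right call?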
Progress Over Time
*Interactive timeline showing model performance evolution on ACEBench (state-of-the-art frontier; open vs. proprietary models).*
ACEBench Leaderboard
2 models
| Rank | Model | Organization | Params | Context | Cost | License |
|---|---|---|---|---|---|---|
| 1 | Kimi K2 Instruct | Moonshot AI | 1.0T | 200K | $0.50 / $0.50 | — |
| 1 | — | Moonshot AI | 1.0T | — | — | — |
FAQ
Common questions about ACEBench
**What is ACEBench?**

ACEBench is a comprehensive benchmark for evaluating the tool-use capabilities of Large Language Models across three primary evaluation types: Normal (basic tool-use scenarios), Special (tool use under ambiguous or incomplete instructions), and Agent (multi-agent interactions simulating real-world dialogues). The benchmark covers 4,538 APIs spanning 8 major domains and 68 sub-domains, including technology, finance, entertainment, society, health, culture, and environment, in both English and Chinese.
**Where can I read the ACEBench paper?**

The ACEBench paper is available at https://arxiv.org/abs/2501.12851. It provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.
**Where can I find the ACEBench dataset?**

The ACEBench dataset is available at https://github.com/ACEBench/ACEBench.
**How are models ranked on the ACEBench leaderboard?**

The ACEBench leaderboard ranks 2 AI models by their performance on this benchmark. Currently, Kimi K2 Instruct by Moonshot AI leads with a score of 0.765. The average score across all models is 0.765.
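The reported average follows directly from the per-model scores; since the maximum and the mean are both 0.765 across two models, both scores must equal 0.765 (assumed below). A minimal check:

```python
from statistics import mean

# Both self-reported ACEBench scores are assumed to be 0.765,
# since the reported maximum and mean coincide across 2 models.
scores = [0.765, 0.765]
print(round(mean(scores), 3))  # 0.765
```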
**What is the highest ACEBench score?**

The highest ACEBench score is 0.765, achieved by Kimi K2 Instruct from Moonshot AI.
**How many models have been evaluated on ACEBench?**

2 models have been evaluated on the ACEBench benchmark, with 0 verified results and 2 self-reported results.
**What categories does ACEBench cover?**

ACEBench is categorized under tool calling, finance, general, healthcare, and reasoning. The benchmark evaluates text models.