ACEBench Leaderboard

Progress Over Time

[Interactive timeline: model performance on ACEBench over time, with the state-of-the-art frontier marked and models grouped as open or proprietary.]

ACEBench Leaderboard

2 models

Rank	Model	Organization	Parameters	Context	Cost	License
1	Kimi K2 Instruct	Moonshot AI	1.0T	200K	$0.50 / $0.50
1	Kimi K2-Instruct-0905	Moonshot AI	1.0T	—	—
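
As a rough illustration of how the summary figures quoted in the FAQ (a top score of 0.765 and an average of 0.765) could relate to the table, the sketch below stores the two entries and derives both numbers. The `Entry` type, its field names, and the `summarize` helper are hypothetical. The second model's score is not shown on this page, so it is left as `None` and excluded from the average; that treatment is an assumption, not the leaderboard's documented method.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    # Hypothetical record type; field names are illustrative only.
    model: str
    organization: str
    score: Optional[float]  # None when the page does not show a score


entries = [
    Entry("Kimi K2 Instruct", "Moonshot AI", 0.765),
    Entry("Kimi K2-Instruct-0905", "Moonshot AI", None),  # score not shown in the source
]


def summarize(rows):
    """Return (leader, top_score, average) over the rows that have a score."""
    scored = [r for r in rows if r.score is not None]
    if not scored:
        return None, None, None
    leader = max(scored, key=lambda r: r.score)
    average = sum(r.score for r in scored) / len(scored)
    return leader.model, leader.score, round(average, 3)


leader, top, avg = summarize(entries)
print(f"{len(entries)} models; leader: {leader} ({top}); average: {avg}")
```

Under these assumptions the leader and the average both come out at 0.765, which is consistent with the figures the page reports, though the page does not say how unscored or tied entries are handled.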

FAQ

Common questions about ACEBench

What is the ACEBench benchmark?

ACEBench is a benchmark for evaluating the tool-use capabilities of large language models across three evaluation types: Normal (basic tool usage scenarios), Special (tool usage with ambiguous or incomplete instructions), and Agent (multi-agent interactions simulating real-world dialogues). It covers 4,538 APIs across 8 major domains and 68 sub-domains, including technology, finance, entertainment, society, health, culture, and environment, in both English and Chinese.

Where can I find the ACEBench paper?

The ACEBench paper is available at https://arxiv.org/abs/2501.12851. It describes the benchmark methodology, dataset creation, and evaluation criteria.

Where can I find the ACEBench dataset?

The ACEBench dataset is available at https://github.com/ACEBench/ACEBench.

What is the ACEBench leaderboard?

The ACEBench leaderboard ranks 2 AI models by their performance on this benchmark. Kimi K2 Instruct by Moonshot AI currently leads with a score of 0.765. The average score across all models is 0.765.

What is the highest ACEBench score?

The highest ACEBench score is 0.765, achieved by Kimi K2 Instruct from Moonshot AI.

How many models are evaluated on ACEBench?

Two models have been evaluated on ACEBench, with 0 verified results and 2 self-reported results.

What categories does ACEBench cover?

ACEBench is categorized under tool calling, finance, general, healthcare, and reasoning. It evaluates text models.