MCP-Mark

MCP-Mark evaluates LLMs on their ability to use Model Context Protocol (MCP) tools effectively, testing tool discovery, selection, invocation, and result interpretation across diverse MCP server scenarios.
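MCP tool use is built on JSON-RPC 2.0: a client discovers a server's tools with a `tools/list` request, invokes one with `tools/call`, and then interprets the returned content blocks, which maps directly onto the discovery, selection, invocation, and interpretation skills the benchmark measures. The TypeScript sketch below shows the shape of that exchange; the `get_weather` tool and its arguments are hypothetical placeholders, not part of MCP-Mark itself.

```typescript
// Minimal sketch of the JSON-RPC messages behind an MCP tool call.
// The tool name and arguments below are invented for illustration;
// real servers advertise their own tools via tools/list.

interface JsonRpcRequest {
  jsonrpc: "2.0";
  id: number;
  method: string;
  params?: Record<string, unknown>;
}

// Step 1: discovery -- ask the server which tools it exposes.
// The response lists each tool's name, description, and inputSchema.
const listTools: JsonRpcRequest = {
  jsonrpc: "2.0",
  id: 1,
  method: "tools/list",
};

// Steps 2 and 3: selection and invocation -- call the chosen tool by
// name, with arguments conforming to its inputSchema.
const callTool: JsonRpcRequest = {
  jsonrpc: "2.0",
  id: 2,
  method: "tools/call",
  params: {
    name: "get_weather",           // hypothetical tool name
    arguments: { city: "Berlin" }, // hypothetical arguments
  },
};

// Step 4: interpretation -- the server replies with content blocks and
// an isError flag that the model must read and act on, e.g.:
//   { content: [{ type: "text", text: "12°C, overcast" }], isError: false }

console.log(JSON.stringify(listTools));
console.log(JSON.stringify(callTool));
```

A model scores well on such tasks only if it picks the right tool from the advertised list, supplies schema-valid arguments, and correctly folds the result back into its answer.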

Kimi K2.6 from Moonshot AI currently leads the MCP-Mark leaderboard of 5 evaluated AI models with a score of 0.559.

Moonshot AI's Kimi K2.6 leads with 55.9%, followed by the Alibaba Cloud / Qwen Team's Qwen3.6 Plus at 48.2% and Qwen3.5-397B-A17B at 46.1%.

Progress Over Time

[Interactive timeline showing model performance evolution on MCP-Mark, tracing the state-of-the-art frontier and distinguishing open from proprietary models.]
MCP-Mark Leaderboard

5 models

| Rank | Organization | Model | Score | Params | Context | Cost (input / output) | License |
|------|--------------|-------|-------|--------|---------|------------------------|---------|
| 1 | Moonshot AI | Kimi K2.6 | 0.559 | 1.0T | 262K | $0.95 / $4.00 | |
| 2 | Alibaba Cloud / Qwen Team | Qwen3.6 Plus | 0.482 | | 1.0M | $0.50 / $3.00 | |
| 3 | Alibaba Cloud / Qwen Team | Qwen3.5-397B-A17B | 0.461 | 397B | 262K | $0.60 / $3.60 | |
| 4 | | | | 685B | 164K | $0.26 / $0.38 | |
| 5 | Alibaba Cloud / Qwen Team | | | 35B | | | |

FAQ

Common questions about MCP-Mark.

What is the MCP-Mark benchmark?

MCP-Mark evaluates LLMs on their ability to use Model Context Protocol (MCP) tools effectively, testing tool discovery, selection, invocation, and result interpretation across diverse MCP server scenarios.

What is the MCP-Mark leaderboard?

The MCP-Mark leaderboard ranks 5 AI models based on their performance on this benchmark. Currently, Kimi K2.6 by Moonshot AI leads with a score of 0.559. The average score across all models is 0.450.
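As a sanity check on that average: assuming 0.450 is exact, the two scores not quoted above must sum to 5 × 0.450 − (0.559 + 0.482 + 0.461) = 0.748.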

What is the highest MCP-Mark score?

The highest MCP-Mark score is 0.559, achieved by Kimi K2.6 from Moonshot AI.

How many models are evaluated on MCP-Mark?

Five models have been evaluated on the MCP-Mark benchmark. All 5 results are self-reported; none have been independently verified.

What categories does MCP-Mark cover?

MCP-Mark is categorized under agents and tool calling. The benchmark evaluates text models.

More evaluations to explore

Related benchmarks in the same category

BrowseComp

BrowseComp is a benchmark comprising 1,266 questions that challenge AI agents to persistently navigate the internet in search of hard-to-find, entangled information. The benchmark measures agents' ability to exercise persistence in information gathering, demonstrate creativity in web navigation, and find concise, verifiable answers. Despite the difficulty of the questions, BrowseComp is simple and easy to use, as predicted answers are short and easily verifiable against reference answers.

agents · 45 models
Terminal-Bench 2.0

Terminal-Bench 2.0 is an updated benchmark that tests AI agents' ability to operate a computer through the terminal using tools. It evaluates how well models autonomously handle real-world, end-to-end tasks, including compiling code, training models, setting up servers, system administration, data science workflows, and security tasks such as working with cybersecurity vulnerabilities.

agents · 39 models
Tau2 Telecom

The τ²-Bench telecom domain evaluates conversational agents in a dual-control environment modeled as a Dec-POMDP, where both the agent and the user use tools in shared telecommunications troubleshooting scenarios that test coordination and communication capabilities.

tool calling · 30 models
TAU-bench Retail

TAU-bench Retail evaluates tool-agent-user interaction in retail environments, testing language agents' ability to handle dynamic conversations with users while using domain-specific API tools and following policy guidelines. Agents are evaluated on tasks like order cancellations, address changes, and order status checks through multi-turn conversations.

tool calling · 25 models
Tau2 Retail

The τ²-Bench retail domain evaluates conversational AI agents in customer service scenarios within a dual-control environment where both the agent and the user can interact with tools. It tests tool-agent-user interaction, rule adherence, and task consistency in retail customer support contexts.

tool calling · 23 models
TAU-bench Airline

Part of τ-bench (TAU-bench), a benchmark for Tool-Agent-User interaction in real-world domains, the airline domain evaluates language agents' ability to interact with users through dynamic conversations while following domain-specific rules and using API tools. Agents must handle airline-related tasks and policies reliably.

tool calling · 23 models