SpreadSheetBench-v1

SpreadSheetBench-v1 evaluates office automation agents on spreadsheet reasoning and manipulation tasks, measuring the ability to analyze, transform, and operate on spreadsheet data through tools.

MiniMax M3 from MiniMax currently leads the SpreadSheetBench-v1 leaderboard with a score of 0.893 across 2 evaluated AI models.

About this benchmark

What SpreadSheetBench-v1 measures

SpreadSheetBench-v1 is a text benchmark that evaluates large language models on tool-calling, productivity, and agent tasks. LLM Stats tracks 2 models on this benchmark, with a maximum possible score of 1. The current average across reported models is 0.882, and the leader reaches 0.893.


MiniMax M3 (MiniMax) leads with 89.3%, followed by Qwen3.7 Max (Alibaba Cloud / Qwen Team) at 87.0%.

Progress Over Time

[Interactive timeline: model performance over time on SpreadSheetBench-v1. Legend: state-of-the-art frontier, open models, proprietary models.]

SpreadSheetBench-v1 Leaderboard

2 models

Rank  Model        Organization               Score  Context  Cost (input / output)
1     MiniMax M3   MiniMax                    0.893  1.0M     $0.60 / $2.40
2     Qwen3.7 Max  Alibaba Cloud / Qwen Team  0.870  1.0M     $1.25 / $3.75

FAQ

Common questions about SpreadSheetBench-v1.

What is the SpreadSheetBench-v1 benchmark?

SpreadSheetBench-v1 evaluates office automation agents on spreadsheet reasoning and manipulation tasks, measuring the ability to analyze, transform, and operate on spreadsheet data through tools.

What is the SpreadSheetBench-v1 leaderboard?

The SpreadSheetBench-v1 leaderboard ranks 2 AI models based on their performance on this benchmark. Currently, MiniMax M3 by MiniMax leads with a score of 0.893. The average score across all models is 0.882.
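
The leader and average figures above can be reproduced from the two reported scores. The following is a minimal sketch, not the site's actual method: it assumes the average is a plain arithmetic mean rounded half-up to three decimals, and it takes Qwen3.7 Max's score as 0.870 (the 87.0% reported on this page). The function name `mean_score` is hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP

# Scores as reported on this page, expressed on a 0-1 scale.
scores = {
    "MiniMax M3": Decimal("0.893"),
    "Qwen3.7 Max": Decimal("0.870"),
}

def mean_score(values):
    """Arithmetic mean rounded half-up to three decimals.

    The rounding mode is an assumption; the page does not state it.
    """
    values = list(values)
    total = sum(values, Decimal("0"))
    return (total / len(values)).quantize(Decimal("0.001"), rounding=ROUND_HALF_UP)

leader = max(scores, key=scores.get)   # model with the highest score
average = mean_score(scores.values())  # (0.893 + 0.870) / 2 = 0.8815 -> 0.882

print(leader, scores[leader], average)  # MiniMax M3 0.893 0.882
```

Using `Decimal` rather than floats keeps the halfway case 0.8815 exact, so the rounded result does not depend on binary floating-point representation.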

What is the highest SpreadSheetBench-v1 score?

The highest SpreadSheetBench-v1 score is 0.893, achieved by MiniMax M3 from MiniMax.

How many models are evaluated on SpreadSheetBench-v1?

Two models have been evaluated on the SpreadSheetBench-v1 benchmark. Both results are self-reported; none have been verified.

What categories does SpreadSheetBench-v1 cover?

SpreadSheetBench-v1 is categorized under tool calling, productivity, and agents. The benchmark evaluates text models.

What is the best open-source model on SpreadSheetBench-v1?

MiniMax M3 by MiniMax is the top-ranked open-source model on SpreadSheetBench-v1, with a score of 0.893 (rank #1).

Which model offers the best value on SpreadSheetBench-v1?

Among models scoring within 10% of the leader, MiniMax M3 from MiniMax is the cheapest, at $0.60 per million input tokens with a score of 0.893.
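
The selection rule described above can be sketched as a filter followed by a minimum. This is an illustration under stated assumptions, not the site's implementation: "within 10% of the leader" is read here as a score of at least 90% of the top score, and the names `Entry` and `best_value` are hypothetical. Scores and input prices are the ones listed on this page.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Entry:
    model: str
    score: float        # benchmark score on a 0-1 scale
    input_price: float  # USD per million input tokens

def best_value(entries, tolerance=0.10):
    """Return the cheapest entry whose score is within `tolerance` of the leader.

    Assumption: "within 10%" means score >= leader_score * (1 - tolerance).
    """
    top = max(e.score for e in entries)
    eligible = [e for e in entries if e.score >= top * (1 - tolerance)]
    return min(eligible, key=lambda e: e.input_price)

entries = [
    Entry("MiniMax M3", 0.893, 0.60),
    Entry("Qwen3.7 Max", 0.870, 1.25),
]

pick = best_value(entries)
print(pick.model, pick.input_price)  # MiniMax M3 0.6
```

With only two models, both fall inside the 10% band, so the choice reduces to the lower input price.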

How recent are the SpreadSheetBench-v1 leaderboard results?

The SpreadSheetBench-v1 leaderboard was last updated in June 2026 and currently includes 2 evaluated models.

More evaluations to explore

Related benchmarks in the same category

BrowseComp

BrowseComp is a benchmark comprising 1,266 questions that challenge AI agents to persistently navigate the internet in search of hard-to-find, entangled information. The benchmark measures agents' ability to exercise persistence in information gathering, demonstrate creativity in web navigation, and find concise, verifiable answers. Despite the difficulty of the questions, BrowseComp is simple and easy-to-use, as predicted answers are short and easily verifiable against reference answers.

agents
48 models
Terminal-Bench 2.0

Terminal-Bench 2.0 is an updated benchmark that tests how well AI agents use tools to operate a computer through a terminal. It evaluates how well models can handle real-world, end-to-end tasks autonomously, including compiling code, training models, setting up servers, system administration, security tasks, data science workflows, and cybersecurity vulnerabilities.

tool calling
44 models
Tau2 Telecom

The τ²-Bench telecom domain evaluates conversational agents in a dual-control environment modeled as a Dec-POMDP, where both the agent and the user use tools in shared telecommunications troubleshooting scenarios that test coordination and communication capabilities.

tool calling
34 models
SWE-Bench Pro

SWE-Bench Pro is an advanced version of SWE-Bench that evaluates language models on complex, real-world software engineering tasks requiring extended reasoning and multi-step problem solving.

agents
27 models
Tau2 Retail

The τ²-Bench retail domain evaluates conversational AI agents in customer service scenarios within a dual-control environment where both the agent and the user can interact with tools. It tests tool-agent-user interaction, rule adherence, and task consistency in retail customer support contexts.

tool calling
26 models
TAU-bench Retail

TAU-bench Retail is a benchmark for evaluating tool-agent-user interaction in retail environments. It tests language agents' ability to handle dynamic conversations with users while using domain-specific API tools and following policy guidelines. Agents are evaluated on tasks such as order cancellations, address changes, and order status checks through multi-turn conversations.

tool calling
25 models