MiMo Coding Bench

MiMo Coding Bench evaluates coding-agent capabilities on software engineering tasks reported with the MiMo model family.

MiMo-V2.5-Pro from Xiaomi currently leads the MiMo Coding Bench leaderboard with a score of 0.737, the highest of the 2 evaluated AI models.

About this benchmark

What MiMo Coding Bench measures

MiMo Coding Bench is a text benchmark that evaluates large language models on agent and code tasks. LLM Stats tracks 2 models on this benchmark, with a maximum possible score of 1. The current average across reported models is 0.728, with the leader reaching 0.737.

MiMo-V2.5-Pro from Xiaomi leads with 73.7%, followed by MiMo-V2.5 from Xiaomi at 71.8%.

Progress Over Time

[Interactive timeline of model performance on MiMo Coding Bench over time, showing the state-of-the-art frontier and distinguishing open from proprietary models.]

MiMo Coding Bench Leaderboard

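2 models

Rank | Model | Organization | Parameters | Context | Cost (input / output)
1 | MiMo-V2.5-Pro | Xiaomi | 1.0T | 1.0M | $0.43 / $0.87
2 | MiMo-V2.5 | Xiaomi | 311B | 1.0M | $0.17 / $0.34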

FAQ

Common questions about MiMo Coding Bench.

What is the MiMo Coding Bench benchmark?

MiMo Coding Bench evaluates coding-agent capabilities on software engineering tasks reported with the MiMo model family.

What is the MiMo Coding Bench leaderboard?

The MiMo Coding Bench leaderboard ranks 2 AI models based on their performance on this benchmark. Currently, MiMo-V2.5-Pro by Xiaomi leads with a score of 0.737. The average score across all models is 0.728.
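
The leader and the average quoted above can be checked from the two reported scores. The following is a minimal sketch, assuming the average is an unweighted arithmetic mean rounded to three decimal places; the page does not state how the average is computed, and the variable names are illustrative only.

```python
from decimal import Decimal, ROUND_HALF_UP

# Scores as reported on this page (both are self-reported results).
scores = {
    "MiMo-V2.5-Pro": Decimal("0.737"),
    "MiMo-V2.5": Decimal("0.718"),
}

# Rank models by score, highest first.
ranking = sorted(scores.items(), key=lambda item: item[1], reverse=True)
leader_name, leader_score = ranking[0]

# Assumption: the leaderboard average is a plain mean of all reported scores.
mean = sum(scores.values()) / len(scores)
average = mean.quantize(Decimal("0.001"), rounding=ROUND_HALF_UP)

print(f"Leader: {leader_name} ({leader_score})")
print(f"Average: {average}")
```

Decimal arithmetic is used here so that the rounding of the exact midpoint 0.7275 does not depend on binary floating-point representation.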

What is the highest MiMo Coding Bench score?

The highest MiMo Coding Bench score is 0.737, achieved by MiMo-V2.5-Pro from Xiaomi.

How many models are evaluated on MiMo Coding Bench?

Two models have been evaluated on the MiMo Coding Bench benchmark, with 0 verified results and 2 self-reported results.

What categories does MiMo Coding Bench cover?

MiMo Coding Bench is categorized under agents and code. The benchmark evaluates text models.

What is the best open-source model on MiMo Coding Bench?

MiMo-V2.5-Pro by Xiaomi is the top-ranked open-source model on MiMo Coding Bench, with a score of 0.737 (rank #1).

Which model offers the best value on MiMo Coding Bench?

Among models scoring within 10% of the leader, MiMo-V2.5 from Xiaomi is the cheapest, at $0.17 per million input tokens with a score of 0.718.
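
The selection rule in this answer can be sketched as a filter followed by a minimum. This is an illustration under stated assumptions, not the site's actual method: "within 10% of the leader" is read as a score of at least 90% of the leader's score, cost is compared on input price only, and the $0.43 input price is taken from the first leaderboard row, assumed here to belong to MiMo-V2.5-Pro. The `Entry` type and `best_value` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    name: str
    score: float
    input_cost: float  # USD per million input tokens


def best_value(entries, tolerance=0.10):
    """Return the cheapest entry whose score is within `tolerance` of the leader.

    Assumption: "within 10%" means score >= leader_score * (1 - tolerance).
    """
    leader_score = max(e.score for e in entries)
    threshold = leader_score * (1 - tolerance)
    eligible = [e for e in entries if e.score >= threshold]
    return min(eligible, key=lambda e: e.input_cost)


entries = [
    # Input price assumed to come from the rank-1 leaderboard row.
    Entry("MiMo-V2.5-Pro", 0.737, 0.43),
    Entry("MiMo-V2.5", 0.718, 0.17),
]

winner = best_value(entries)
print(winner.name, winner.input_cost)
```

With only two models, both clear the threshold of roughly 0.663, so the result is decided by input price alone.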

How recent are the MiMo Coding Bench leaderboard results?

The MiMo Coding Bench leaderboard was last updated in June 2026 and currently includes 2 evaluated models.

More evaluations to explore

Related benchmarks in the same category

SWE-Bench Verified

A verified subset of 500 software engineering problems from real GitHub issues, validated by human annotators for evaluating language models' ability to resolve real-world coding issues by generating patches for Python codebases.

code
99 models
LiveCodeBench

LiveCodeBench is a holistic and contamination-free evaluation benchmark for large language models for code. It continuously collects new problems from programming contests (LeetCode, AtCoder, CodeForces) and evaluates four different scenarios: code generation, self-repair, code execution, and test output prediction. Problems are annotated with release dates to enable evaluation on unseen problems released after a model's training cutoff.

code
74 models
HumanEval

A benchmark that measures functional correctness for synthesizing programs from docstrings, consisting of 164 original programming problems assessing language comprehension, algorithms, and simple mathematics.

code
66 models
BrowseComp

BrowseComp is a benchmark comprising 1,266 questions that challenge AI agents to persistently navigate the internet in search of hard-to-find, entangled information. The benchmark measures agents' ability to exercise persistence in information gathering, demonstrate creativity in web navigation, and find concise, verifiable answers. Despite the difficulty of the questions, BrowseComp is simple to use, as predicted answers are short and easily verified against reference answers.

agents
48 models
Terminal-Bench 2.0

Terminal-Bench 2.0 is an updated benchmark for testing AI agents' ability to use tools to operate a computer via the terminal. It evaluates how well models can handle real-world, end-to-end tasks autonomously, including compiling code, training models, setting up servers, system administration, security tasks, data science workflows, and cybersecurity vulnerabilities.

agents
46 models
SWE-bench Multilingual

A multilingual benchmark for issue resolving in software engineering that covers Java, TypeScript, JavaScript, Go, Rust, C, and C++. Contains 1,632 high-quality instances carefully annotated from 2,456 candidates by 68 expert annotators, designed to evaluate Large Language Models across diverse software ecosystems beyond Python.

code
30 models