Back to blog
Model Release·Technical Deep Dive

Unisound U2: Benchmarks, Pricing, and Full Specs

Unisound U2 is a 266B-total / 10B-active MoE built for agents. Independently verified: 86.9% GPQA Diamond, 85.8% MATH-500, 73.4% SWE-bench Verified, at $0.15/$0.30 per 1M tokens.

Sebastian Crossa
Sebastian Crossa
Co-Founder @ LLM Stats
·10 min read
Unisound U2: Benchmarks, Pricing, and Full Specs

Unisound released U2 on June 8, 2026, its next-generation flagship and the first model the company has positioned squarely as an agent foundation model. Unisound (Yunzhisheng, HKEX: 9678) is a Beijing AI company founded in 2012, best known for conversational and healthcare AI deployed across more than 100 hospitals, and for its earlier 60B-parameter UniGPT model. U2 is a different kind of release: a 266B-total, 10B-active Mixture-of-Experts model built to run long, multi-step workflows on its own.

The headline is intelligence density. U2 posts frontier-class reasoning and coding scores from a small active footprint, and it prices aggressively at $0.15 / $0.30 per million tokens. Every benchmark in this post was run on the LLM Stats ZeroEval hub, so the numbers below are independently verified rather than self-reported by the lab.

Key Numbers

Unisound U2 · June 5, 2026

0.0%
GPQA Diamond
0.0%
MATH-500
0.0%
SWE-bench Verified
0.0%
AIME 2025
0B
Active parameters
$0.00
Input / 1M tokens

266B total parameters, 10B active per token. Frontier-class reasoning and code at a fraction of frontier pricing, with every score above independently verified on the ZeroEval hub.

At a Glance

  • Release date: June 8, 2026. Announced and released the same day.
  • Developer: Unisound / Yunzhisheng (HKEX: 9678), Beijing.
  • Model ID: u2.
  • Architecture: Mixture-of-Experts, 266B total parameters, 10B active per token.
  • Training tokens: 15 trillion.
  • Modalities: Text input, text output. Not multimodal at launch.
  • Knowledge cutoff: January 31, 2026.
  • License: Proprietary.
  • Pricing: $0.15 input / $0.30 output / $0.003 cached input per 1M tokens.
  • API: OpenAI-compatible MaaS endpoint (maas-api.hivoice.cn), plus LLM Stats.
  • Context window: Not specified by Unisound at launch.

The Three Advances

Unisound frames U2 around three claims: high intelligence density, high token efficiency, and a native agent architecture. They are tightly related. The first two are two sides of the same engineering bet, that careful data curation and dense semantic representation let a smaller model do the work of a larger one.

What U2 is built on

Unisound · 3 advances

Three bets,
one model.

01

Intelligence Density

More capability packed into a smaller footprint.

  • Refined data curation over 15T training tokens
  • Dense semantic representation per parameter
  • Frontier-class scores from 10B active parameters
10B
active / 266B total
02

Token Efficiency

Higher information yield for every token spent.

  • Fewer tokens to reach the same answer
  • Lower inference cost without trading away capability
  • A 50x discount on cached input tokens
$0.15
per 1M input tokens
03

Native Agent Architecture

Decompose, execute, verify, and optimize in one loop.

  • Native reasoning-path distillation framework
  • Closed-loop task execution, not single-shot answers
  • Built to chain 100+ steps in real-world workflows
100+
steps per workflow
Framing per Unisound's U2 announcement. Step count refers to multi-step agentic workflows the model is designed to run end to end.

The number that anchors all of this is the activation ratio. 10B active out of 266B total means roughly 3.8% of the network fires on any given token, so you pay inference cost closer to a 10B model while drawing on the knowledge of a 266B one. Combined with a per-token price of $0.15 input / $0.30 output, that is the entire pitch: capability that used to require a frontier-tier model, at a price that makes high-volume agent loops economical. The claim only holds if the scores are real, which is where the benchmarks come in.


Benchmarks

Every score below was produced on the LLM Stats ZeroEval hub on June 15, 2026. These are not Unisound's self-reported numbers. We ran the standard subsets and report them as measured, including the ones that are unflattering.

Independently verified

ZeroEval hub · Jun 2026

Frontier reasoning.
Honest weak spots.

Reasoning & Knowledge
GPQA Diamond
diamond
86.9
MATH-500
test
85.8
AIME 2025
2025
73.3
Coding & Agents
SWE-bench Verified
verified
72.2
Terminal-Bench 2.0
full
43.8
Multi-Challenge
default
37.6
Long Context & Grounding
MRCR v2
8-needle · 4–8K
76.6
LongBench v2
short
54.4
FACTS Grounding
default
44.3
HealthBench
hard · mean
19.1
NoLiMa
hard · 16K
8.5
Accuracy unless noted. HealthBench reports mean score on the hard subset. All runs executed on the ZeroEval hub, June 15, 2026. Subsets shown beneath each benchmark.

Where U2 is strong

On structured reasoning and code, U2 lands in frontier territory. 86.9% on GPQA Diamond and 85.8% on MATH-500 are top-tier results, and 73.3% on AIME 2025 confirms it can handle competition-grade math rather than just textbook problems. 73.4% on SWE-bench Verified puts it firmly in the useful-coding-assistant band. For a model with a 10B active footprint, this is the result that makes the intelligence-density claim credible.

BenchmarkSubsetScore
GPQA Diamonddiamond86.9%
MATH-500test85.8%
AIME 2025202573.3%
SWE-bench Verifiedverified73.4%
MRCR v28-needle, 4-8K76.6%

Where U2 is weak

The profile is not uniform, and the soft spots cluster in two places. The first is grounded factuality and instruction following: 44.3% on FACTS Grounding and 37.6% on Multi-Challenge say U2 is more comfortable reasoning toward an answer than staying tightly anchored to supplied facts across many turns. The second is adversarial long-context recall: 54.4% on LongBench v2 (short) is middling, and 8.5% on NoLiMa (hard, 16K) is a genuine weakness, the kind of needle-in-a-haystack stress test where dense distractors break retrieval. HealthBench scores 19.1% on its hard subset, measured as mean score rather than accuracy.

BenchmarkSubsetScore
LongBench v2short54.4%
FACTS Groundingdefault44.3%
Terminal-Bench 2.0full43.8%
Multi-Challengedefault37.6%
HealthBenchhard, mean score19.1%
NoLiMahard, 16K8.5%

The honest read: U2 is a reasoning and coding specialist, not a long-context retrieval engine. If your workload depends on pulling exact facts out of large, noisy contexts, test it carefully before you commit. If your workload is math, code, and structured problem solving, the scores are hard to argue with at this price.


Native Agent Architecture

The third advance is the one Unisound leans on hardest. U2 was trained with what the company calls a native reasoning-path distillation framework, and the practical result is a model built to run a closed loop rather than emit a single answer. Unisound describes U2 as able to autonomously decompose, execute, verify, and optimize a task, and frames it as capable of chaining 100+ steps in complex real-world workflows.

The agent loop

native, not bolted on

It does not answer.
It finishes the task.

01
Decompose
Break the goal into ordered sub-tasks.
02
Execute
Run each step with tools and code.
03
Verify
Check results against the objective.
04
Optimize
Correct, refine, and re-plan.
Closed loop · repeats until the task is verified complete
Trained with a native reasoning-path distillation framework, U2 runs this loop end to end rather than returning a single-shot response.

The distinction matters for how you use it. A model that is good at single-turn answers still needs an external harness to plan, call tools, check its own work, and retry. U2 is designed to internalize that loop, which is why the strong SWE-bench Verified result (a benchmark that rewards iterating toward a working patch) lines up with the architecture story better than a one-shot Q&A score would. The weaker Terminal-Bench 2.0 number (43.8%) is the counterweight: fully autonomous, open-ended terminal tasks are still hard, and U2 is no exception.


Pricing & Access

DetailValue
Input price$0.15 / 1M tokens
Output price$0.30 / 1M tokens
Cached input$0.003 / 1M tokens
Total parameters266B (Mixture-of-Experts)
Active parameters10B per token
Model IDu2
APIOpenAI-compatible MaaS endpoint
LicenseProprietary

At $0.15 input and $0.30 output per million tokens, U2 lands in the cheapest tier of any model posting these reasoning scores, roughly 1 to 3% of what frontier-tier models list for. The detail that matters most for agents is the cache: $0.003 per million cached input tokens is a 50x discount, so a long system prompt or tool spec reused across dozens of loop iterations costs almost nothing after the first call. For multi-step agent harnesses that re-send the same preamble on every turn, cached input becomes the dominant cost lever, and U2 prices it to near zero.

U2 is served through Unisound's OpenAI-compatible MaaS API and is available to run on LLM Stats. There is no public open-weights release at launch; the license is proprietary.


Bottom Line

U2 is a clean expression of one idea: push intelligence density and token efficiency hard enough that a 10B-active model can do frontier-class reasoning and code, then price it so low that running it in long agent loops is a rounding error. On the benchmarks that test that thesis, GPQA Diamond, MATH-500, AIME, and SWE-bench Verified, the verified scores hold up.

The caveats are equally clear. U2 gives ground on grounded factuality and falls off sharply on adversarial long-context recall, and Unisound has not disclosed a context window. This is a reasoning and agent model first. If that matches your workload, U2 is one of the most cost-effective options available right now. If you need reliable recall over large, messy contexts, verify on your own data before you switch. The fastest way to decide is to run it against your current model in the LLM Stats Playground.

Questions

Frequently Asked Questions

  • Unisound U2 is the next-generation flagship model from Unisound (Yunzhisheng, HKEX: 9678), released June 8, 2026. It is a native agentic Mixture-of-Experts model with 266B total and 10B active parameters, designed to autonomously decompose, execute, verify, and optimize multi-step tasks in a closed loop.
  • U2 pricing is $0.15 per 1M input tokens and $0.30 per 1M output tokensthrough Unisound's MaaS API. Cached input tokens are $0.003 per 1M, a 50x discount that makes repeated long prompts in agent loops very cheap.
  • On LLM Stats' ZeroEval hub, U2 scores 86.9% on GPQA Diamond, 85.8% on MATH-500, 73.3% on AIME 2025, and 73.4% on SWE-bench Verified. It is weaker on grounded factuality (44.3% FACTS Grounding) and adversarial long-context recall (8.5% NoLiMa hard, 16K). All scores are independently verified, not self-reported.
  • U2 is a Mixture-of-Experts model with 266 billion total parameters and 10 billion active per token (about 3.8% activation). It was trained on 15 trillion tokens with a knowledge cutoff of January 31, 2026.
  • No. At launch, U2 is text in, text out. It does not accept image, audio, or video input. Unisound's earlier UniGPT model carried multimodal capabilities, but the U2 release is a text-only reasoning and agent model.
  • You can run U2 against any other model for free on the LLM Stats Playground. It is also served directly through Unisound's OpenAI-compatible MaaS API.

Continue Reading