WMT24++

Name: WMT24++ Leaderboard — AI Model Scores
Creator: LLM Stats
License: https://llm-stats.com/legal/terms-of-service

WMT24++ Leaderboard

23 models

Rank	Model (Organization)	Parameters	Context (tokens)	Cost (input / output, USD per 1M tokens)
1	Nemotron 3 Super (120B A12B) NVIDIA	120B	—	—
2	Nemotron 3 Nano (30B A3B) NVIDIA	32B	262K	$0.06 / $0.24
3	Qwen3.7 Max Alibaba Cloud / Qwen Team	—	1.0M	$1.25 / $3.75
4	Qwen3.7-Plus Alibaba Cloud / Qwen Team	—	1.0M	$0.32 / $1.28
5	Qwen3.6 Plus Alibaba Cloud / Qwen Team	—	1.0M	$0.50 / $3.00
6	Nemotron 3 Ultra (550B A55B) NVIDIA	550B	—	—
7	Command A+ Cohere	218B	—	—
8	Qwen3.5-397B-A17B Alibaba Cloud / Qwen Team	397B	—	—
9	Qwen3.5-122B-A10B Alibaba Cloud / Qwen Team	122B	—	—
10	Qwen3.5-27B Alibaba Cloud / Qwen Team	27B	262K	$0.30 / $2.40
11	Qwen3.5-35B-A3B Alibaba Cloud / Qwen Team	35B	—	—
12	Qwen3.5-9B Alibaba Cloud / Qwen Team	9B	—	—
13	Qwen3.5-4B Alibaba Cloud / Qwen Team	4B	—	—
14	Gemma 3 27B Google	27B	—	—
15	Gemma 3 12B Google	12B	—	—
16	Gemma 3n E4B Instructed Google	8B	—	—
16	Gemma 3n E4B Instructed LiteRT Preview Google	2B	—	—
18	Gemma 3 4B Google	4B	—	—
19	Qwen3.5-2B Alibaba Cloud / Qwen Team	2B	—	—
20	Gemma 3n E2B Instructed LiteRT (Preview) Google	2B	—	—
20	Gemma 3n E2B Instructed Google	8B	—	—
22	Gemma 3 1B Google	1B	—	—
23	Qwen3.5-0.8B Alibaba Cloud / Qwen Team	800M	—	—
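
Rows like the ones above can be loaded into typed records for further analysis. The following is a minimal sketch, not an LLM Stats tool. It assumes the five tab-separated fields are rank, model with organization, parameter count, context window, and cost, that `K` and `M` are decimal multipliers, and that the long dash marks an unlisted value. The names `Row`, `parse_context`, `parse_cost`, and `parse_row` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

MISSING = "\u2014"  # the dash the table prints for unlisted values


@dataclass
class Row:
    rank: int
    model: str                               # model name plus organization, as printed
    parameters: Optional[str]                # e.g. "120B"; None if unlisted
    context_tokens: Optional[int]            # None if unlisted
    cost_usd: Optional[Tuple[float, float]]  # the two listed prices, in order


def parse_context(cell: str) -> Optional[int]:
    """Convert '262K' or '1.0M' to a token count (assumes decimal K/M)."""
    if cell == MISSING:
        return None
    scale = {"K": 1_000, "M": 1_000_000}[cell[-1]]
    return round(float(cell[:-1]) * scale)


def parse_cost(cell: str) -> Optional[Tuple[float, float]]:
    """Convert '$0.06 / $0.24' to (0.06, 0.24)."""
    if cell == MISSING:
        return None
    first, second = (float(part.strip().lstrip("$")) for part in cell.split("/"))
    return (first, second)


def parse_row(line: str) -> Row:
    """Split one tab-separated leaderboard line into a Row."""
    rank, model, params, context, cost = line.split("\t")
    return Row(
        rank=int(rank),
        model=model,
        parameters=None if params == MISSING else params,
        context_tokens=parse_context(context),
        cost_usd=parse_cost(cost),
    )


row = parse_row("2\tNemotron 3 Nano (30B A3B) NVIDIA\t32B\t262K\t$0.06 / $0.24")
print(row.context_tokens)  # 262000
print(row.cost_usd)        # (0.06, 0.24)
```

Keeping unlisted values as `None` rather than zero avoids treating a model with no published price as free when filtering or sorting.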

About this benchmark

What is WMT24++?

WMT24++ is a comprehensive multilingual machine translation benchmark that expands the WMT24 dataset to cover 55 languages and dialects. It includes human-written references and post-edits across four domains (literary, news, social, and speech) to evaluate machine translation systems and large language models across diverse linguistic contexts.

LLM Stats classifies WMT24++ as a text benchmark in the language category and tracks 23 models on it, scored on a 0–1 scale. The current average is 0.647, and the leader scores 0.867.

Current leaders

Of the 23 models evaluated, Nemotron 3 Super (120B A12B) from NVIDIA currently leads the WMT24++ leaderboard with a score of 0.867.

Nemotron 3 Super (120B A12B), NVIDIA: 86.7%

Nemotron 3 Nano (30B A3B), NVIDIA: 86.2%

Qwen3.7 Max, Alibaba Cloud / Qwen Team: 85.8%

Source paper

Title: WMT24++: Expanding the Language Coverage of WMT24 to 55 Languages & Dialects
Authors: Daniel Deutsch, Eleftheria Briakou, Isaac Caswell, Mara Finkelstein, and 13 others
Published: February 18, 2025
arXiv: 2502.12404

Abstract

As large language models (LLM) become more and more capable in languages other than English, it is important to collect benchmark datasets in order to evaluate their multilingual performance, including on tasks like machine translation (MT). In this work, we extend the WMT24 dataset to cover 55 languages by collecting new human-written references and post-edits for 46 new languages and dialects in addition to post-edits of the references in 8 out of 9 languages in the original WMT24 dataset. The dataset covers four domains: literary, news, social, and speech. We benchmark a variety of MT providers and LLMs on the collected dataset using automatic metrics and find that LLMs are the best-performing MT systems in all 55 languages. These results should be confirmed using a human-based evaluation, which we leave for future work.

FAQ

Common questions about the WMT24++ benchmark and leaderboard.

What is the WMT24++ leaderboard?

The WMT24++ leaderboard ranks 23 AI models based on their performance on this benchmark. Currently, Nemotron 3 Super (120B A12B) by NVIDIA leads with a score of 0.867. The average score across all models is 0.647.

What is the highest WMT24++ score?

The highest WMT24++ score is 0.867, achieved by Nemotron 3 Super (120B A12B) from NVIDIA.

How many models are evaluated on WMT24++?

LLM Stats lists 23 models evaluated on the WMT24++ benchmark. All 23 results are self-reported; none have been verified.

Where can I find the WMT24++ paper?

The WMT24++ paper is available at https://arxiv.org/abs/2502.12404. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does WMT24++ cover?

LLM Stats files WMT24++ under the language category. It is a text-only benchmark that evaluates multilingual machine translation.

What is the best open-source model on WMT24++?

Nemotron 3 Super (120B A12B) by NVIDIA is the top-ranked open-source model on WMT24++, with a score of 0.867 (rank #1).

Which model offers the best value on WMT24++?

Among models scoring within 10% of the leader, Nemotron 3 Nano (30B A3B) from NVIDIA is the cheapest, at $0.06 per million input tokens with a score of 0.862.
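
The selection rule in this answer can be sketched in a few lines. This is an illustration under stated assumptions, not the site's actual method: "within 10% of the leader" is read as a relative threshold (score at least 0.9 times the top score), models without a listed price are excluded, and only the three scores stated on this page are used. The names `Entry` and `best_value` are hypothetical.

```python
from typing import List, NamedTuple, Optional


class Entry(NamedTuple):
    model: str
    score: float                  # 0-1 scale
    input_price: Optional[float]  # USD per million input tokens; None if unlisted


# Only the three scores stated on this page; prices come from the leaderboard table.
entries = [
    Entry("Nemotron 3 Super (120B A12B)", 0.867, None),
    Entry("Nemotron 3 Nano (30B A3B)", 0.862, 0.06),
    Entry("Qwen3.7 Max", 0.858, 1.25),
]


def best_value(entries: List[Entry], tolerance: float = 0.10) -> Optional[Entry]:
    """Return the cheapest priced entry scoring within `tolerance` of the leader."""
    if not entries:
        return None
    leader_score = max(e.score for e in entries)
    threshold = leader_score * (1 - tolerance)  # assumption: relative, not absolute
    candidates = [
        e for e in entries
        if e.input_price is not None and e.score >= threshold
    ]
    return min(candidates, key=lambda e: e.input_price, default=None)


print(best_value(entries).model)  # Nemotron 3 Nano (30B A3B)
```

Under an absolute reading (score at least 0.867 minus 0.10), the same three entries would still qualify, so the result here does not depend on which interpretation is used.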

How recent are the WMT24++ leaderboard results?

The WMT24++ leaderboard was last updated in July 2026 and currently includes 23 evaluated models.