WMT24++
WMT24++ is a comprehensive multilingual machine translation benchmark that expands the WMT24 dataset to cover 55 languages and dialects. It includes human-written references and post-edits across four domains (literary, news, social, and speech) to evaluate machine translation systems and large language models across diverse linguistic contexts.
WMT24++ Leaderboard
19 models
| Rank | Organization | Parameters | Context | Cost |
|---|---|---|---|---|
| 1 | NVIDIA | 120B | — | — |
| 2 | | 32B | 262K | $0.06 / $0.24 |
| 3 | Alibaba Cloud / Qwen Team | — | — | — |
| 4 | Alibaba Cloud / Qwen Team | 397B | 262K | $0.60 / $3.60 |
| 5 | Alibaba Cloud / Qwen Team | 122B | 262K | $0.40 / $3.20 |
| 6 | Alibaba Cloud / Qwen Team | 27B | 262K | $0.30 / $2.40 |
| 7 | Alibaba Cloud / Qwen Team | 35B | 262K | $0.25 / $2.00 |
| 8 | Alibaba Cloud / Qwen Team | 9B | — | — |
| 9 | Alibaba Cloud / Qwen Team | 4B | — | — |
| 10 | Google | 27B | 131K | $0.10 / $0.20 |
| 11 | Google | 12B | 131K | $0.05 / $0.10 |
| 12 | | 2B | — | — |
| 12 | Google | 8B | 32K | $20.00 / $40.00 |
| 14 | Google | 4B | 131K | $0.02 / $0.04 |
| 15 | Alibaba Cloud / Qwen Team | 2B | — | — |
| 16 | | 2B | — | — |
| 16 | Google | 8B | — | — |
| 18 | Google | 1B | — | — |
| 19 | Alibaba Cloud / Qwen Team | 800M | — | — |
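The rank column skips 13 and 17 because two models share rank 12 and two share rank 16. This is consistent with standard competition ranking, presumably because the tied models have equal scores (an assumption, since the scores are not shown in the table). A minimal Python sketch of that scheme follows; the `competition_ranks` helper and the example scores are hypothetical and are not taken from the leaderboard.

```python
def competition_ranks(scores):
    """Return 1-based ranks for scores (higher is better).

    Tied scores share a rank, and the next distinct score takes the
    rank it would have had without the tie (1, 2, 2, 4, ...).
    """
    # Indices of the scores, sorted from best to worst.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0] * len(scores)
    for position, idx in enumerate(order, start=1):
        prev_idx = order[position - 2] if position > 1 else None
        if prev_idx is not None and scores[idx] == scores[prev_idx]:
            # Same score as the previous entry: reuse its rank.
            ranks[idx] = ranks[prev_idx]
        else:
            # New score: rank equals the 1-based position in sorted order.
            ranks[idx] = position
    return ranks


# Hypothetical scores, for illustration only.
example_scores = [0.70, 0.65, 0.65, 0.60]
print(competition_ranks(example_scores))
```

With these made-up scores, the second and third entries tie, so the fourth entry receives rank 4 rather than 3, mirroring the 12, 12, 14 pattern in the table.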
FAQ
Common questions about WMT24++
The WMT24++ paper is available at https://arxiv.org/abs/2502.12404. This paper provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.
The WMT24++ leaderboard ranks 19 models by their score on this benchmark. Nemotron 3 Super (120B A12B) from NVIDIA currently leads with the highest score, 0.867. The average score across all 19 models is 0.607.
All 19 results on the WMT24++ benchmark are self-reported; none have been verified.
WMT24++ is categorized as a language benchmark. It evaluates text models with multilingual support.