MMAU Sound
MMAU Sound Leaderboard
| Rank | Model | Organization | Parameters | Context | Cost | License | Score |
|---|---|---|---|---|---|---|---|
| 1 | Qwen2.5-Omni-7B | Alibaba Cloud / Qwen Team | 7B | — | — | — | 0.679 |
What is MMAU Sound?
MMAU Sound is the subset of the MMAU benchmark that focuses on environmental sound understanding and reasoning. The full MMAU benchmark evaluates multimodal audio understanding models on expert-level knowledge and complex reasoning. This subset restricts that evaluation to environmental sound clips.
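This page does not describe how a subset score is derived from the full benchmark. The following is a minimal sketch under two assumptions: each question has a single correct answer choice, and the Sound score is plain accuracy over the questions labeled as environmental sound. The `Item` record, its `domain` labels, and `subset_accuracy` are hypothetical names for illustration. They are not part of any official MMAU tooling.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """One benchmark question (hypothetical record layout)."""
    question_id: str
    domain: str  # assumed labels: "sound", "music", or "speech"
    answer: str  # the correct answer choice


def subset_accuracy(items: list[Item], predictions: dict[str, str], domain: str) -> float:
    """Fraction of questions in `domain` whose predicted choice matches the answer.

    Returns a score on a 0-1 scale. Questions with no prediction count as wrong.
    """
    subset = [item for item in items if item.domain == domain]
    if not subset:
        raise ValueError(f"no items with domain {domain!r}")
    correct = sum(1 for item in subset if predictions.get(item.question_id) == item.answer)
    return correct / len(subset)


# Toy data (invented for illustration): two sound questions, one music, one speech.
items = [
    Item("q1", "sound", "A"),
    Item("q2", "sound", "C"),
    Item("q3", "music", "B"),
    Item("q4", "speech", "D"),
]
predictions = {"q1": "A", "q2": "B", "q3": "B", "q4": "D"}

print(subset_accuracy(items, predictions, "sound"))  # 1 of 2 sound questions correct -> 0.5
```

In this sketch, filtering by domain before computing accuracy means that a model's music and speech answers have no effect on its Sound score.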
MMAU Sound evaluates models on multimodal, reasoning, and audio tasks. LLM Stats tracks one model on this benchmark and scores it on a 0–1 scale. Because only one model is tracked, the average and the leading score are the same value: 0.679, which is displayed rounded to 0.7.
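The page does not state how its summary statistics are computed. The following is a minimal sketch that assumes the average is a plain arithmetic mean over the tracked models and the leader is the model with the highest score. The function name `summarize` is hypothetical. The sketch shows why a single 0.679 entry produces both an average and a leader value of 0.7 once the results are rounded to one decimal place.

```python
def summarize(scores: dict[str, float]) -> tuple[float, str, float]:
    """Return (average score, leading model, leading score) for a leaderboard.

    Assumes a plain arithmetic mean and a highest-score-wins leader.
    """
    if not scores:
        raise ValueError("leaderboard is empty")
    average = sum(scores.values()) / len(scores)
    leader = max(scores, key=scores.get)
    return average, leader, scores[leader]


# The only entry currently tracked on this page.
scores = {"Qwen2.5-Omni-7B": 0.679}
average, leader, best = summarize(scores)
print(f"average {average:.1f}, leader {leader} at {best:.1f}")
```

When a second model is added, the average and the leading score will no longer coincide, unless both models have identical scores.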
Compare leaders on the "Best AI for multimodal", "Best AI for reasoning", and "Best AI for audio" leaderboards.
Current leaders
Qwen2.5-Omni-7B from Alibaba Cloud / Qwen Team currently leads the MMAU Sound leaderboard with a score of 0.679. It is currently the only model evaluated on this benchmark.
Source paper
- Title: MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark
- Authors: S Sakshi, Utkarsh Tyagi, Sonal Kumar, Ashish Seth, and 5 others
- Published: arXiv:2410.19168
Abstract
The ability to comprehend audio--which includes speech, non-speech sounds, and music--is crucial for AI agents to interact effectively with the world. We present MMAU, a novel benchmark designed to evaluate multimodal audio understanding models on tasks requiring expert-level knowledge and complex reasoning. MMAU comprises 10k carefully curated audio clips paired with human-annotated natural language questions and answers spanning speech, environmental sounds, and music. It includes information extraction and reasoning questions, requiring models to demonstrate 27 distinct skills across unique and challenging tasks. Unlike existing benchmarks, MMAU emphasizes advanced perception and reasoning with domain-specific knowledge, challenging models to tackle tasks akin to those faced by experts. We assess 18 open-source and proprietary (Large) Audio-Language Models, demonstrating the significant challenges posed by MMAU. Notably, even the most advanced Gemini Pro v1.5 achieves only 52.97% accuracy, and the state-of-the-art open-source Qwen2-Audio achieves only 52.50%, highlighting considerable room for improvement. We believe MMAU will drive the audio and multimodal research community to develop more advanced audio understanding models capable of solving complex audio tasks.