VocalSound

A dataset for improving human vocal sounds recognition, containing over 21,000 crowdsourced recordings of laughter, sighs, coughs, throat clearing, sneezes, and sniffs from 3,365 unique subjects. Used for audio event classification and recognition of human non-speech vocalizations.

Qwen2.5-Omni-7B from Alibaba Cloud / Qwen Team currently leads the VocalSound leaderboard with a score of 0.939. It is the only model evaluated so far.

About this benchmark

What VocalSound measures

VocalSound is an audio benchmark that evaluates large language models on recognizing human non-speech vocal sounds. LLM Stats tracks 1 model on this benchmark, with a maximum possible score of 1. The current average across reported models is 0.939, which is also the leader's score.
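As a minimal sketch of how a score between 0 and 1 could arise on a six-class task like this one, the snippet below computes plain classification accuracy. This assumes the leaderboard score is accuracy, which the page does not state. The class names follow the dataset description, while the `accuracy` helper and the toy predictions are hypothetical and are not real VocalSound results.

```python
# The six vocal sound classes named in the dataset description.
CLASSES = ["laughter", "sigh", "cough", "throat clearing", "sneeze", "sniff"]


def accuracy(predictions, references):
    """Return the fraction of clips whose predicted class matches the label."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("need at least one labeled clip")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


# Hypothetical toy data (one clip per class), not real benchmark output.
references = ["laughter", "sigh", "cough", "sneeze", "sniff", "throat clearing"]
predictions = ["laughter", "sigh", "cough", "sneeze", "cough", "throat clearing"]

score = accuracy(predictions, references)
# Report the score both as a fraction and as a percentage,
# the two forms this page uses (for example 0.939 and 93.9%).
print(f"score = {score:.3f} ({score:.1%})")
```

Under that assumption, a score of 0.939 would mean roughly 94 of every 100 test clips were assigned the correct class.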


Publication

Paper
Vocalsound: A Dataset for Improving Human Vocal Sounds Recognition
Authors
Yuan Gong, Jin Yu, James Glass

Abstract

Recognizing human non-speech vocalizations is an important task and has broad applications such as automatic sound transcription and health condition monitoring. However, existing datasets have a relatively small number of vocal sound samples or noisy labels. As a consequence, state-of-the-art audio event classification models may not perform well in detecting human vocal sounds. To support research on building robust and accurate vocal sound recognition, we have created a VocalSound dataset consisting of over 21,000 crowdsourced recordings of laughter, sighs, coughs, throat clearing, sneezes, and sniffs from 3,365 unique subjects. Experiments show that the vocal sound recognition performance of a model can be significantly improved by 41.9% by adding VocalSound dataset to an existing dataset as training material. In addition, different from previous datasets, the VocalSound dataset contains meta information such as speaker age, gender, native language, country, and health condition.

Qwen2.5-Omni-7B (Alibaba Cloud / Qwen Team) leads with 93.9%.

Progress Over Time

[Interactive timeline of model performance on VocalSound, marking the state-of-the-art frontier and distinguishing open from proprietary models.]

VocalSound Leaderboard

1 model

Rank 1: Qwen2.5-Omni-7B (Alibaba Cloud / Qwen Team), 7B parameters, score 0.939.

FAQ

Common questions about VocalSound.

What is the VocalSound benchmark?

A dataset for improving human vocal sounds recognition, containing over 21,000 crowdsourced recordings of laughter, sighs, coughs, throat clearing, sneezes, and sniffs from 3,365 unique subjects. Used for audio event classification and recognition of human non-speech vocalizations.

What is the VocalSound leaderboard?

The VocalSound leaderboard ranks AI models by their performance on this benchmark. It currently lists 1 model: Qwen2.5-Omni-7B by Alibaba Cloud / Qwen Team, which leads with a score of 0.939. With only one entry, the average score across all models is also 0.939.

What is the highest VocalSound score?

The highest VocalSound score is 0.939, achieved by Qwen2.5-Omni-7B from Alibaba Cloud / Qwen Team.

How many models are evaluated on VocalSound?

1 model has been evaluated on the VocalSound benchmark, with 0 verified results and 1 self-reported result.

Where can I find the VocalSound paper?

The VocalSound paper is available at https://arxiv.org/abs/2205.03433. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does VocalSound cover?

VocalSound is categorized under audio. The benchmark evaluates audio models.

What is the best open-source model on VocalSound?

Qwen2.5-Omni-7B by Alibaba Cloud / Qwen Team is the top-ranked open-source model on VocalSound, with a score of 0.939 (rank #1).

How recent are the VocalSound leaderboard results?

The VocalSound leaderboard was last updated in June 2026 and currently includes 1 evaluated model.

More evaluations to explore

Related benchmarks in the same category

CoVoST2

CoVoST 2 is a large-scale multilingual speech translation corpus derived from Common Voice, covering translations from 21 languages into English and from English into 15 languages. The dataset contains 2,880 hours of speech with 78K speakers for speech translation research.

audio
4 models
MMAU

A massive multi-task audio understanding and reasoning benchmark comprising 10,000 carefully curated audio clips paired with human-annotated natural language questions spanning speech, environmental sounds, and music. Requires expert-level knowledge and complex reasoning across 27 distinct skills.

audio, multimodal
2 models
Big Bench Audio

Big Bench Audio is an audio reasoning benchmark adapted from a subset of Big Bench Hard, with text questions converted to spoken audio. It evaluates the reasoning ability of speech-to-speech and audio language models on tasks delivered as audio input, with accuracy scored by an independent evaluation (Artificial Analysis).

audio
1 model
Common Voice 15

Common Voice is a massively-multilingual collection of transcribed speech intended for speech technology research and development. Version 15.0 contains 28,750 recorded hours across 114 languages, consisting of crowdsourced voice recordings with corresponding transcriptions.

audio
1 model
CoVoST2 en-zh

CoVoST 2 English-to-Chinese subset is part of the large-scale multilingual speech translation corpus derived from Common Voice. This subset focuses specifically on English to Chinese speech translation tasks within the broader CoVoST 2 dataset.

audio
1 model
GiantSteps Tempo

A dataset for tempo estimation in electronic dance music containing 664 2-minute audio previews from Beatport, annotated from user corrections for evaluating automatic tempo estimation algorithms.

audio
1 model