
VocalSound

A dataset for improving the recognition of human vocal sounds, containing over 21,000 crowdsourced recordings of laughter, sighs, coughs, throat clearing, sneezes, and sniffs from 3,365 unique subjects. It is used for audio event classification and the recognition of human non-speech vocalizations.
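Since VocalSound is a single-label classification task over six vocal sound classes, evaluation typically reduces to scoring predicted labels against ground truth. The sketch below is a hypothetical illustration (the class list comes from the dataset description above; the function and variable names are assumptions, not the official evaluation code):

```python
# Minimal sketch of scoring a classifier on VocalSound's six classes.
# The class names follow the dataset description; everything else here
# (function names, label format) is an illustrative assumption.

VOCALSOUND_CLASSES = [
    "laughter", "sigh", "cough", "throat_clearing", "sneeze", "sniff",
]

def accuracy(y_true, y_pred):
    """Fraction of correctly classified clips, the usual metric for
    single-label audio event classification."""
    if len(y_true) != len(y_pred):
        raise ValueError("prediction/label length mismatch")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

if __name__ == "__main__":
    # Toy example with made-up labels: three of four clips correct.
    truth = ["cough", "laughter", "sniff", "sigh"]
    preds = ["cough", "laughter", "sneeze", "sigh"]
    print(f"accuracy = {accuracy(truth, preds):.2f}")
```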

Paper: https://arxiv.org/abs/2205.03433

VocalSound Leaderboard (1 model)

Rank  Model            Organization               Params  Score
1     Qwen2.5-Omni-7B  Alibaba Cloud / Qwen Team  7B      0.939

FAQ

Common questions about VocalSound

What is VocalSound?
VocalSound is a dataset for improving the recognition of human vocal sounds, containing over 21,000 crowdsourced recordings of laughter, sighs, coughs, throat clearing, sneezes, and sniffs from 3,365 unique subjects. It is used for audio event classification and the recognition of human non-speech vocalizations.

Where can I find the VocalSound paper?
The VocalSound paper is available at https://arxiv.org/abs/2205.03433. It provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.

Which model leads the VocalSound leaderboard?
The VocalSound leaderboard currently lists a single AI model: Qwen2.5-Omni-7B by Alibaba Cloud / Qwen Team leads with a score of 0.939, which is therefore also the average score across all listed models.

What is the highest VocalSound score?
The highest VocalSound score is 0.939, achieved by Qwen2.5-Omni-7B from Alibaba Cloud / Qwen Team.

How many models have been evaluated on VocalSound?
One model has been evaluated on the VocalSound benchmark, with 0 verified results and 1 self-reported result.

What category does VocalSound belong to?
VocalSound is categorized under audio; the benchmark evaluates audio models.