Mozilla Common Voice is an open-source initiative to make voice technology more inclusive. Contributors donate speech data to a public dataset, which anyone can then use to train voice-enabled technology. Voice technology is no longer just the remit of smart speakers: banking, government services and health tech are all increasingly voice-operated. If we want to make sure nobody is left behind, projects like Common Voice are essential.
Common Voice 8 is the most diverse multilingual open speech corpus in the world. This is the largest release yet, thanks to a growing, committed community and multi-sector resourcing from partners such as the Gates Foundation, NVIDIA and GIZ. The dataset now comprises 18,000 hours of speech and 13 million voice clips, generated entirely by 200,000+ volunteer contributors around the world.
New languages in Common Voice 8 include Igbo, Marathi, Danish, Norwegian Nynorsk, Central Kurdish, Malayalam, Swahili, Erzya, Moksha, Macedonian and Santali (Ol Chiki).
Our communities of contributors around the world have collaborated with, inspired and supported people in our crowdsourcing efforts to make this dataset possible. Each member brings a unique, lived perspective on their language and its cultural context.
As part of this dataset release we would like to highlight the contributions of: the Common Voice Language Reps, Chris Chinenye Emezue, Joan Montané and Nart for exceptional sentence collection efforts via the CC0 process; Bülent Özden for community building in the Turkish community; and Stefania Deleprete for their Common Voice advocacy efforts. We would also like to congratulate the Uzbek, Luganda, Serbian, Hausa, Belarusian and Abkhaz communities on their amazing growth.
Partners like NVIDIA use the data to fuel exciting open source innovation projects. Research Scientist Vitaly Lavrukhin says: “The latest release of Mozilla Common Voice is a great thing for the research communities. The data continues to be a core component of NVIDIA’s open source NeMo Automatic Speech Recognition models and we congratulate the team on significant growth to the dataset. NVIDIA will also release data preprocessing scripts in NeMo to facilitate reproducibility of research.”
The collaborative support of the Gates Foundation, GIZ and FCDO in growing digital innovation to address inequality in East Africa through voice technology is also bearing fruit: Swahili has hit 500 hours in a matter of months. This is thanks to the work of amazing community fellows Britone Mwasaru (Kenya) and Rebecca Ryakitimbo (DRC/Tanzania) and machine learning fellow Kathleen Siminyu (Kenya).
The Common Voice dataset is free to download.