The Mozilla Common Voice initiative has released a new, expanded data set featuring 16 new languages — like Basaa and Kazakh — and 4,622 new hours of speech.
Mozilla Common Voice is an open-source initiative to make voice technology more inclusive. Contributors donate speech data to a public dataset, which anyone can then use to train voice-enabled technology.
Says Hillary Juma, Common Voice Community Manager: “Internet access is increasingly mediated through speech: voice assistants and smart speakers give us directions, search for information, connect us to friends, are used in assistive technology, and much more. Yet this technology doesn’t work for millions of people. For example, neither Amazon’s Alexa, Apple’s Siri, nor Google Home supports a single native African language.”
Hillary continues: “By giving individuals the ability to share their speech, we can help ensure all communities have access to voice technology and the opportunity it unlocks.”
This new phase in Mozilla Common Voice has been made possible by a partnership with NVIDIA. As Lead Engineer Jenny Zhang explains: “This collaboration helps us tighten the feedback loop between data collection and the machine learning teams that actually use the data.”
The latest numbers
-- This latest release introduces 16 new languages to the Common Voice data set: Basaa, Slovak, Northern Kurdish, Bulgarian, Kazakh, Bashkir, Galician, Uyghur, Armenian, Belarusian, Urdu, Guarani, Serbian, Uzbek, Azerbaijani, Hausa.
-- The top five languages by total hours are English (2,630 hours), Kinyarwanda (2,260), German (1,040), Catalan (920), and Esperanto (840).
-- The languages that have increased the most by percentage are Thai (almost 20x growth, from 12 hours to 250 hours), Luganda (9x growth, from 8 hours to 80 hours), Esperanto (more than 7x growth, from 100 hours to 840 hours), and Tamil (more than 8x growth, from 24 hours to 220 hours).