The Common Voice team is honoured to announce that the 20th version of our multilingual, open speech dataset is now available.
This release adds Aragonese, IsiNdebele (sometimes also known as Southern Ndebele), Southern Sotho and Tupuri to the dataset for the first time. The dedicated language activists, translators and contributors for these new languages have done amazing work creating open speech data for their languages that anyone can build on. These additions bring the total number of languages in the Common Voice Scripted Speech dataset to 133.
This release includes contributions made through December 6th, 2024 and adds 566 new hours of speech and 515 newly validated hours of speech.
This brings the total amount of available speech data in the Common Voice dataset to 33,150 hours, of which 22,108 hours have had quality assurance (“validation”) crowdsourced through the community. This dataset is a monument to the power of community.
We’re always excited to hear feedback from contributors, dataset users and language activists. We are especially excited to learn more about what people are researching or building using the dataset. If you want to chat to us about it, you can join our new Discord community or email us at [email protected]