Common Voice is delighted to announce that our 18th dataset release is now available for download. As part of our commitment to helping make voice technologies more accessible, we release a cost- and copyright-free dataset of multilingual voice clips and associated text data under a CC0 license. The dataset is a community effort, driven by the voice and text contributors, language activists, technologists, academics and other community members who make up Common Voice.

Common Voice 18.0 stats

The Common Voice dataset has grown to a staggering 31,841 hours of speech data, of which 20,789 hours are community-validated. That is almost 700 more hours in total, and 381 more validated hours, than in the last dataset release. The Common Voice 18 dataset is made up of clips from 129 languages, with 5 new languages joining in this release.
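For readers who want to put these figures in proportion, here is a minimal sketch in Python that works out the validated share of the dataset and the validated total implied for the previous release. The numbers are the ones quoted above; the function and constant names are just illustrative, and the previous-release figure is inferred from the stated increase rather than taken from an official count.

```python
# Figures quoted in this announcement (hours of speech data).
TOTAL_HOURS = 31_841
VALIDATED_HOURS = 20_789
NEWLY_VALIDATED_HOURS = 381


def validated_share(validated: int, total: int) -> float:
    """Return validated hours as a fraction of all recorded hours."""
    return validated / total


def previous_validated(validated: int, newly_validated: int) -> int:
    """Back out the previous release's validated hours from the stated increase."""
    return validated - newly_validated


if __name__ == "__main__":
    share = validated_share(VALIDATED_HOURS, TOTAL_HOURS)
    implied = previous_validated(VALIDATED_HOURS, NEWLY_VALIDATED_HOURS)
    print(f"Validated share of Common Voice 18: {share:.1%}")
    print(f"Implied validated hours in the previous release: {implied:,}")
```

By this arithmetic, roughly two thirds of the recorded hours have been validated by the community so far, which means there is still plenty of listening and validating to do.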

New languages joining Common Voice

We’re so excited to have five new languages join the Common Voice dataset and community. Xhosa, Kalenjin, Kidaw'ida, Dholuo and Setswana are available in Common Voice 18. These languages are used by millions of people around the world who can now be better supported in voice technologies.

Be a part of Common Voice 19 and beyond

If you’re excited about Common Voice, there are many ways to join the contributor community. Sharing your voice, or writing and contributing original sentences in your language, helps build the next dataset. If your language isn’t yet on Common Voice, you can request its addition with this form. We also welcome technical contributions to our open source project on GitHub.

Feedback

We’re always excited to hear what you think of the new releases. You can reach us on the Common Voice forums, chat with us on Matrix or email the team directly at [email protected]. We’re especially interested in learning more about what dataset users are building or exploring with the dataset. Understanding your needs better helps us set a direction that supports them.