Mozilla’s Common Voice Dataset Reaches 100 Languages

Widely spoken in Ghana and other West African countries, Twi is the latest addition to the Common Voice open source language dataset.

The Common Voice V10.0 corpus now contains one hundred languages and is the most multilingually diverse crowdsourced dataset in the world. The newest language on the platform, Twi, is spoken natively or as a second language by approximately 18 million people across Ghana, Benin, and the southeastern regions of the Ivory Coast in West Africa.

This milestone advances Mozilla’s Common Voice mission of making voice technology more inclusive.

We’re thrilled that Twi is the 100th language on Common Voice. The heart of this project is making it easier for language communities around the world to tap into the possibilities of speech technology — creating a healthier and more open AI ecosystem.

EM Lewis-Jong, Common Voice Product Lead

According to the State of the Internet’s Languages Report, the scant representation of African languages online continues to reinforce a form of colonial imperialism: “The vast majority of African languages are not supported as an interface language by any of the platforms we surveyed, and as a result, more than 90% of Africans need to switch to a second language in order to use the platform, which for many will mean a European-colonial language.” (The report is produced by Whose Knowledge?, the Oxford Internet Institute, and the Centre for Internet and Society.)

Moreover, African languages account for at least a third of spoken languages worldwide, yet only a handful of products support these languages — despite the majority of these languages existing in oral forms rather than in written text.

Changing this trajectory is a pool of motivated language community builders like Daniel Agyeman, a contributor to Common Voice. This endeavor is one that reconnects him with his culture: “I was born and raised in the UK but I am of Ghanaian descent,” he says. “As a Ghanaian living in the diaspora, I am drawn to activities that will help me to connect with my home country and specifically improve my Twi speaking skills. Currently, I have not come across any usage of Twi in voice technology, therefore I was very excited at the prospect of creating a Twi voice dataset which can be used to create the first ever speech recognition system in Twi.”

Daniel has been engaging family, friends, and other Ghanaians living in the diaspora to collect sentences in Twi and upload them to the Common Voice platform. So far, the Twi language community has gathered over 40,000 text sentences and is now inviting native and bilingual speakers to contribute to the initiative by donating their voices and validating other speakers’ contributions.

With the support of the Gates Foundation, NVIDIA, and GIZ, the 11th Common Voice dataset release is set to exceed 23,000 hours. The dataset is community-driven: it grows through the contributions of over 400,000 volunteers around the world.
