Mozilla’s mission is clear: to ensure that the Internet is a global public resource, open and accessible to all. Increasingly, we use voice as a key interface to the Internet. It’s now as natural for us to ask our devices a question as it is to type a query into the search bar.
But only if we speak a language that’s well recognized, like English.
Most cloud-based voice companies only provide support for languages that are spoken in affluent countries. For example, there is poor voice support for African languages. In line with its mission to create an Internet accessible to everyone, Mozilla supports changing the voice landscape. Technologies like Common Voice and DeepSpeech are helping democratize voice technology by making it easier to collect voice data and create speech recognition applications in more languages.
But creating a voice dataset is no easy task. It requires a mix of technical excellence and strong community partnerships.
And that’s exactly what Didier Mugwaneza and Josh Meyer delivered recently for the Kinyarwanda language. Didier is a Machine Learning Engineer with Digital Umuganda, a machine learning-focused startup created from a Mozilla hackathon in 2018, and Josh is a Mozilla Fellow. Like many African languages, Kinyarwanda has no voice technology support, even though it’s spoken by over 12 million people in Rwanda. This creates significant barriers to building voice-enabled services and applications for the groups who may need them the most, particularly during the coronavirus pandemic. For this reason, the German development agency GIZ decided to fund the project as part of its “Fair Forward” mission of delivering fair artificial intelligence for everyone.
So what did it take to create the first Kinyarwanda speech recognition engine in the world?
The first step in creating a speech recognition model in a new language was to gather written phrases for voice donors to say. These were then uploaded to the Common Voice platform, which is designed to easily collect voice samples from a desktop or mobile device. Next, voice donors were recruited to read the written phrases aloud into the Common Voice platform. Digital Umuganda provided training to voice donors, which improved the quality of the samples that were captured.
Digital Umuganda’s success in collecting Kinyarwanda samples is evident through Common Voice’s metrics - the volume of data in Kinyarwanda is second only to English. This was achieved through building productive partnerships on the ground in Rwanda. Didier explains:
“We partnered with the Rwanda Information Society Authority (RISA), who provided a lot of help in collecting data. We also worked with the business ecosystem in Rwanda. They are really supportive of this project because there is high demand from companies for speech technology.”
The strong attention to quality and the large volume of voice samples proved invaluable during the next phase - training a speech recognition model. Here, Josh fed the voice samples from Common Voice into Mozilla’s DeepSpeech engine, which created a trained model.
“There were no major hurdles to training the Kinyarwanda speech model because of the work Digital Umuganda did. They did an excellent job energizing the community, and providing them with the know-how to donate high quality data for speech recognition,” says Josh.
Once the model had been trained, it was evaluated to ensure that it recognized women as well as it recognized men. Because Digital Umuganda had nearly equal voice contributions from men and women, this wasn’t an issue.
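To make the idea of this evaluation concrete, here is a minimal sketch of one common way to compare recognition quality across speaker groups: computing word error rate (WER) separately for each group. This is an illustration under stated assumptions, not the procedure the team actually used. The function names `word_errors` and `wer_by_group`, the group labels, and the placeholder transcripts are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def word_errors(reference: str, hypothesis: str) -> Tuple[int, int]:
    """Return (word-level edit distance, number of reference words)."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Dynamic-programming Levenshtein distance computed over words,
    # keeping only the previous row to save memory.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1], len(ref)


def wer_by_group(samples: Iterable[Tuple[str, str, str]]) -> Dict[str, float]:
    """Compute WER per group from (group, reference, hypothesis) triples."""
    errors: Dict[str, int] = defaultdict(int)
    words: Dict[str, int] = defaultdict(int)
    for group, reference, hypothesis in samples:
        e, n = word_errors(reference, hypothesis)
        errors[group] += e
        words[group] += n
    # Skip groups with no reference words to avoid dividing by zero.
    return {g: errors[g] / words[g] for g in words if words[g] > 0}


if __name__ == "__main__":
    # Placeholder transcripts, not real Kinyarwanda data.
    samples = [
        ("female", "a b c d", "a b c d"),
        ("female", "a b c d", "a x c d"),
        ("male", "a b c d", "a b c"),
        ("male", "a b", "a b"),
    ]
    for group, wer in sorted(wer_by_group(samples).items()):
        print(f"{group}: WER = {wer:.3f}")
```

If the per-group error rates come out close to each other, the model is treating the groups similarly. A large gap would suggest that one group is underrepresented in the training data.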
The voice models developed by Digital Umuganda using Common Voice data are open source. This means that they are freely available for developers to use as “building blocks” in voice-enabled technologies: for example, adding voice to chatbots, building interactive voice response (IVR) systems, and delivering voice-enabled government services, particularly where literacy is a challenge. The team has been actively engaging in AI meetups and events to support the technical implementation of these “building blocks”.
As we can see, technical excellence and a focus on productive partnerships have combined to produce outstanding results for Kinyarwanda speakers, African voice developers and for advancing Mozilla’s mission of a free and open internet for everyone.
You can get involved with Common Voice by checking out demo mode, downloading the dataset and training your own speech recognition model with DeepSpeech, joining in the conversation on Discourse, or contributing to the Common Voice platform code.