Profile feature of Preben Vangberg and Leena Farhat, one of the winning teams from Mozilla’s Common Voice, ‘Our Voice’ competition.

Can tech breathe new life into a language? For machine learning researchers Preben Vangberg and Leena Farhat, the simple answer is yes. They are developing speech-to-text models to build voice assistant applications and voice-enabled chatbots for the Swiss language Romansh.

But their vision is much bigger than that. Despite Romansh being one of Switzerland's official languages, barely 0.5% of Swiss residents (about 60,000 individuals) declared Romansh as one of their main languages in 2013. By teaching bots to speak it, the duo hopes to create a cultural shift in which the language becomes an integral part of day-to-day interaction.

"Language is part of people’s culture. And working towards making a cultural shift requires certain tools and infrastructure. Our starting point is creating a speech-to-text framework for Romansh and its dialects. That on its own is revolutionary, as this is one of the pioneering projects to do so," says Farhat.

While most people in Switzerland are multilingual, speaking at least one other official language (French, German, or Italian), Romansh can be considered a regional language, primarily spoken in the canton of Graubünden. It has five dialects: Sursilvan, Sutsilvan, Surmiran, Putèr and Vallader. Two of these, Sursilvan and Vallader, are available in the Common Voice data corpus.

Due to the limited resources available in the language, such as audio recordings and text, Vangberg and Farhat focused mainly on training their model using text. The team sourced old Swiss newspapers dating from 1997 to the present to build their text archive. "We wanted to do the most with the public data available," says Vangberg.

"Using texts in the two major dialects (Sursilvan and Vallader), we noticed that by training bespoke language models for the target dialect using text, we could force the output of the acoustic model to be more aligned with the orthography of the target dialect. This drastically improved the word and character error rates, to the degree that the output was actually usable!" Vangberg explains. "While this might not be the ideal case, using the existing acoustic model for a different dialect in conjunction with text data offers a good alternative for languages with limited audio archives and resources," he adds.
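The article doesn't describe the team's exact pipeline, but the general technique is well known: a dialect-specific language model trained on text alone can rescore the acoustic model's candidate transcriptions so that the output follows the target dialect's orthography. A minimal sketch, using a character bigram model with add-one smoothing and entirely made-up corpus strings and acoustic scores:

```python
import math
from collections import Counter

def train_char_bigram_lm(corpus: str):
    """Train a character bigram model with add-one smoothing
    from dialect text (e.g. a scraped newspaper archive)."""
    bigrams = Counter(zip(corpus, corpus[1:]))
    unigrams = Counter(corpus)
    vocab = len(set(corpus))

    def log_prob(text: str) -> float:
        # Sum of smoothed log-probabilities over all character bigrams.
        return sum(
            math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab))
            for a, b in zip(text, text[1:])
        )

    return log_prob

# Toy dialect corpus and acoustic-model hypotheses, for illustration only.
dialect_lm = train_char_bigram_lm("bien di bien onn biala sera " * 50)
hypotheses = [("bien di", -4.0), ("bien dy", -3.8)]  # (text, acoustic log-score)

# Rescore: combine acoustic and language-model scores, then pick the best.
lm_weight = 0.5
best, _ = max(hypotheses, key=lambda h: h[1] + lm_weight * dialect_lm(h[0]))
print(best)  # the LM steers the choice toward in-dialect spelling: "bien di"
```

Here the misspelled hypothesis "bien dy" has a slightly better acoustic score, but the language model, having never seen the bigram "dy" in the dialect text, outweighs it and selects the orthographically correct candidate.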

With a character error rate of 7.5% (where 0% would indicate a perfect transcription), applications built on their model can convert audio into text with relatively few errors.
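Character error rate is the Levenshtein edit distance between the reference and the hypothesis, computed over characters and normalised by the reference length. A minimal sketch (the sample strings are illustrative, not taken from the team's data):

```python
def character_error_rate(reference: str, hypothesis: str) -> float:
    """Character-level Levenshtein distance divided by reference length."""
    m, n = len(reference), len(hypothesis)
    # prev[j] holds the edit distance between reference[:i-1] and hypothesis[:j]
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n] / m if m else 0.0

print(character_error_rate("bun di", "bun di"))  # 0.0 — a perfect transcription
print(character_error_rate("bun di", "bon di"))  # one substitution in six characters
```

A CER of 7.5% therefore means roughly one character-level mistake for every thirteen characters of reference text.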

"Our goal is to continue training and improving the accuracy levels so that anybody can seamlessly use and build applications using our model. The Common Voice dataset has been very valuable in giving us a head start," concludes Vangberg.

Choosing to focus on minority languages has been immensely rewarding, says Farhat. "The challenges are unique: you can't compare training a model with a vast dataset like that of English and one for a low-resource language like Romansh. Finding solutions that obtain good results despite the limitations is what makes this work interesting! And the societal impact is even more significant!"