Mozilla Foundation - Breaking Bias: Diverse Speech Training? For Virtual Assistants, It’s Virtually Non-Existent

“Alexa, set a timer for 10 minutes.” Ten years ago this sentence wouldn’t have been considered an interaction with a computer, but nowadays it is. Virtual assistants and voice-based interactions with artificial intelligence have become commonplace. It isn’t just Amazon’s Alexa, which has somehow made its way into vacuums and refrigerators. Apple’s Siri, Google’s assistant, music apps, luxury cars, even the automated system you hear when calling your bank or tech support all rely on users’ voices to submit queries, navigate menus and, ultimately, complete tasks.

Here’s the tricky part: unlike smart speakers, every person sounds different when they speak.

Not Seen, Not Heard

Society is full of biases, whether we’re all fully aware of them or not. A growing problem comes when we (society) inadvertently inject those biases into the tech we (again, society) create. When Google mistakenly assigns the label “gorilla” to Black people in photos, it’s a reminder of the importance of diversity in the training data we use when we teach AI about what it’s seeing or hearing. It’s also a reminder of who needs a seat at the table as we craft these tools.

Which brings us to virtual assistants. When Amazon instructs Alexa on how to parse phrases like, “Alexa, set a timer for 10 minutes,” what training data does it use? What accents were part of the lesson? Which languages?

This is something that artist and engineer Johann Diedrick often thinks about, especially when he sees his family interact with voice-based AI. “Both of my parents are Jamaican, and what brought me into this space was overhearing my parents, specifically my dad, speaking to Alexa and seeing how it didn’t recognize his accented speech,” said Diedrick during a recent Mozilla Dialogues and Debates panel. Diedrick is the director of engineering at Somewhere Good and a 2021 Mozilla Creative Media Award recipient.

“It dawned on me that we talk about these technologies working with English, but English is a spectrum,” said Diedrick. His dad, for example, changes his accent to help Alexa understand him better, something Diedrick likened to code-switching: the act of temporarily changing how one expresses their culture when around others in order to be better understood, to appear non-threatening or for a number of other reasons. “In the external world this happens for white ears, and now we’re seeing it happen for AI ears,” said Diedrick.

Whose Voice Gets Heard?

Those who speak English with accents that Siri or Alexa aren’t accustomed to can attempt to change how they speak, but not everyone can so easily switch to Siri-friendly speech. Kọ́lá Túbọ̀sún, a writer and linguist formerly with Google’s natural language processing team, notes that things are tougher still for those who speak Yorùbá, for example, a language used predominantly in West Africa. “Siri exists in Swedish, Norwegian, Icelandic and Scandinavian languages but it doesn’t exist in Yorùbá, a single language spoken by over 30 million people,” Túbọ̀sún said in a recent online discussion. While Siri is available in those smaller European languages, it’s not available for many African languages used by millions of people. “It is about who is at the table deciding why we may need this service.”

Other experts have weighed in on this subject too. The New York Times has noted how studies show a large “race gap” in how well speech recognition systems interpret similar phrases said by Black users and white users, a chasm caused by training data. Meanwhile, Scientific American spoke with experts who blame training data for virtual assistants’ limited scope. As Safiya Noble, Associate Professor in the Departments of Information Studies and African American Studies at UCLA and author of the best-selling Algorithms of Oppression, points out: “Certain words mean certain things when certain bodies say them, and these [speech] recognition systems really don't account for a lot of that.”

So what’s being done to curb this? Diedrick and Túbọ̀sún are working on diagnosing and addressing the problem. There’s Diedrick’s Dark Matters project, which lets you assume the role of a machine-learning researcher training voice AI. Túbọ̀sún’s YorubaName.com is an open source directory of crowdsourced name pronunciations, offering developers creating text-to-speech systems a reference point should they need it.

Another platform working to address bias in voice recognition tech is Mozilla Common Voice. The team’s work centers around creating a crowdsourced dataset that lets anyone contribute their voice to help train voice AI in the future. “It’s disempowering that many people have to mask their accent or speak in a second language to interact and engage with most commercial voice recognition tools,” says Hillary Juma, Common Voice’s Community Manager. Since the project’s launch in 2017, it has grown to incorporate over 80 languages, most recently expanding to cover more African languages.

“Some native African languages that we’ve launched for voice data collection include Bassa, Kiswahili, Kinyarwanda and Luganda,” says Hillary. “There are also new languages in the process of being launched through the efforts of Common Voice contributors and language communities, including Igbo, Twi and Somali.”

Without support for African languages, voice assistants like Siri and Alexa leave out a large segment of the global population, and set implicit standards for who it’s acceptable to exclude when launching a voice-based product. “What language hierarchies are we reinforcing if we don’t design them for linguistic diversity?” asks Hillary.

The lack of diversity in training data for voice-based computer systems will not be solved overnight. But it’s also not solely a technological problem: “One of the biggest misconceptions about technology is that technical skills are the only skills that can influence a product,” says Hillary. She notes that expertise in accessibility, equity and language is equally important, as are relational skills like communication and policy-making. Projects like Dark Matters, YorubaName.com and Common Voice offer pathways to lift every voice, making sure everyone can be heard no matter how they sound.