This is a profile of Fair EVA, a Mozilla Technology Fund awardee. Fair EVA comprises open source collaborators studying bias in the design and use of voice biometrics.
You have probably heard this before: “This call may be monitored or recorded for quality assurance and training purposes.” You were probably calling the customer service center of a bank, an insurance company, or another service provider. While this might have sounded like standard procedure, what you may not have known is that your voice might have been used to identify you and authenticate your credentials.
This is known as passive voice biometrics, where a caller’s voice is used to verify their identity. Companies and call centers using this technology only require 20 seconds of casual conversation to authenticate a caller.
A contrasting case is active voice biometrics, where a user is required to say a particular phrase to activate the platform’s services. For instance, the passphrase “At Safaricom, my voice is my password” is used as a voice password to access mobile money wallet features for M-Pesa users in Kenya.
As the demand for secure digital services rises, technologists are turning to voice biometrics for user authentication. While the need to identify people has always existed, speaker and voice verification technology builds on the uniqueness of the human voice: each person's voice has a distinct profile, functioning like a fingerprint.
Voice ID provides easier access to digital platforms without the difficulty of remembering passwords. It also provides an alternative to visual interfaces, and is helpful where reading and writing literacy levels are low. Voiceprints are currently used as a home security feature, in banking applications, and in processing insurance claims. In Mexico, one of the largest financial institutions, BBVA, uses voice biometrics as proof of life for pensioners accessing their benefits.
However, Voice ID’s close relatives, speech recognition chatbots such as Siri, Alexa, and Google Assistant, are plagued with tremendous racial bias, particularly in detecting accents and the speech of non-English speakers. Could this bias also find its way into voice biometric services? Wiebke Toussaint Hutiri, project lead at Fair EVA, predicts a worrying trend that could further exclude marginalized users from services secured with voice biometrics, increase the likelihood of intrusion, or deepen surveillance and other discriminatory profiling.
Hutiri views bias as an extension of unfair treatment, “…that results in disparate technology performance for different users. This type of bias may have different reasons for occurring, but a particular stubborn factor is during machine learning development. While these effects are largely unintended, they can still result in some groups of people experiencing disproportionate harm.”
She adds, “What’s interesting in this technology pipeline is that voice becomes the first access point, [and with it] a double-edged sword. On one hand, we can see it as a security feature that provides intrusion detection - protecting unauthorized people from access, but it can also deny entry to persons who should but do not get identified correctly. Bias is at least partially preventable, which is why we are looking at ways of identifying and addressing it in voice biometrics.”
Moreover, she is keen to distinguish between bias and discrimination. She defines discrimination as the intentional act of exercising prejudice, “[...] stereotyping based on sensitive or protected attributes of individuals.” Highly alarming is that voice biometrics is also being used unscrupulously to heighten surveillance in some prisons in the USA, where phone calls between inmates and their loved ones are recorded, stored, and analyzed under the cover of providing additional security features. This data can be used to identify children, families, friends, or other members of inmates’ external social networks.
How accurate is voice biometrics in identifying a user, and are inaccuracies in identifying users signs of bias? Hutiri says the telltale signs emerge when answering these questions: “For whom are these technologies designed for, and which assumptions have been made in the design process?”
For whom: This critical step of the equation requires a demographic audit of where these products are currently developed and where the applications are being deployed. Fair EVA’s team is currently creating a database of vendors and service providers who are using these technologies to develop voice ID applications. “...Just trying to do that tracing is already challenging,” Hutiri explains. “The other hurdle is that voice technology in itself is very application-specific. What assumptions or specificities have the technology designers incorporated into the applications to cater to voice changes? For instance; race, gender, age group, sickness, or even mood?”
Because different applications carry different designs, Hutiri’s team is critical of a one-size-fits-all approach to studying bias. “Voice biometrics bias is innately context and application-specific. If we test the technology without knowing which scenarios have been considered by the developers, then we fall into an assumptions pitfall.” Instead, she proposes “...product design transparency and accessibility of test evaluations.”
For instance, “In a case where we don’t know how the technology was tested - but the product records a 98% accuracy, it’s not clear whether the accuracy is in keeping intruders off the platform or the accuracy is in proper identification of users. There's a trade-off between the two, and errors in both cases carry consequences, with a potential likelihood of bias occurring in either.”
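The trade-off Hutiri describes can be illustrated with a minimal sketch. It assumes a verification system that outputs a similarity score and accepts a speaker when the score reaches a threshold; the function name `error_rates`, the scores, and the thresholds are all made up for illustration and do not come from Fair EVA or any vendor. Raising the threshold keeps more intruders out but rejects more genuine users, and a single accuracy figure does not say which error dominates.

```python
from typing import List, Tuple


def error_rates(
    scores: List[float], is_genuine: List[bool], threshold: float
) -> Tuple[float, float]:
    """Return (false_accept_rate, false_reject_rate) at a decision threshold.

    A trial is accepted when its score is >= threshold.
    """
    impostor = [s for s, g in zip(scores, is_genuine) if not g]
    genuine = [s for s, g in zip(scores, is_genuine) if g]
    # Intruders wrongly let in, as a share of all impostor trials.
    far = sum(s >= threshold for s in impostor) / len(impostor)
    # Legitimate users wrongly kept out, as a share of all genuine trials.
    frr = sum(s < threshold for s in genuine) / len(genuine)
    return far, frr


# Illustrative, made-up scores: four genuine trials, then four impostor trials.
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
is_genuine = [True, True, True, True, False, False, False, False]

for threshold in (0.35, 0.5, 0.65):
    far, frr = error_rates(scores, is_genuine, threshold)
    print(f"threshold={threshold}: false accepts={far:.2f}, false rejects={frr:.2f}")
```

In this toy data, the lowest threshold admits an impostor but rejects no genuine user, while the highest threshold does the reverse.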
To evaluate bias, Hutiri’s team is creating a toolbox with guidelines for constructing context-specific evaluation data sets, and developing a Python library to audit bias during the development of voice biometrics. “We realized that we cannot create a data set that’s going to cover every possible scenario, but instead, direct how evaluation should happen. We can then use insights from developing the evaluation guideline to propose reporting standards for how technology performance should be communicated to consumers,” Hutiri explains.
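As a rough idea of what auditing bias during development could involve, the sketch below breaks one error rate down by speaker group instead of reporting a single aggregate number. It is not the Fair EVA library or its API: the `Trial` record, the `false_reject_rate_by_group` function, the group labels, and the scores are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple


class Trial(NamedTuple):
    group: str        # hypothetical speaker-group label
    score: float      # similarity score from a verification system
    is_genuine: bool  # True if the speaker really is the claimed user


def false_reject_rate_by_group(
    trials: Iterable[Trial], threshold: float
) -> Dict[str, float]:
    """Share of genuine trials rejected (score < threshold), per group."""
    rejected: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for trial in trials:
        if trial.is_genuine:
            total[trial.group] += 1
            if trial.score < threshold:
                rejected[trial.group] += 1
    return {group: rejected[group] / total[group] for group in total}


# Made-up trials for two hypothetical groups.
trials = [
    Trial("group_a", 0.9, True), Trial("group_a", 0.8, True),
    Trial("group_a", 0.7, True), Trial("group_a", 0.6, True),
    Trial("group_b", 0.9, True), Trial("group_b", 0.55, True),
    Trial("group_b", 0.4, True), Trial("group_b", 0.3, True),
    Trial("group_a", 0.2, False), Trial("group_b", 0.1, False),
]

rates = false_reject_rate_by_group(trials, threshold=0.5)
for group, rate in sorted(rates.items()):
    print(f"{group}: false reject rate = {rate:.2f}")
```

With these toy numbers, the system rejects 2 of 8 genuine trials overall, but all of the rejections fall on one group, which is the kind of disparity an aggregate figure would hide.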
The team is currently developing a public call for donations of speech data collected by Google Assistant, and plans to use the donated data to develop an evaluation data set.