Preface: This op-ed includes a glossary that explains key terms used throughout the piece. If you are new to the interdisciplinary analysis of technology, language and power, I encourage you to check out the glossary.
If we don’t maintain critical friendships, we impose liberation. Whose words, whose languages are we prioritising when working towards Trustworthy AI?
At the end of February, Whose Knowledge?, the Centre for Internet & Society (India) and the Oxford Internet Institute launched the State of the Internet Languages (STIL) report. As a reviewer of the report, I was humbled and refreshed to see the stories of language communities platformed honourably, and in their mother tongues.
The STIL report challenges how power operates across the process of including and excluding spoken and signed languages on the internet. STIL contributors analyse the exclusionary aspects of language inequity online: from the Latin-centrism of UNICODE to marginalisations of queerness and disability in search engines to the lack of agency and sociolinguistic context in the development of tools, data and software.
Hear from Ishan, sharing their experience at the State of the Internet Languages Report Launch (Timestamp: 31:44)
The report also provides a quantitative picture of the language inequities on platforms that people across the world engage with. The findings that stood out to me were:
- “More than 90% of Africans need to switch to a second language in order to use the platform – which for many will mean a European-colonial language”.
- “Among the 10 most widely spoken languages, Hindi and Bengali are often less well-supported than the others, despite collectively representing a major population of about a billion people.”
- “Even within highly represented languages regions, the difference between eastern European and other European languages shows that there is potential marginalisation even within relatively well-supported regions.”
These statistics resonate strongly with the application and use of voice technologies for minoritised languages, as shown by Claudia’s contribution to the STIL. Before even asking whether Alexa or Siri works with a language, is there “even a keyboard for typing with languages characters? Not to mention more advanced technology such as machine translation” (Claudia, STIL).
Voice technology development relies heavily on access to digital media, computational power and a sociolinguistic understanding in order to build voice applications that work for people. For example, to support text corpus creation for the Common Voice dataset, some language communities have used the Wikipedia corpus, under fair use, via our sentence scraper.
Mozilla’s Common Voice project is one of many initiatives to support language diversity in digital technologies. Our crowdsourced dataset is made by real people, whose experiences are often similar to the STIL stories and analyses. For example, our Kiswahili and Kinyarwanda Common Voice fellows have shared the importance of creating speech tools that eliminate a speaker's reliance on colonial languages.
I view the State of the Internet Languages report as an invitation to critical friendship. In communities of practice, “critical friends” are people who have the agency to openly or “executively” critique the common norms, practices or behaviours of their community. Critical friendships are the relationships between two or more people or groups who maintain honest communication and a capacity to learn from each other.
Datasets are more than just artefacts for machine learning models; they quantify our livelihoods. We therefore all have a stake in engaging with the creation, application and maintenance of datasets like Common Voice.
I think of Claudia’s analysis of how technology provision is made for marginalized language communities:
“top-down by big companies, with little or no involvement of speakers’ communities. In this case, a patronizing approach can also be spotted: since very little is available, anything that is provided must be good and welcome by definition”
Claudia Soria, Decolonizing Minority Language Technology, State of Internet Languages Report 2022
This shouldn’t be the norm. Consent, resourcing and genuine autonomy should be the standard.
I encourage you to read and sit with the State of the Internet Languages report. At the end of its summary, the writers share specific actions to address the balance of power in digital language inequities. From open source projects to governments to publishers, we can all be agents of change, but what changes are we running towards, and how?
Take in every breath, every syllable and every letter of the report, as I am still doing. The journey to a healthy internet is not a sprint but a marathon, and you are not running alone.
UNICODE
“The Unicode Standard provides a unique number for every character, no matter what platform, device, application or language. As computers fundamentally deal with numbers, think 0’s and 1’s. UNICODE has been adopted by all modern software providers and now allows data to be transported through many different platforms, devices and applications without corruption” (UNICODE)
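To make “a unique number for every character” concrete, here is a minimal Python sketch. The sample characters (the letter A in the Latin, Devanagari and Bengali scripts) are illustrative choices, not examples taken from the Unicode quote or the STIL report, and the helper name `code_points` is hypothetical.

```python
# Illustrative sketch: the sample characters below are assumptions chosen
# for this example, not examples drawn from the STIL report.

def code_points(text):
    """Return each character's Unicode number in the usual U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]

samples = {
    "Latin": "A",
    "Devanagari (used for Hindi)": "\u0905",
    "Bengali": "\u0985",
}

for script, ch in samples.items():
    # Every character has one number, whatever the script; only the
    # number of bytes needed to store it in UTF-8 differs.
    print(script, ch, code_points(ch), len(ch.encode("utf-8")), "byte(s) in UTF-8")
```

Each character resolves to exactly one number regardless of script. The byte counts hint at one small, measurable way that Latin-script text is cheaper to store than text in other scripts.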
Sociolinguistics
“Sociolinguistics aims to study the effects of language use within and upon societies and the reciprocal effects of social organization and social contexts on language use” (Mallinson, 2015).
“Low and High” resourced languages
Low-resource languages and high-resource languages are contested terms. In general, they refer to the extent to which data is available for Natural Language Processing tasks. Availability also covers the process of accessing the data. For example, are tools such as search engines able to find the data in the first place? This definition is inspired by “Endangered Languages are not Low-Resourced!” by Mika Hämäläinen.
Marginalisation
The process of building and reinforcing structural processes and practices that exclude and take away power from people and communities to express or inhibit their human experiences. A marginalized language “is marginalized by historical and ongoing structures and processes of power and privilege, including colonization and capitalism, rather than by the population or the number of speakers” (STIL, Definitions)
P.S. The State of the Internet Languages report also includes definitions to help you navigate it.