Kiswahili is Africa’s most spoken language and is among the 10 most widely spoken languages in the world. With more than 200 million speakers already, interest in the language will likely increase: the United Nations recently declared 7 July World Kiswahili Language Day, making Kiswahili the first African language to be recognised in this way. Yet even as the language proliferates, its origin story is still greatly debated, and the colonial influences apparent in that story are often overlooked.

Given our efforts to build a Kiswahili dataset on Mozilla Common Voice and to train automatic speech recognition models for the language, we sought to engage a team of Kiswahili linguists and language experts. Between September and November of 2021, we organised a “Linguists’ Engagement” to interrogate whether linguistic studies of the language have codified the diversity of Kiswahili speakers. Our main intention was to learn which key factors of the language we would need to keep in mind to ensure our work is inclusive and able to serve the diversity of Kiswahili speakers that exists. It greatly surprised us that, beyond identifying accents, dialects and language variants, the colonial history of the East African coast and the history of the Swahili people and the Kiswahili language presented socio-linguistic learnings that are now core to how we view inclusive building of the Kiswahili dataset on Common Voice.

There are two main origin stories of the Kiswahili language.

The first is that Kiswahili is a pidgin, or creole, of Arabic and Bantu languages and that it came about when the Arabs came to Eastern Africa (EA) for trade purposes and began interacting with locals, who were Bantu speakers, in the 19th century. As author Francis Nesbitt writes:

Linguistic studies show that situations of contact, where two linguistic communities interact, lead to the emergence of pidgins (simplified registers) that allow the two or more distinct linguistic groups to communicate.

A further version of this theory holds that Kiswahili is a pidgin, or mixture, that also draws on several other languages, such as Portuguese, Indian and Persian, since people of these nationalities were present along the EA coast at the same time for trading purposes. The language has a significant number of borrowed words, which supports this theory: ‘meza’, the Kiswahili word for table, is of Portuguese origin, and ‘chapati’ is a Kiswahili word of Indian origin.

A second theory states that the term 'Kiswahili' is of Arabic origin, while the language itself is Bantu. According to this theory, when the Arabs came to EA and found people living along the coast, they referred to them as 'Saheel', which is Arabic for ‘the coast’, and over time this term evolved to become Kiswahili for the language and Swahili (or Waswahili in the plural) for the people. Evidence of Kiswahili as a Bantu language dates back to as early as the 2nd century AD, in a document called 'Periplus of the Erythraean Sea', written by an anonymous Greek author and detailing the early expansion of Swahili civilisations towards Somalia, Kenya and Zanzibar.

The Standard Kiswahili, or Kiswahili Sanifu, that we know today was created through the standardisation of a dialect known as Kiunguja, which originated from the Zanzibar and Pemba Islands of Tanzania. Kiunguja is one of over 23 known dialects of Kiswahili. In the book ‘Machozi Yameniishia’, the poet Mohammed Ghassani is critical of the choice of Kiunguja as the basis for Kiswahili Sanifu, and many Kiswahili writers and academics share this sentiment. The standardisation process was entirely owned by colonial authorities, without the involvement of native speakers.

In his book ‘Decolonising the Mind: The Language of African Literature’, Ngugi wa Thiong’o talks about the fact that language is an important tool, both for the coloniser and for the colonised. With the former wielding the weapon and the latter being the object, language is used as a cultural bomb, the effect of which is to:

annihilate a people's belief in their names, in their languages, in their environment, in their heritage of struggle, in their unity, in their capacities and ultimately in themselves. It makes them see their past as one wasteland of non-achievement and it makes them want to distance themselves from that wasteland. It makes them want to identify with that which is furthest removed from themselves, for instance, with other peoples’ languages rather than their own.

The efforts to standardise Kiswahili were driven by missionary groups. On one hand, German missionaries were keen on using the dialects from Mombasa (Kenya), Pate (Kenya) and Tanga (Tanzania), the areas where they were stationed. On the other hand, English missionaries were keen on using Kiunguja because it was the dialect spoken where they were stationed, in Zanzibar and neighbouring islands, and therefore the one they were most familiar with. In 1930, the Inter-territorial Language Committee chose the Zanzibari Kiswahili dialect, Kiunguja, as the source of Standard Kiswahili, a decision influenced by British colonial rule over East African territories.

To ensure the propagation of Kiswahili Sanifu, the Inter-territorial Language (Kiswahili) Committee, made up entirely of Europeans, would approve textbooks used to teach the language in schools. Textbooks were written and reviewed by Europeans and, through this, vocabulary changed, with some words being shortened and their meanings changing completely. Therefore, the more this language was standardised, the further it drifted away from what native Kiswahili speakers knew as Kiunguja. Some see the standardisation as a tool, once again for the coloniser, to massacre other dialects. Its use and calculated propagation in schools resulted in the reduced use of other related dialects and in a phenomenon known as ‘Linguistic Insecurity’. As authors Wilma Bucci and Milton Baxter write:

Linguistic insecurity is the negative self-image of a speaker regarding his or her own speech variety or language. It might happen if the speaker compares their phonetic and syntactic characteristics of speech with those characteristics of what is perceived to be correct.

Post-independence, Kiswahili Sanifu has been largely ratified by local populations. It has been used as a national language in Tanzania, Kenya, the DRC and Uganda. It is an official language of the East African Community (EAC) as well as the Southern African Development Community (SADC). In Tanzania, it is used as the medium of instruction in schools. The language has enjoyed great government support in the region, particularly in Tanzania. One of the greatest contributions of Julius Nyerere, the first president of Tanzania, was to push for the growth of Kiswahili in East and Central Africa, as he believed that it could promote African unity, as it had done in Tanzania. Kiswahili scholars in EA continue to actively grow the language, with literature departments at universities and research bodies continuing to publish new editions of Kiswahili dictionaries. Language bodies such as Baraza la Kiswahili la Taifa (BAKITA) in Tanzania and Chama cha Kiswahili cha Taifa (CHAKITA) in Kenya are responsible for the promotion of the Kiswahili language, and publishing houses, notably in Tanzania, contribute to a growing body of literary works in circulation in the language.

From linguists and language experts, we learn that building in isolation as technologists, as developers and as NLP researchers is not the right thing to do. If we proceed without conscious consideration, we risk alienating some of the populations that these digital resources should benefit.

Within the Common Voice project, our work includes the development of subsets of the Kiswahili dataset that are representative of smaller dialects and variants with the majority of the dataset constituting Kiswahili Sanifu. The main purpose of these subsets is to enable us to quantitatively evaluate how our models and downstream applications perform on demographics that speak related dialects and variants.
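As a rough illustration of what such a per-subset evaluation could look like, the sketch below computes word error rate (WER) separately for each group of test clips. It is a minimal example under stated assumptions, not our actual evaluation pipeline: the field names ("variant", "reference", "hypothesis"), the group labels and the toy sentences are all hypothetical and made up for illustration, and they are not the Common Voice schema or real model output.

```python
from collections import defaultdict


def word_edit_distance(ref, hyp):
    """Levenshtein distance between two lists of words."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer_by_group(samples):
    """Return {group: WER}, pooling errors and reference words per group."""
    errors = defaultdict(int)
    words = defaultdict(int)
    for s in samples:
        ref = s["reference"].split()
        hyp = s["hypothesis"].split()
        errors[s["variant"]] += word_edit_distance(ref, hyp)
        words[s["variant"]] += len(ref)
    return {g: errors[g] / words[g] for g in words if words[g] > 0}


# Hypothetical toy data: labels and transcripts are invented for illustration.
samples = [
    {"variant": "sanifu", "reference": "habari ya asubuhi",
     "hypothesis": "habari ya asubuhi"},
    {"variant": "sanifu", "reference": "karibu nyumbani",
     "hypothesis": "karibu nyumba"},
    {"variant": "variant_a", "reference": "habari ya asubuhi",
     "hypothesis": "habari asubuhi"},
    {"variant": "variant_a", "reference": "karibu nyumbani",
     "hypothesis": "karibu"},
]

results = wer_by_group(samples)
for group, wer in sorted(results.items()):
    print(f"{group}: {wer:.2f}")
```

In this toy data the "variant_a" group ends up with a higher WER than the "sanifu" group. A gap of that kind on real test sets is the sort of signal that would suggest performance is degraded for speakers of a particular dialect or variant.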

Language is political in its origins, connected to our colonial context, and so in building technologies for under-represented languages, we must work recognising this context. We intend for this quantification of potential bias to be a starting point, and if performance is indeed degraded for certain demographics, we would like to work to make resources available to developers, so that depending on the particular local contexts they are building applications for, they are able to fine-tune/localise so as to improve performance where necessary.

For a comprehensive write-up on our “Linguists’ Engagement” and subsequent outputs, read our paper, “Corpus Development of Kiswahili Speech Recognition Test and Evaluation sets, Preemptively Mitigating Demographic Bias Through Collaboration with Linguists”.