Expanded Indian-Language Data Set Mitigates Hate Speech Online in India, Elsewhere

The open-source index, built by Mozilla Data Futures Lab grantee Tattle, powers a browser extension that automatically identifies and redacts hateful words

(INDIA | APRIL 25, 2024) -- India-based organization Tattle has released an expanded data set on gendered abuse in Indian languages, meant as a resource for people and digital tools to better identify and mitigate hateful content online. Tattle is a community of technologists, researchers, and artists working towards a healthier online information ecosystem in India.

Originally built in 2022, the data set, which powers the browser extension Uli, now features more than 600 entries across Hindi, Tamil, Malayalam and Indian English. Crucially, it now also includes metadata for each entry. This context helps Uli better determine whether and how hate speech is taking place online.

The data set is available here. Read more about Tattle here.

The Uli browser plugin redacts slurs and abusive content, and enables archiving of problematic content, to collectively push back against online gender-based violence. It allows users to automatically blur offensive words; hide problematic posts in news feeds; and capture offensive tweets. The data sets driving the plugin are also used by Trust & Safety teams to detect harmful content in Indian languages.

The data set has been built over two years. Through synchronous online sessions, researchers, activists, and feminist partner organizations have annotated the data set, providing crucial context on the meaning, usage, and severity of the crowdsourced slurs in Indian languages.

Says Tarunima Prabhakar, lead researcher at Tattle: “Data sets like this are essential to a civil, inclusive internet. They power essential content moderation and safety tools, protecting marginalized communities and mitigating harassment and other harms online. Currently, the Uli data set is one of the most comprehensive, open-source lists for Indian-language content and is critical for ongoing work in AI safety.”

Dharini Priscilla, an annotator for the Tamil list in the data set, reflected on the importance of South Asian annotators in understanding not just the language but also the political, religious and cultural undertones: “In Tamil, there are so many slurs that change even from one city to another. You don’t see that so much in English. It is broader.”

Tattle is a member of the 2023 Data Futures Lab cohort, alongside four other projects building data sets for the public good. The Data Futures Lab is an experimental space for instigating new approaches to data stewardship challenges. It provides funding, scaffolding for collaboration, convening around emerging ideas, and a place to workshop approaches to data stewardship that give people greater control and agency.