The larger the multimodal training dataset, the more likely the model is to classify people of color as ‘criminals,’ research reveals

(DUBLIN, IRELAND | MONDAY, MAY 20, 2024) — In the rush to scale the datasets that train generative AI models, AI developers are also disproportionately scaling racism, according to a new investigation by Mozilla Senior Advisor Dr. Abeba Birhane and three fellow researchers.

The research — titled “The Dark Side of Dataset Scaling” — reveals that as multimodal training datasets increase in size, the probability of associated models misclassifying Black and Latino individuals as “criminals” also increases. To conduct the investigation, researchers evaluated 14 different visio-linguistic models trained on the LAION-400M and LAION-2B datasets, measuring how often the models applied tags like “thief,” “criminal,” and “suspicious person” to images of people in the Chicago Face Dataset.
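(For technically minded readers: the evaluation described above follows the standard zero-shot classification setup used with CLIP-style models. The sketch below is a minimal illustration of that general technique using the open-source open_clip library; the label list, prompt wording, image path, and pretrained-weight tags are illustrative assumptions, not the authors’ exact evaluation code.)

```python
# Minimal sketch of a CLIP-style zero-shot label probe (illustrative only;
# the label set, prompt template, and checkpoint tags are assumptions,
# not the paper's exact protocol).
import torch
import open_clip
from PIL import Image

# ViT-L/14 weights trained on LAION-2B; a LAION-400M checkpoint
# (e.g. "laion400m_e32") could be loaded the same way to compare scales.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-L-14", pretrained="laion2b_s32b_b82k"
)
tokenizer = open_clip.get_tokenizer("ViT-L-14")
model.eval()

labels = ["criminal", "thief", "suspicious person", "doctor", "teacher"]
text = tokenizer([f"a photo of a {label}" for label in labels])

# Hypothetical path to one face image (e.g. from the Chicago Face Dataset).
image = preprocess(Image.open("cfd_face.jpg")).unsqueeze(0)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    # Softmax over the label set gives the probability mass assigned to each tag.
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```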

Says Dr. Abeba Birhane: “The AI industry has devoted little attention to the downstream impacts of scaling training datasets — and our research demonstrates this is a dangerous oversight. As the multimodal datasets that power generative AI models grow larger, they are disproportionately more likely to have deeply harmful impacts, like dehumanizing and criminalizing Black and brown individuals.”

When the datasets powering the larger ViT-L models were scaled from 400 million samples to 2 billion samples, the probability of the models predicting an image of a Black man or a Latino man as “criminal” increased by 65% and 69%, respectively.

Researchers also found that as the datasets were scaled, the models’ probability of misidentifying Black faces as “animal,” “gorilla,” “chimpanzee,” or “orangutan” decreased, a failure for which earlier AI technologies have faced sharp criticism. This relationship shows the complex factors (including training data and model patch size) that shape model performance: mitigating one issue can worsen another.

In addition to the quantitative findings, the authors place their results in a cultural and historical context. They draw direct connections between AI models’ racist outputs and the dehumanization and criminalization of Black bodies through slavery, legal segregation, and mass incarceration.

The authors also offer several recommendations to the machine learning community, stressing the dire need for open access to training datasets and warning of the perils of computer vision overlapping with physiognomy.

The paper has been accepted for publication at the ACM Conference on Fairness, Accountability, and Transparency (FAccT) and is co-authored by Sepehr Dehdashtian (Michigan State University), Vinay Uday Prabhu (HAL51 Inc.), and Vishnu Boddeti (Michigan State University). In July 2023, Dr. Birhane published similar research revealing that scaling training datasets disproportionately increases the biased and discriminatory content within them.

--

Press contact: Kevin Zawacki, [email protected]