Training Data for the Price of a Sandwich: Common Crawl’s Impact on Generative AI

Feb. 6, 2024
AI Bias / AI Fairness, Accountability and Transparency

Overview

Common Crawl is a small nonprofit organization that has created a massive (9.5-plus petabytes), freely available archive of web crawl data dating back to 2008. This data has been valuable for many researchers, but since 2020, when OpenAI published GPT-3, the large language model (LLM) that still powers the free version of ChatGPT, Common Crawl has become one of the most important sources of training data for generative AI. However, Common Crawl’s role in generative AI has received relatively little attention so far. In this report, we look at Common Crawl in depth and ask what its popularity means for trustworthy AI.

Common Crawl primarily wants to level the playing field for technology development, not just provide data for AI training. It was founded in 2007 with the intention of mimicking the way Google crawls the web for its search engine. Its goal is to give researchers and smaller businesses access to the kinds and amounts of data that are usually available only to big tech companies like Google.

Common Crawl’s mission as an organization does not easily align with the needs of trustworthy AI development. Its guiding principle is that less curation of the provided data enables more research and innovation by downstream users. Common Crawl therefore deliberately does not remove hate speech, for example, because it wants its data to be useful for researchers studying hate speech. However, such data is undesirable when training LLMs because it might lead to harmful outputs by the resulting models.

Common Crawl does not contain the “entire web,” nor a representative sample of it. Despite its size, there are important limitations on how much of the web is covered. The crawling process is almost entirely automated to prioritize pages on domains that are frequently linked to, which makes domains related to digitally marginalized communities less likely to be included. The language and regional coverage is strongly skewed toward English content. Moreover, a growing number of relevant domains like Facebook and the New York Times block Common Crawl from crawling most (or all) of their pages.
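Sites like those mentioned above typically block crawlers through the Robots Exclusion Protocol: a `robots.txt` file listing rules for specific user agents, including Common Crawl's crawler, which identifies itself as "CCBot". The sketch below shows how such a rule works, using Python's standard-library parser; the `robots.txt` content and URLs are illustrative assumptions, not copied from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt modeled on sites that block Common Crawl's
# crawler ("CCBot") while still allowing other bots to crawl.
robots_txt = """\
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# CCBot is disallowed everywhere; other crawlers are not.
print(parser.can_fetch("CCBot", "https://example.com/article"))        # False
print(parser.can_fetch("SomeOtherBot", "https://example.com/article"))  # True
```

A well-behaved crawler like CCBot honors these rules, which is why a growing number of `robots.txt` opt-outs directly shrinks what ends up in Common Crawl's archive.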

When used as a source for AI training, Common Crawl should be handled with care, but such care is often lacking. Due to Common Crawl’s deliberate lack of curation, AI builders do not use it directly as training data for their models. Instead, builders choose from a variety of filtered Common Crawl versions to train their LLMs. However, there is a lack of reflection among AI builders about the limitations and biases of Common Crawl’s archive. Popular Common Crawl versions are especially problematic when used to train LLMs for end-user products because the filtering techniques used to create them are simplistic and often focused on removing pornography or boilerplate text like the names of navigational menu items, leaving many other types of problematic content untouched.

Common Crawl and AI builders have a shared responsibility for making generative AI more trustworthy. While Common Crawl was never primarily about providing AI training data, it now positions itself as an important building block for LLM development. However, it continues to provide a source that AI builders need to filter before model training. Both groups can help make generative AI more trustworthy in their own ways.

Common Crawl should better highlight the limitations and biases of its data and be more transparent and inclusive about its governance. It could also enforce more transparency around generative AI by requiring AI builders to attribute their usage of Common Crawl.

AI builders should put more effort into filtering out more types of problematic content and try to better take into account the various cultural contexts in which their generative AI products are deployed. There is also a need for industry standards and best practices for end-user products to reduce potential harms when using Common Crawl or similar sources for training data. In addition, AI builders should create or support dedicated intermediaries tasked with filtering Common Crawl in transparent and accountable ways that are continuously updated. Long term, there should be less reliance on sources like Common Crawl and a bigger emphasis on training generative AI on datasets created and curated by people in equitable and transparent ways.

Collaborators

J. Bob Alotta, Ziyaad Bhorat, Stella Biderman, Abeba Birhane, Maximilian Gahntz, Lisa Gutermuth, Alex Hanna, Stephen Hood, Bernard Koch, Solana Larsen, Crystal Lee, EM Lewis-Jong, Eeva Moore, Kasia Odrozek, Will Orr, Julian Posada, Kenrya Rankin, Victor Storchan, Apryl Williams.