Mozilla investigates Common Crawl’s influence as a backbone for Large Language Models: its shortcomings, benefits, and implications for trustworthy AI

When OpenAI rolled out its text generator ChatGPT in 2022, few paid attention to the outsized importance of its chief training dataset, Common Crawl.

Now, Mozilla’s new study “Training Data for the Price of a Sandwich: Common Crawl’s Impact on Generative AI” shows how Common Crawl laid the infrastructural foundation that shaped today’s burgeoning generative AI boom.

Common Crawl is a small nonprofit organization that is largely unknown to the broader public, but has become pivotal to the development of generative AI as the largest freely available source of web crawl data.

The study explores Common Crawl’s role in generative AI, the benefits and risks of its popularity among AI builders, and how AI builders have often used its data uncritically. The researcher highlights Common Crawl’s approach to tackling data quality, the limitations of its data, and what builders’ dependence on Common Crawl means for furthering trustworthy AI.

Common Crawl’s massive dataset is more than 9.5 petabytes in size and makes up a significant portion of the training data for many Large Language Models (LLMs) like GPT-3, which powers the free version of ChatGPT. Over 80% of GPT-3’s tokens (the units used to represent text data) stemmed from Common Crawl. Many models published by other developers likewise rely heavily on it: the study analyzed 47 LLMs published between 2019 and October 2023 that power text generators and found that at least 64% of them were trained on Common Crawl. Its popularity among AI builders has both positive and negative implications, Mozilla’s research argues. It improved the openness and transparency of LLM research and development, but it also led to many models being trained on biased and toxic data as well as on copyrighted materials.

Most recently, Common Crawl was cited as a key player in the New York Times’ copyright infringement case against OpenAI and Microsoft. The New York Times highlighted that its content made up a significant proportion of Common Crawl’s data at the time OpenAI launched ChatGPT, and that it therefore very likely made up a significant proportion of GPT-3’s training data as well. The case adds to the growing list of copyright disputes between content producers and generative AI companies.

For the report, Mozilla researchers conducted in-depth interviews with Common Crawl’s director and main crawl engineer and analyzed online documentation and discussions from the project.

Says Stefan Baack, Research and Data Analyst at the Mozilla Foundation and the report’s author: “Common Crawl has helped to make generative AI more transparent and auditable, but it is a problematic source to train LLMs that needs to be used with care. Yet, this care is often lacking among AI builders. With our report, we highlight the consequences of using Common Crawl uncritically and show what both Common Crawl and AI builders can do to make generative AI more trustworthy.”

Key findings and recommendations from the report:

  • Common Crawl is massive, but it is not a copy of the “entire web”
    There are limitations and biases in Common Crawl’s coverage. For example, it mostly contains English pages, and its regional coverage varies. Moreover, a growing number of relevant domains like Facebook and the New York Times block Common Crawl from crawling most (or all) of their pages. Uncritically treating Common Crawl as a “copy of the web” presents a relatively small subset of primarily English web pages as representative of the entire world.

  • Common Crawl’s mission doesn’t easily align with the needs of trustworthy AI development, but AI builders often use its data without the necessary care
    Common Crawl wants its data to be useful for many different use cases, including, for example, research on hate speech. To that end, its massive datasets deliberately include problematic content. That’s why more robust data filters are necessary to weed out unwanted content and train LLMs on what content to flag or remove.
    However, filtered versions of Common Crawl popular among AI builders rely on simplistic automated filtering techniques. For example, one approach is to keep only pages from Common Crawl that are similar to a high-quality reference dataset. To determine “high quality,” AI builders have used engagement metrics like upvotes from user-generated content platforms such as Reddit. But this is inadequate for removing problematic content or bias. This approach caters only to the platform’s main audience, which is mostly white and male. As a result, a fraction of web users determines what content LLMs are trained on, which creates biased datasets that reproduce stereotypes and misrepresent the experiences of underrepresented groups. Another method used in popular filtered Common Crawl versions is to remove pages that contain any word in the “List of Dirty, Naughty, Obscene, and Otherwise Bad Words,” which is composed of popular pejoratives. However, most of these words are related to sex and pornography, and filters based on the list tend to incorrectly flag non-toxic words used by LGBTQIA+ communities, further undermining their representation.

  • Common Crawl and AI builders have a shared responsibility to make generative AI more trustworthy
    Common Crawl’s size and openness make generative AI research and development more transparent and enable researchers and smaller companies, not just big tech, to develop LLMs. However, AI builders are not necessarily as transparent about their usage of Common Crawl as they could be. For example, they sometimes do not reveal details about the filtering process. Moreover, the methods used by AI builders to filter Common Crawl for problematic content are often too simplistic to adequately address concerns about bias and toxicity.
    The report suggests that Common Crawl could better highlight the limitations and biases of its data, and be more transparent and inclusive about its governance. Moreover, it could help enforce better transparency among AI builders by requiring the use of Common Crawl to be attributed. Additionally, Common Crawl could rely less on automation for deciding what to crawl, and instead adopt a more community-oriented approach that might help to make the data more diverse in terms of whose perspectives are included.
    For AI builders, the report recommends putting more effort into detecting and removing problematic content from Common Crawl. A key problem is that the most popular filtered Common Crawl versions are not updated after their publication to take criticism into account, even though they continue to be used for years. Therefore, AI builders should endeavor to continuously improve the filtering of harmful content without reducing the representation of digitally marginalized communities. In the long term, it would be best if AI builders relied more on datasets created by humans in equitable ways.
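
To make the word-list filtering described in the second finding more concrete, here is a minimal sketch in Python. It is an illustration only, not the code used by any actual filtered Common Crawl version: the blocklist is a tiny hypothetical placeholder rather than the real list, and the sample pages and function names are invented for this example. It shows how a rule that drops any page containing a listed word can also drop a harmless page.

```python
import re

# Hypothetical placeholder blocklist for illustration only.
# It is NOT the content of the real word list mentioned in the report.
BLOCKLIST = {"sex", "badword"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def passes_blocklist(text, blocklist=BLOCKLIST):
    """Return True only if no token of the page appears in the blocklist."""
    return not any(token in blocklist for token in tokenize(text))


# Invented sample pages (assumptions, not Common Crawl data).
pages = [
    "A recipe for sourdough bread with a long fermentation.",
    "A clinic's guide to sex education and health resources for LGBTQ youth.",
    "A page that contains badword as an insult.",
]

# Keep only pages with zero blocklisted words, as in the approach described above.
kept = [page for page in pages if passes_blocklist(page)]

for page in pages:
    status = "kept" if passes_blocklist(page) else "removed"
    print(f"{status}: {page}")
```

In this sketch the health education page is removed alongside the insulting one, because the filter looks only at individual words and not at how they are used. This mirrors, in a simplified way, the over-removal problem the report describes.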

Press contacts:

Tracy Kariuki: [email protected]

Kevin Zawacki: [email protected]