Generative AI models are trained on terabytes of web crawl data from across the internet. One of the most popular sources of training data is Common Crawl, a massive archive of web crawl data maintained by a small nonprofit. Mozilla’s latest investigation shows that Common Crawl has helped make generative AI development more transparent and competitive. But because the data reflects the internet’s biases and contains toxic and harmful content, AI builders need to be transparent about how they use it. Yet we don’t even know whether big AI companies like Microsoft, Google, or Meta use Common Crawl to train their AI products, let alone how they filter out harmful content.
When it comes to building trustworthy AI products, better is possible. We need to know the totality of how AI is trained so we understand its risks and limitations — and, most importantly, what needs to be improved to make it trustworthy and helpful for everyone on the internet.
Sign Mozilla’s petition and tell OpenAI, Google, Microsoft, and Meta to provide transparency about the data used to train their AI tools!