Towards Best Practices for Open Datasets for LLM Training

Jan. 13, 2025
Openness and AI / AI fairness, accountability, and transparency

Overview

Many AI companies train large language models (LLMs) using data without copyright owners' permission, a practice permitted under specific restrictions in regions like the EU, UK, and Japan, but legally ambiguous in the U.S. Concerns from creators have spurred lawsuits and prompted organizations to limit transparency about training datasets, undermining accountability, innovation, and research. While using open access or public domain data could address these issues, no large-scale models trained on such data exist yet due to challenges like unreliable metadata, digitization costs, and the need for legal and technical expertise. Building towards a future where AI systems can be trained on openly licensed data that is responsibly curated and governed requires collaboration across legal, technical, and policy domains, along with investments in metadata standards, digitization, and fostering a culture of openness.

On June 11, 2024, Mozilla and EleutherAI convened 30 scholars and practitioners to create normative principles and technical best practices for creating openly licensed LLM training datasets. Based on that convening, this paper outlines the challenges of producing open datasets and provides practical recommendations for sourcing, processing, governing, and releasing them. While the paper references the OSI definition, it goes further by outlining possible tiers of openness and by suggesting avenues for more ethical data governance in AI datasets.

In addition to best practices, the paper identifies a series of opportunities for policy and technology investments that would help the emerging community overcome the outlined challenges. The paper seeks to be the foundation for shared practices and long-term goals within the emerging community around LLM data and to bring the community closer to making this technology truly open and trustworthy.

Collaborators

Mitchell Baker, Ayah Bdeir, Julie Belião, Jillian Bommarito, Kasia Chmielinski, Jennifer Ding, Marzieh Fadaee, Maximilian Gahntz, Lisa Gutermuth, Paul Keller, Hynek Kydlíček, Pierre-Carl Langlais, Solana Larsen, Greg Leppert, EM Lewis-Jong, Greg Lindahl, Shayne Longpre, Angela Oduor Lungati, Nik Marda, Cullen Miller, Victor Miller, Max Ryabinin, Maarten Van Segbroeck, Kathleen Siminyu, Mark Surman, Anna Tumadóttir, Jennifer Wang, Maurice Weber, Rebecca Weiss, Leandro von Werra, Lee White, Thomas Wolf