Towards Best Practices for Open Datasets for LLM Training

Building on community insights from 30 AI dataset experts, this research paper distills best practices for creating open datasets for LLM training. The paper is a collaboration between Mozilla and EleutherAI.
Overview
Many AI companies train large language models (LLMs) using data without copyright owners' permission, a practice permitted under specific restrictions in regions like the EU, UK, and Japan, but legally ambiguous in the U.S. Concerns from creators have spurred lawsuits and prompted organizations to limit transparency about training datasets, undermining accountability, innovation, and research. While using open access or public domain data could address these issues, no large-scale models trained on such data exist yet due to challenges like unreliable metadata, digitization costs, and the need for legal and technical expertise. Building towards a future where AI systems can be trained on openly licensed data that is responsibly curated and governed requires collaboration across legal, technical, and policy domains, along with investments in metadata standards, digitization, and fostering a culture of openness.
On June 11, 2024, Mozilla and EleutherAI convened 30 scholars and practitioners to develop normative principles and technical best practices for creating openly licensed LLM training datasets. Based on that convening, this paper outlines the challenges of producing open datasets and provides practical recommendations for sourcing, processing, governing, and releasing them. While the paper references the OSI definition, it goes further by outlining possible tiers of openness and by suggesting avenues for more ethical data governance in AI datasets.
In addition to best practices, the paper identifies a series of opportunities for policy and tech investments that would help the emerging community to overcome the outlined challenges. The paper seeks to be the foundation for shared practices and long-term goals within the emerging community around LLM data and to bring the community closer to making this technology truly open and trustworthy.