Mozilla speaks with the team behind the Data Provenance Initiative, a Data Futures Lab Awardee
In the struggle over who can train AI models and how, there’s a casualty many people don’t notice: the open web.
The web’s seemingly infinite content — a product of decades of thought and millions of minds — also provides a crucial training corpus for chatbots and other AI models. But in an effort to protect their work from unauthorized use, many content creators, from big news publications to smaller blogs, are closing down access by disallowing web crawling from their sites. This limits access for big AI companies, but also for researchers working in the public interest, archival projects that trace the history of the internet, and independent developers who deserve equitable access.
This is the thorny challenge at the core of “Consent in Crisis,” a new research paper authored by the Data Provenance Initiative (DPI), a volunteer research collective spanning more than 50 contributors and 15 countries. In their research, they reviewed changes to the Terms of Service and robots.txt files of the websites that feed popular training datasets, and found restrictions that are both rising and blunt.
(Read recent coverage of DPI's research in the New York Times.)
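For readers unfamiliar with robots.txt, the short Python sketch below shows one way such a file can be read, and why its signals might be called blunt: rules are keyed to a crawler's name, not to what the crawled data will be used for. The file contents, the site `news.example`, and the crawler names `ExampleAIBot` and `ExampleArchiveBot` are all hypothetical illustrations, not drawn from the DPI paper.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary news site (an assumption for
# illustration only; not taken from the DPI paper or any real website).
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text from a string instead of fetching it over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
ARTICLE = "https://news.example/articles/story.html"
DRAFT = "https://news.example/private/draft.html"

# The named AI crawler is blocked from the whole site.
ai_crawler_allowed = parser.can_fetch("ExampleAIBot", ARTICLE)

# Any other crawler (for example, a hypothetical archive bot) falls back to
# the "*" rules, which only block the /private/ directory.
archive_allowed = parser.can_fetch("ExampleArchiveBot", ARTICLE)
archive_private_allowed = parser.can_fetch("ExampleArchiveBot", DRAFT)

print("AI crawler may fetch article:", ai_crawler_allowed)
print("Archive crawler may fetch article:", archive_allowed)
print("Archive crawler may fetch private draft:", archive_private_allowed)
```

Nothing in this file lets the site say "archiving is fine, commercial model training is not." A site owner can only allow or block a crawler by name, so a crawler that serves several purposes is either admitted or refused for all of them.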
“We’re a growing collective who share a passion for transparency, audits, and understanding of the data that feeds AI systems,” explains Shayne Longpre, a PhD Candidate at the MIT Media Lab who leads DPI. “We developed organically, bringing together researchers interested in documenting the sources, licenses and creator composition of popular datasets. Surprisingly, this information had not been aggregated in any structured way.”
The DPI is a 2024 Data Futures Lab awardee, one of five ambitious projects building tools that address issues of transparency, privacy, bias, and agency in the data lifecycle. The Data Futures Lab is an experimental space for builders working toward a fairer data economy, and helps fuel Mozilla’s broader trustworthy AI work.
Mozilla spoke with the study leads Shayne Longpre, Robert Mahari, Ariel Lee, and Campbell Lund about the new research, what this means for the open web, and what’s next for DPI.
____
Mozilla: How might the experience of browsing the open web change based on these mounting restrictions?
DPI team: Rising restrictions are most likely to affect general-purpose AI systems as well as web archives and hybrid tools for information search and synthesis. These restrictions will have broad effects: erosion of web archives will hamper academic research and even journalism about the web (even when unrelated to AI), as well as the many uses of AI that are neither commercial nor generative.
The bluntness of existing consent signals is ultimately what makes it difficult for data creators to distinguish the uses they want to protect against from those they would be okay with.
Mozilla: As more crawling restrictions are introduced, what are the implications for competition in the AI space? Does it further centralize power for the incumbent players?
DPI team: Centralization and concentration of AI resources is absolutely a growing concern, and data plays a key role. However, there isn't a straightforward remedy, as there often exists a tension between open data access (for transparency, web archives, academic research, journalism, and AI systems) and data use consent (crediting and compensating data creators, protecting affected industries).
If crawling restrictions continue to proliferate, and are respected, then only the incumbents will have the ability to train the best models, because only they can afford to license and curate the data. Budding organizations that hope to utilize web data are forced to choose between respecting data creator consent and competing on a level playing field with the incumbent players. Meanwhile, we observe a rise in restrictive data licenses, which can make transparency efforts more challenging.
It is important that different rules apply to different types of use. Commercial AI development is distinct from non-commercial use, or from AI uses that attribute and redirect to their sources. Similarly, web archival and academic research into the web warrant different considerations because they cause little economic harm and offer clear benefits. While regulations like the EU AI Act emphasize the need for data transparency, lawmakers have limited awareness of the complexity and nuance inherent to the AI data supply chain, a gap we hope to address through our Initiative.
Mozilla: As training data becomes less openly available, what are the concrete implications for generative AI systems? e.g. Limited capability? More bias?
DPI team: Our research shows that the highest-quality, most actively maintained, and "fresh" sources (e.g. news, articles, review sites, social media) are becoming restricted at the highest rates. For models trained on data that respects these consent signals, this will limit their capabilities, especially with respect to new information, like news and current affairs. Instead, a larger portion of the data will come from sources that are less actively restricted, which make up the "rest" of the web: organization or personal websites, lower quality blogs, and e-commerce websites. From a legal perspective, websites that have paywalls in place may be in a better position to argue that using their content for training AI does not constitute “fair use,” which may create additional incentives to restrict access.
The ramifications of this are not entirely clear, but the data skew could certainly impact bias, ideology, knowledge coverage, and information freshness and quality. And loss of open-source data means models are being trained at a layer beyond public scrutiny.
Mozilla: Different types of content (text, video, audio) have different levels of restriction. What are the possible downstream impacts of this? e.g. Generative AI that excels at video, but not text?
DPI team: Whereas text content is pretty widely available across the web, a huge quantity of public video content is concentrated on just a few websites: YouTube and Vimeo. Those platforms essentially host user data, but even when users permissively license their content, the platform's terms of service can restrict third-party crawling. This concentrates power in user and social platforms that can gatekeep data and benefit from an incumbent advantage.
Mozilla: What research projects are on the horizon for the DPI?
DPI team: We plan to expand our trace of the sources underneath the data: the web, the humans, and increasingly, other machines. We are also expanding our breadth to look at more multimodal data sources. Lastly, we hope our analysis can and will shed light on future AI policy that governs data consent and access more broadly. A collaborative effort involving policymakers and AI developers is needed to develop new protocols and infrastructure that make data sourcing work effectively and ethically.
Learn more about Mozilla’s Data Futures Lab.