DFL

In 2024, Mozilla’s Data Futures Lab is hosting a speaker series exploring a more equitable data ecosystem in the era of generative AI. We’ll feature builders, legal experts, and researchers who identify issues and propose concrete solutions.

Read the full schedule — and register — below.

Right now, AI models trained on large swaths of data across the internet – such as ChatGPT, DALL-E, and Midjourney – are built on extractive methods. They rely on data from individuals, communities, and creators without their knowledge, without their consent, and without attribution. Those people also have no opportunity to benefit from the profits made, and may even end up paying for services created with their own data inputs.

There has been some progress toward making these methods more equitable through litigation, policy, and technical interventions. But we ultimately remain in a period of suspension, waiting for resolute solutions as the very landscape of the internet shifts.

Our speaker series will highlight the people, projects, and ideas building a more equitable data ecosystem in the era of generative AI.

~

The Data Futures Lab highlights, supports, and connects initiatives that take alternative approaches to AI at the data level. We focus on projects that shift power from the hands of a few corporate actors to those from whom the data derives and whom it most impacts.

The Program:

Where is All This Data Coming From?

January 22 (11am EST / 5pm CET): Shayne Longpre, Naana Obeng-Marnu, and William Brannon, three core contributors to the Data Provenance Initiative, will present their work mapping 2,000+ popular text-to-text finetuning datasets from origin to creation. They catalog each dataset's sources, licenses, creators, and other metadata for researchers and builders to explore. Register here.

Who is Using My Data?

February 20 (11am EST / 5pm CET): Cullen Miller, VP of Policy at Spawning, will present their work building some of the only tools that enable creatives to determine whether their work is part of a training dataset (Have I been trained), opt out (ai.txt), and identify active web scrapers and reject or misdirect all of their requests (Kudurru). Register here, or join the watch party of the livestream in the MozFest Discord.

Is Using that Data Even Legal?

March 18 (11am EST / 4pm CET): Chris Bavitz of the Berkman Klein Center for Internet & Society, Beatriz Busaniche of Vía Libre Foundation and Creative Commons Argentina, and Dr. Melissa Omino of the Center for Intellectual Property and Information Technology Law (CIPIT) at Strathmore University will discuss the challenges of existing laws and licenses in the context of data acquisition for training generative AI systems. Register here.

April: TBD

Is There a Better Way to Govern All This Data?

May 13 (11am EST / 5pm CET): Common Voice is the world’s largest crowd-sourced multilingual open speech corpus. To date, all of its data has been released under a CC0 license, but the team is working through a collaborative process with data creators to understand how it might offer alternative governance pathways in the future. Product Director EM Lewis-Jong and Community Coordinator Gina Moape will talk about their experience and findings thus far. Register here.

Can New Data Licenses Address Major Issues?

June 17 (11am EST / 5pm CET): Dr. Chijioke Okorie, Founder and Leader of the Data Science Law Lab at the University of Pretoria, will present their work to create a new data license that addresses the known drawbacks of using Creative Commons licenses in certain contexts (reinforcing extractive practices and digital colonialism, for example). Register here.