Mozilla Supports Five Projects Creating Building Blocks for a Better Data Ecosystem

The Data Futures Lab welcomes its 2024 Infrastructure Fund awardees

(WEDNESDAY, MARCH 13, 2024) — Today, Mozilla is announcing its 2024 Data Futures Lab Infrastructure Fund awardees: five ambitious projects building tools that address issues of transparency, privacy, bias, and agency in the data lifecycle.

These projects will each receive up to $50,000 along with support and training from Mozilla staff and fellows. Mozilla released an open call for awardees in July 2023, and received more than 250 applications from 54 countries.

As an experimental space for builders working towards a fairer data economy, the Data Futures Lab is the perfect place for these projects to build and release tools and methods that can be leveraged by developers. All projects will make their code available in a public repository.

Says Lisa Gutermuth, Program Officer, Data Futures Lab: “This year’s Infrastructure Fund cohort features an eclectic mix of expertise — which is exactly what we need to shift the data ecosystem in a new and better direction. Mozilla is funding researchers and entrepreneurs, programmers and activists, and communities working on voice, text, and synthetic data as it relates to trustworthy AI.”

These projects will join Mozilla’s existing network of awardees and fellows pursuing a more equitable data ecosystem, such as Mozilla Technology Fund awardee Evaluation Harness, an open-source tool for evaluating large language models, and Senior Fellow alumna in Trustworthy AI Bogdana Rakova, who is exploring the use of computational contracts to enable new modes of interaction between people and consumer tech companies.

Learn more about the projects:

Data Provenance Initiative: Mapping the provenance of popular datasets

USA

Recent breakthroughs in language modeling are powered by large collections of natural language datasets. This has triggered an arms race to train models on disparate collections of incorrectly, ambiguously, or insufficiently documented data, leaving practitioners unsure of the ethical and legal risks. To address this, the Data Provenance Initiative has mapped 2,000+ popular text-to-text fine-tuning datasets from origin to creation, cataloging their data sources, licenses, creators, and other metadata for researchers and developers to explore using this tool. The purpose of this work is to improve transparency, documentation, and informed use of datasets in AI.

See their presentation recording as part of the DFL Speaker Series in January 2024.

Imperial College London: Identifying privacy risk in AI-generated synthetic data

The Computational Privacy Group at Imperial College London will build on their initial research around detecting privacy risk in AI-generated synthetic datasets, and publish an open-source toolkit that enables builders to evaluate the privacy risk of AI-generated synthetic data before releasing it. The initiative is titled “Leaving no one behind: a tool to flag privacy risk in AI generated synthetic data.”

Fundación Vía Libre: Detecting discriminatory behaviors in AI

Argentina

Fundación Vía Libre will build on their existing toolset, EDIA (Spanish abbreviation for “Stereotypes and Discrimination in Artificial Intelligence”), which inspects core components of automatic language processing technologies to detect and characterize discriminatory behaviors. Specifically, they will use community-centered methods to build a language dataset that represents stereotypes in Argentina; publish programming libraries to integrate the dataset in audit processes for public and private institutions who use language models; and publish structured content and teaching materials so that others can replicate their methods for other languages and contexts.

See their presentation recording at a DFL Community Call in July 2023.

Data Science Law Lab: Designing a more responsible data license

South Africa

The Data Science Law Lab at the University of Pretoria will conduct research that addresses the shortcomings of using Creative Commons licenses in certain contexts (such as reinforcing extractive practices and digital colonialism) and create a prototype for a new data license based on their findings.

Sign up for their talk as part of the DFL Speaker Series, which is running through the first half of 2024 and explores fair use and transparency in the generative AI data ecosystem.

FLAIR Initiative (First Languages AI Reality): Community-centered dataset creation

USA (indigenous communities)

FLAIR will work with an indigenous language community, using their software and methodology to collect the corpus data needed to develop Automatic Speech Recognition (ASR) for the community’s language while minimizing the burden on current speakers. Given the limited availability of both indigenous language data and speakers, they will employ a method that uses minimal inputs (around 500 phrases) as stimuli. They will publish the source code and a methodology manual that aims to enable other indigenous language communities to revitalize their languages more rapidly and effectively, using their own data, and on their own terms.

Press contact: Kevin Zawacki | [email protected]