Datasets will Become an Object of Investigative Inquiry for Journalists

Christo Buschek, Mozilla 2025 Fellow

From groundbreaking innovations to bold visions, our 2025 Fellows share their predictions on where technology is headed—and the impact it could have on the world.

See the full list →

For a long time, the discourse around AI has been either “Oh no, we are all doomed” or “Only with AI can we cure cancer.” But behind both narratives are the same people, the same companies, and the same interests. In 2025, we journalists will use dataset reporting to make our AI coverage better and hold the interests behind AI accountable.

Let me be clear: We journalists did a lousy job reporting on AI. Rather than approach it with skepticism, rigor, and investigative curiosity, many of us were dazzled by it and opted to repeat marketing material, treating AI like a cool new gadget. Falling for the hype didn’t do us or our audiences any favors.

Another reason our coverage has fallen short is that we have internalized the phrase “AI is a black box.” We say this, throw our hands in the air, and walk away. But we can do better. We need to do better if we want to hold AI systems and the companies building them accountable for their impact on the world.

While it is true that models are non-deterministic and difficult to explain, the datasets used to train these systems are not. Datasets show that AI is not a black box but an assemblage of different technical artifacts and processes. It is the result of choices and values, and a product of the culture in which it originates. When we look closely at the datasets used to train these incredibly complex machines, we can actually understand the models they power and learn about the emergent effects of algorithmic systems.

Great reporters are already starting to investigate the building blocks of AI products to understand how they are made. Journalists showed that YouTube videos – swiped without the consent of their creators – are found in AI training datasets. They exposed the communities and businesses behind non-consensual deepfake intimate imagery, and they identified Nazi propaganda in training datasets.

For many decades, the investigative method has proven to be a powerful tool for transparency and accountability, not just for journalists. It is time to apply this method to the most complex technological systems ever conceived by humans.

Dataset investigation is one method for holistically interrogating AI systems — the data, the models, and the effects. It is only by looking at datasets that we get a better sense of how AI models work and the gaps, errors, and biases that can emerge.

We journalists need to exit the current hype train. Only then can we weigh this technology’s features against its tradeoffs, make informed decisions about if, how, and why it makes sense for ourselves or our audiences, and report critically on AI and its impacts.

Christo Buschek is a 2025 Mozilla Fellow.