Bringing AI Down to Earth: Evaluation as a Main AI Theme in 2025

Roya Pakzad, 2025 Mozilla Fellow

From groundbreaking innovations to bold visions, our 2025 Fellows share their predictions on where technology is headed—and the impact it could have on the world.

See the full list →

The AI landscape in 2025 will undergo a much-needed transformation, shifting from excitement around general capabilities to a focus on evaluation of real-world, domain-specific performance. While the past few years have been largely about celebrating the potential of generative AI and large language models, the next phase will demand practical answers: How effective are these systems at performing specific tasks in domains such as healthcare, government services, humanitarian crises, or social media content governance?

Currently, most benchmarks evaluate AI on static tasks (such as question-answering or image classification) with predefined datasets. However, these benchmarks fail to capture the complexity of real-world applications. For instance, a language model might perform well on a standard benchmark but falter when tested in non-English or diverse contexts, where an understanding of subtle linguistic nuances and societal norms is essential. As AI systems become more “agentic,” grappling with memory, reasoning, actions, and third-party tool integrations, traditional evaluation methods fall short. Can an AI agent equitably manage requests in diverse languages and cultural contexts? How will it respond to unpredictable scenarios in crisis management or public services? These are the kinds of questions that demand more nuanced socio-technical evaluation approaches and fresh thinking about benchmarks.
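To make the gap concrete, here is a minimal sketch (in Python, not drawn from any particular benchmark suite) of what disaggregated evaluation looks like: instead of reporting one aggregate score, the harness scores a model separately for each language or community context represented in the test set. The benchmark items and the `predict` callable are hypothetical placeholders for whatever dataset and model are under evaluation.

```python
from collections import defaultdict
from typing import Callable

def disaggregated_accuracy(items, predict: Callable[[str], str]):
    """Score a model per language so gaps aren't averaged away."""
    correct, total = defaultdict(int), defaultdict(int)
    for item in items:
        prediction = predict(item["prompt"])
        total[item["lang"]] += 1
        if prediction.strip() == item["reference"].strip():
            correct[item["lang"]] += 1
    # One accuracy figure per language, rather than a single headline number.
    return {lang: correct[lang] / total[lang] for lang in total}

# Hypothetical benchmark items: each pairs a prompt and a reference answer
# with the language/community context it was drawn from.
benchmark = [
    {"lang": "en", "prompt": "…", "reference": "…"},
    {"lang": "fa", "prompt": "…", "reference": "…"},
    {"lang": "sw", "prompt": "…", "reference": "…"},
]

# `my_model` would wrap whatever system is being evaluated:
# print(disaggregated_accuracy(benchmark, predict=my_model))
```

Exact string match is, of course, a crude proxy for quality; the point is only that results reported per language and context make disparities visible instead of averaging them away.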

My fellowship project, Equitable AI Benchmarking for Linguistic Diversity, addresses these gaps head-on. This open web-based platform realigns AI benchmarking practices to better serve non-English speaking communities, especially those most vulnerable to AI-related harm. By creating contextually and linguistically nuanced benchmarking data and practices, in collaboration with civil society organizations, the project enables evaluations that reflect the lived realities of marginalized communities. It recognizes that traditional benchmarks, often created by private companies or academic institutions, fail to incorporate sufficient input from the communities most affected by the technology.

Recent developments underscore the urgency of this work. Leading AI labs, government agencies, and philanthropic groups are actively exploring new methodologies to address gaps in current evaluation systems. Major AI conferences such as NeurIPS now host dedicated workshops scrutinizing the shortcomings of existing benchmarks and exploring ideas for more community-driven, participatory approaches to testing AI systems.

As AI systems grow in complexity, evaluation must keep pace. And 2025 will be the year we see more nuanced evaluation frameworks, techniques, and benchmarks that help cut through the hype surrounding GenAI capabilities and bring them down to earth.

Roya Pakzad is a 2025 Mozilla Fellow.
