
What do screenwriters, AI builders, researchers, and survivors of gender-based violence have in common? I’d argue they all imagine new, safe, compassionate, and empowering approaches to building understanding.

In partnership with Kwanele South Africa, I lead an interdisciplinary team exploring this commonality in the context of evaluating large language models (LLMs) - more specifically, chatbots that provide legal and social assistance in a critical context. The outcome of our engagement is a series of evaluation objectives and scenarios that contribute to an evaluation protocol built on the core tenet that when we design for the most vulnerable, we create better futures for everyone. In what follows, I describe our process. I hope this methodological approach and our early findings will inspire other evaluation efforts to meaningfully center the margins in building more positive futures that work for everyone.

The problem of staying in the general

There is a major lack of transparency in the design, development, training data, and evaluation methods used to build and deploy generative AI models. This often makes it hard for users to trust their outputs. For example, an Air Canada passenger was recently misled by the airline’s chatbot, which incorrectly explained the company’s bereavement travel policy. How can technologists, researchers, policy experts, and impacted communities collaborate to mitigate such risks? Evaluation is an active area of research at the forefront of generative AI adoption. It connects to and builds on related work on algorithmic auditing, risk assessment, red-teaming, and benchmarks, as well as the so-called participatory turn in AI design.

An evaluation framework or protocol describes the evaluation objectives and the procedure through which an evaluation takes place. Research has shown that who gets to participate in an evaluation can affect the results - for example, domain experts, university students, and professional annotators will inevitably approach the accuracy of a particular LLM output differently. Technology companies today pay third-party annotators, or so-called “labelers,” to contribute to the evaluation of new models. Fundamentally, the recent waves of generative AI innovation rely on the hidden work of labelers who are paid to state their preferences about which outputs of a model are better than others. Yet there is no guarantee that labelers’ preferences are, in fact, accurate.
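
To make that point concrete, here is a minimal, hypothetical sketch of how one might compare ratings from different annotator groups on the same model outputs. The group names, scores, and simple averaging are illustrative placeholders, not a description of any company’s actual labeling pipeline.

```python
# Illustrative sketch: the same model outputs rated by two hypothetical
# annotator groups. A gap between the groups signals that "accuracy"
# depends on who is doing the labeling.
from collections import defaultdict
from statistics import mean

# Each record: (output_id, annotator_group, accuracy rating on a 1-5 scale)
ratings = [
    ("out-1", "domain_experts", 2), ("out-1", "crowd_workers", 4),
    ("out-2", "domain_experts", 5), ("out-2", "crowd_workers", 5),
    ("out-3", "domain_experts", 1), ("out-3", "crowd_workers", 3),
]

by_group = defaultdict(lambda: defaultdict(list))
for output_id, group, score in ratings:
    by_group[group][output_id].append(score)

# Mean rating per group, averaged over outputs.
for group, per_output in by_group.items():
    group_mean = mean(mean(scores) for scores in per_output.values())
    print(f"{group}: mean accuracy rating {group_mean:.2f}")
```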

During the public launch of her book Unmasking AI, Joy Buolamwini debated Sam Altman on the difference between large general-purpose models and smaller domain-specific models (see a recording here). Building on her work and recent research, I argue that the distinction between general-purpose and domain-specific evaluations deserves the same attention. Even a general-purpose model needs to be evaluated within a specific intended context of use. For example, the question “Should AI discriminate on race and sexual preference?”, showcased in an evaluation experiment by Anthropic known as Collective Constitutional AI, means different things in the domain of medical diagnosis than in content moderation on a social media platform. A simple Yes/No answer fails to provide insight into the intricate nuances associated with the model’s outputs in each of those contexts.
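
As a rough illustration of what domain-specific evaluation implies in practice, the sketch below attaches different evaluation criteria to the same question depending on the intended context of use. The contexts and criteria are hypothetical examples of my own, not Anthropic’s or Kwanele’s.

```python
# Illustrative sketch: one question, different evaluation criteria per
# context of use, instead of a single Yes/No judgment.
EVALUATION_RUBRICS = {
    "medical_diagnosis": [
        "does the answer avoid using protected attributes as a proxy for risk?",
        "does it distinguish clinically relevant factors from discrimination?",
    ],
    "content_moderation": [
        "is the policy applied uniformly across demographic groups?",
        "does the answer avoid amplifying harassment of protected groups?",
    ],
}

def criteria_for(question: str, context_of_use: str) -> list[str]:
    """Return the context-specific criteria an evaluator would apply."""
    return [f"{question} :: {criterion}"
            for criterion in EVALUATION_RUBRICS[context_of_use]]

for line in criteria_for(
        "Should AI discriminate on race and sexual preference?",
        "medical_diagnosis"):
    print(line)
```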

The use case

Kwanele builds an LLM-based chatbot called Zuzi that provides legal assistance and social support to survivors of gender-based violence (GBV) in South Africa. The goal is to educate users about their legal rights and, in case of emergency, to direct them to a relevant service and let a human take over care duties. Kwanele’s team entered this partnership to make sure they could mitigate any potential risks and harms of using an LLM in this already sensitive context. They see the chatbot embodying three roles: (1) a legal analyst, helping make the legalese within government regulations easier to understand; (2) a crisis response social worker, guiding people to report GBV and seek help; and (3) a mental health therapist, conversing with victims in a psychologically and potentially physically vulnerable state. In the early design of their LLM prototype, Kwanele’s team leveraged a multi-stakeholder engagement framework - the Terms-we-Serve-with - to determine how to incorporate AI in a manner that aligns with their mission, values, and the needs of their users.
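
For readers curious about the structure of the human-handoff behaviour described above, here is a deliberately simplified sketch of that routing pattern. It is not Zuzi’s implementation; real crisis detection would require far more care than the placeholder keyword check shown here, and the signal list and handoff function are hypothetical.

```python
# Minimal sketch of the "let a human take over" pattern, with hypothetical
# placeholders. Not Zuzi's actual logic.
EMERGENCY_SIGNALS = ("immediate danger", "not safe", "emergency")

def route_message(message, generate_reply, escalate_to_human):
    """Escalate to a human responder when an emergency signal is detected;
    otherwise let the chatbot answer."""
    lowered = message.lower()
    if any(signal in lowered for signal in EMERGENCY_SIGNALS):
        # Hand over care duties to a person and point to a relevant service.
        return escalate_to_human(message)
    return generate_reply(message)
```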

A scenario-writing approach to identifying strengths and weaknesses

We conducted two online workshops, each lasting two hours. Throughout the workshops, we implemented a federated design approach to the evaluation of Kwanele’s chatbot Zuzi, inspired by the federated learning paradigm in AI and the shared-by-design principle at the core of Mozilla’s MozFest community. Participants engaged in storytelling activities within small breakout groups facilitated by members of the Kwanele team. In the first session, we created five possible scenarios in which a fictional persona, K, seeks to learn more about their legal rights and procedures and to find social support. Participants then imagined what they would ask, what the potential answers might be, and which answers would be most helpful. The goal was to understand potential strengths and weaknesses when a chatbot provides answers to K’s questions. We then introduced an unexpected turn into each scenario, and participants discussed what could go wrong and what should have happened instead. In the second workshop, we invited community members to create their own personas and scenarios and share them in small groups. They each interacted with Zuzi using their scenarios and collectively discussed their observations and expectations. Finally, community members drew on their own expertise to co-design mitigation strategies based on the insights from the scenarios and their prior knowledge.
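
One way to carry this workshop material forward is to capture each scenario in a structured, reusable form. The sketch below is my own framing of the elements described above (persona, question, hoped-for answer, unexpected turn); the field names and example values are hypothetical, not artifacts produced by the workshops.

```python
# A sketch of workshop scenarios as reusable evaluation cases.
from dataclasses import dataclass, field

@dataclass
class EvaluationScenario:
    persona: str                 # e.g. the fictional persona K
    question: str                # what the persona asks the chatbot
    helpful_answer: str          # the answer participants imagined as most helpful
    unexpected_turn: str         # the unexpected event introduced mid-scenario
    failure_modes: list[str] = field(default_factory=list)      # what could go wrong
    expected_behaviour: list[str] = field(default_factory=list)  # what should happen instead

# Hypothetical example case.
scenario = EvaluationScenario(
    persona="K, seeking to understand their legal rights after reporting GBV",
    question="What are my rights if I report abuse?",
    helpful_answer="Plain-language steps plus a referral to a local service",
    unexpected_turn="K loses connectivity halfway through the conversation",
    failure_modes=["K has to repeat their story from scratch"],
    expected_behaviour=["resume the conversation without retraumatizing K"],
)
```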

“These workshops were the first in an ongoing process of testing Zuzi. AI is hugely underutilized in certain population groups, where it has the potential to do huge good. However doing it right, recognizing potential harm, and ensuring transparency and ethical use of data is key to the success of such utilization. We are all feeling our way through this space, and these workshops, organized by Bogdana and with the help of other researchers, helped us ensure we hold ourselves to high standards of testing and deployment. Safe and trustworthy AI is going to take time, research, and many mistakes before it is right - but if we never try, it will never reach its full potential, and workshops like this, are the foundation in getting it started.”

- Leonora Tima, founder and executive director of Kwanele

The goal of the scenario-writing approach was to narrow down a set of evaluation objectives that the technical team can monitor and consistently evaluate when the LLM is deployed in production. The scenarios participants designed and discussed showed that there’s a need to examine a chatbot’s ability to communicate its strengths and weaknesses and to give users a sense of agency in their own experience of the interaction. Kwanele’s users are people who are potentially in a very sensitive situation. There might be challenges with how the chatbot is perceived - for example, an expectation of emotional support that a chatbot can’t provide. Privacy disclosures and consent are fundamental in this sensitive context, and consent mechanisms should meet people where they are without putting extra burdens on them.

Participants were also concerned about inaccuracies in Zuzi’s responses to more specific legal questions and about Zuzi’s inconsistency when users rephrased a question in different ways. For example, the chatbot will need to pick up on slang, fragments, misspellings, and grammatical errors, including in non-English languages. The chatbot might not be up to date with the proper information and might not be capable of giving the right support. There’s also a need to address connectivity issues that might interrupt the flow of interactions with Zuzi; having to repeat their story multiple times could be retraumatizing for users. Through exploring different scenarios, workshop participants also imagined ways for Kwanele to detect adversaries and protect users from them.
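
As an example of how one of these objectives could be monitored in production, the sketch below checks whether answers stay consistent when a question is rephrased, including with misspellings. The helper functions `ask_zuzi` and `similarity` are hypothetical stand-ins, not part of Kwanele’s system; a real check might use an embedding model or a human review queue instead.

```python
# Illustrative sketch of a paraphrase-consistency check for one
# evaluation objective surfaced in the workshops.
from itertools import combinations

PARAPHRASES = [
    "What are my rights if I report abuse?",
    "If I report abuse, what rights do I have?",
    "wat rights do i hav if i report abuse",  # fragments and misspellings
]

def consistency_score(ask_zuzi, similarity, paraphrases=PARAPHRASES) -> float:
    """Average pairwise similarity of answers to paraphrased questions.
    A low score flags the inconsistency participants observed."""
    answers = [ask_zuzi(q) for q in paraphrases]
    pairs = list(combinations(answers, 2))
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)
```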

Implications for further research on evaluation protocol design

To leverage the benefits of generative AI, builders need to account for the real-world context of their users. As a user, I cannot be reduced to an entry in a database; I embody a story within a complex social context. Through this work and our early findings, I argue that a scenario-writing approach can empower human-centered innovation that builds a more granular understanding of the challenges and strengths of using AI within specific contexts. Generative AI has transformative capabilities to empower low-resourced communities with access to justice when we center a relational, participatory, and socio-technical approach. Kwanele’s team is actively engaging a broad range of members of the communities they serve in the evaluation and safe deployment of the technology.

Acknowledgments

Thank you to Leonora Tima, the executive director of Kwanele, and her team - Ronel Koekemoer, Rachel Achieng, Shamryn Brittan, Chulumanco Nondabula, and Lebogang Sindani - who facilitated the breakout group discussions during both workshops. Thank you also to Meg Young and Tamara Kneese from the Data & Society Institute, and to Hanlin Li, an assistant professor in the School of Information at UT Austin, for co-creating the research questions and workshop sessions.