This is a profile of Zeno, a Mozilla Technology Fund awardee.

Voice assistants like Siri, Alexa, and Google Assistant have revolutionized human-computer interaction. From helping you search for facts to switching off your lights, these assistants can make your life easier. But they become significantly less helpful if they don’t understand your English accent.

Just like these virtual assistants, large language models (LLMs) like GPT-3 and BERT are changing how humans interact with AI. From explaining concepts to writing text, these tools are useful day to day, but they are not always reliable: they can give arbitrary answers or even present non-factual or fabricated information, a phenomenon known as “hallucinating.”

These tools often fall short of meeting their users’ needs: they exhibit biases and lack accessibility, inclusivity, and trustworthiness. The underlying AI models that power them often need to be tested beyond simple aggregate metrics such as accuracy. Alex Cabrera, project lead at Zeno, says that as these models become more complex, there is a need to evaluate them more effectively.

“It's becoming much harder to understand how well these models perform. For example, how accurate they are, whether they work for certain edge cases or whether they have certain biases or safety concerns,” Cabrera says.

Zeno, part of the 2023 Mozilla Technology Fund cohort, is an interactive platform for exploring and managing data, debugging models, and tracking and comparing model performance. The tool allows practitioners, including those without coding knowledge, to test models for tasks such as image classification and generation, audio transcription, Q&A chatbots, and more.

“The idea behind Zeno was whether we can create a platform that lets people encode all these complex properties that they want their model to have and start exploring those properties and comparing models against each other,” says Cabrera.

Zeno was created to help evaluate models on a range of metrics, which depend on what the designer deems important for the model to do.

“The traditional paradigm of just calculating one number for accuracy and calling it a day doesn't really work. For instance, there are a number of different dimensions for which a generated image can be good or bad,” Cabrera adds.

Using image generation and LLMs as examples, Cabrera says a designer might care about whether a generated image is blurry or whether human fingers appear realistic. For a text summarization model built on an LLM, a designer could ask the tool to test the accuracy or the fluency of the summary. They could then improve their model based on the specific metrics they care about, as the sketch below illustrates for one such property.
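As a minimal sketch of what encoding one such property might look like, the snippet below scores a generated image’s blurriness using OpenCV’s Laplacian-variance heuristic. The function name, file path, and cutoff value are illustrative assumptions, not part of Zeno’s API.

```python
# A sketch of one quality dimension a designer might encode: a simple
# blurriness score for generated images, using the variance of the
# Laplacian (low variance suggests a blurry image). The threshold below
# is an illustrative assumption, not something from Zeno.
import cv2

def blurriness_score(image_path: str) -> float:
    """Higher variance means a sharper image; low variance suggests blur."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    return cv2.Laplacian(gray, cv2.CV_64F).var()

# Hypothetical usage: flag a generated image as blurry.
is_blurry = blurriness_score("generated.png") < 100.0  # heuristic cutoff
```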

Zeno currently lets users explore a demo of how the platform works in real time.

In the demo, users can compare two audio transcription models on the same phrase and see how each system transcribes it differently. For this demo, Zeno used the Speech Accent Archive, a dataset of 2,140 speech samples from 177 countries.

Comparing the two models, Zeno computes each one’s word error rate – how many words it mistranscribes. The demo breaks this error rate down by the speaker’s continent of origin or by the age at which they learned to speak English. Developers can then create interactive visualizations to see how their model compares to others, which helps them decide what to improve and shapes how the model behaves once it goes out into the world. A minimal version of this sliced comparison is sketched below.
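The following snippet illustrates the idea under stated assumptions: a hypothetical CSV export of the dataset with each model’s transcript per sample, the jiwer library for word error rate, and pandas for slicing. It is a sketch of the kind of analysis Zeno automates, not Zeno’s actual API.

```python
# A minimal sketch of the sliced model comparison described above.
# Assumes a hypothetical CSV with columns: transcript (ground truth),
# model_a and model_b (each model's output), and continent.
import pandas as pd
from jiwer import wer  # word error rate between reference and hypothesis

df = pd.read_csv("speech_accent_archive.csv")  # hypothetical export

def slice_wer(group: pd.DataFrame, column: str) -> float:
    """Average word error rate of one model's outputs over a slice."""
    return sum(
        wer(truth, hyp) for truth, hyp in zip(group["transcript"], group[column])
    ) / len(group)

# Break the comparison down by the speaker's continent of origin.
for continent, group in df.groupby("continent"):
    print(
        f"{continent}: model_a WER={slice_wer(group, 'model_a'):.2%}, "
        f"model_b WER={slice_wer(group, 'model_b'):.2%}"
    )
```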

“If you’re deploying your app in different places around the world, this data is something you might want to consider. Maybe you want to collect more training data from people with different accents or different parts of the world, or have better ways of dealing with potential transcription errors,” Cabrera says.

The idea came to fruition last year, as Cabrera started the fourth year of his Ph.D. He says Zeno’s ambitious goal is to work across all data types and models. While existing tools allow for model evaluation, they usually focus on specific model errors or data types. He wanted to create a tool that is useful across diverse AI systems, given that the models being deployed increasingly have serious implications.

“I really want to build a tool that can influence the types of models people are putting out into the world,” Cabrera says.

Over the next year with Mozilla, Cabrera hopes to improve the Zeno tool, especially by adding intelligent features such as automated error discovery.

“So far we've implemented a [tool] that can automatically find groups of data that have very high errors without you having to manually click through everything. We're also planning to automatically create some charts or visualizations that can show interesting insights, differences, and disparities between the models,” he shares.
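To make the idea concrete, here is a toy version of that kind of error discovery: scanning metadata columns for slices whose error rate sits well above the overall rate. The column names, the per-row error flag, and the thresholds are assumptions for illustration; this is not Zeno’s implementation.

```python
# A toy illustration of automated error discovery: scan metadata columns
# for slices where a model's error rate is unusually high. Assumes a
# hypothetical DataFrame with an `error` column (1 = model was wrong)
# plus arbitrary metadata columns; not Zeno's actual method.
import pandas as pd

def high_error_slices(df: pd.DataFrame, meta_cols: list[str],
                      min_size: int = 30, top_k: int = 5) -> pd.DataFrame:
    overall = df["error"].mean()  # baseline error rate over all data
    rows = []
    for col in meta_cols:
        for value, group in df.groupby(col):
            if len(group) >= min_size:  # skip tiny, noisy slices
                rows.append({
                    "slice": f"{col} == {value!r}",
                    "size": len(group),
                    "error_rate": group["error"].mean(),
                    "lift_over_overall": group["error"].mean() - overall,
                })
    return (pd.DataFrame(rows)
              .sort_values("lift_over_overall", ascending=False)
              .head(top_k))

# Hypothetical usage on the accent dataset from earlier:
# print(high_error_slices(df, ["continent", "english_age"]))
```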