This is a profile of Evaluation Harness, a 2023 Mozilla Technology Fund Awardee.

Over the last couple of years, we’ve seen more and more Large Language Models (LLMs) gaining popularity. LLMs are tools like ChatGPT, and the more recently launched HuggingChat, that are trained on large text datasets so they can generate human-like responses. These LLMs are susceptible to bias and misinformation. For example, if an LLM is trained on datasets that associate the word “chef” with men and “cook” with women, it can reproduce that discriminatory association in its outputs.

As they become more ubiquitous, these tools require more scrutiny. With little to no peer review of LLMs, what happens when these tools are used in real-world scenarios? How do they fare on things like accuracy and precision? Just how reliable are they?

The LM Evaluation Harness, developed by EleutherAI and later expanded by BigScience, is addressing this problem. The project is one of this year’s Mozilla Technology Fund awardees, part of a cohort focused on auditing tools for AI systems. Auditing processes help build accountability for the AI systems that play an increasingly important role in our daily lives.

Having evolved since 2020, the Evaluation Harness is an open-source tool that allows researchers and developers to put their LLMs through a robust evaluation process. The process uses a standardized framework to test the accuracy and reliability of a language model: a user chooses the benchmarks they would like to test their model against, runs them in the system, and then receives results.

Some examples of the benchmarks used to test LLMs include question answering, multiple-choice questions, and even tasks that probe for gender bias, similar to tasks a human would be able to do.
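For a concrete picture of that workflow, below is a minimal sketch of what running a couple of benchmark tasks through the harness’s Python interface can look like. The model checkpoint, task names, and argument values here are illustrative, and the exact backend names and keyword arguments can differ between versions of the harness.

```python
# Minimal sketch: evaluating a Hugging Face model with the LM Evaluation Harness.
# Assumes the lm-evaluation-harness package (EleutherAI) is installed; backend
# names, task names, and arguments may vary between harness versions.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf",                                       # Hugging Face model backend
    model_args="pretrained=EleutherAI/pythia-410m",   # which checkpoint to evaluate
    tasks=["hellaswag", "winogrande"],                # benchmarks chosen by the user
    num_fewshot=0,                                    # zero-shot evaluation
    batch_size=8,
)

# The harness returns a dictionary of per-task metrics (accuracy, perplexity, etc.)
for task, metrics in results["results"].items():
    print(task, metrics)
```

The same run produces results in a standardized format, which is what makes numbers comparable across papers and across models.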

Prior to the Evaluation Harness, developers were creating models, evaluating them with their own ad hoc code, and often not releasing that code, meaning that reproducing and verifying their reported results was a difficult and tedious task. Hailey Schoelkopf, a Senior Scientist at EleutherAI and lead developer on the Evaluation Harness project, says this is the problem they sought to solve.

“We believe that you shouldn't trust the results specified in a paper unless you're able to replicate them yourself and see the numbers yourself. And so the harness is a way for people to test [LLMs] on a wide range of benchmark tasks, and actually be able to compare it to other results in the literature and check how all these things hold up,” Schoelkopf shares.

While there are other resources out there, including Google’s BIG-Bench and Stanford’s HELM, one key consideration in the building of the Evaluation Harness sets it apart.

“There’s a fragmenting where if you want to evaluate a bunch of different benchmarks that all of the different organizations have dealt with a number of times, you need to go through different code bases. The main priority for us was to give a single place, where, if you have a new model, you can implement it in our framework and boom, you can evaluate on hundreds of tasks,” says Stella Biderman, an AI researcher at Booz Allen Hamilton and Executive Director of EleutherAI, who is one of the creators of the Evaluation Harness.
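To illustrate what Biderman means by implementing a model in the framework, here is a rough sketch of the kind of wrapper class the harness expects. The class name is hypothetical, and the base class and method names reflect recent versions of the harness, so treat the exact interface as an assumption rather than a definitive API.

```python
# Rough sketch of plugging a custom model into the LM Evaluation Harness.
# The LM base class and method names below follow recent harness versions;
# request formats and signatures differ between releases, so treat them as
# assumptions rather than an exact specification.
from lm_eval.api.model import LM


class MyCustomModel(LM):
    """Hypothetical wrapper exposing an arbitrary model to the harness."""

    def loglikelihood(self, requests):
        # Return (log-probability, is-greedy) pairs for (context, continuation) pairs,
        # used by multiple-choice and classification-style tasks.
        ...

    def loglikelihood_rolling(self, requests):
        # Return full-sequence log-probabilities, used by perplexity-style tasks.
        ...

    def generate_until(self, requests):
        # Return generated text for each context, used by free-form generation tasks.
        ...
```

Once a wrapper like this exists, every benchmark task in the harness’s library can be run against the new model without any task-specific glue code, which is the consolidation Biderman describes.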

The idea for the Evaluation Harness was originally inspired by OpenAI’s paper “Language Models are Few-Shot Learners,” which presented the GPT-3 model. The paper included a large amount of evaluation data; however, the code used to run those evaluations was never released to the public. As a result, the team behind the Evaluation Harness began reimplementing those benchmark evaluations on their own and testing their own models against them.

“It started off as building a way to test our own models on the same sorts of tasks in the GPT-3 paper. Our scope and interest has increased as more benchmarks have come out,” says Biderman.

Before the Evaluation Harness, trying to evaluate benchmark tasks was a harrowing ordeal.

“There was a huge amount of manual work that was required to convert between frameworks and run evaluations across different code bases. When we started doing it we were like ‘Well this is insane that people put up with this’ and so we wanted to make that process a lot smoother,” says Biderman.

The tool is already seeing success: it is used by EleutherAI and the BigScience research project, and it has appeared in papers by Google and Microsoft. With the impact of their work, the Evaluation Harness team intends to do even more.

One of the areas where Biderman, Forde, and Schoelkopf see a gap and would like to have an impact is the equity of multilingual LLMs. There is currently a skew toward English and Chinese language models.

“We're very interested in providing improved tools for the evaluation of multilingual large language models. This was a limitation we had found in the literature when we were using Evaluation Harness for BigScience. So the thing we are going to be working on is putting together a new dataset of machine-generated and human-generated summaries across 10 different languages,” says Jessica Forde, Ph.D. student at Brown University and Principal Investigator on the MTF-funded project to improve the evaluation of LLMs using Evaluation Harness.

Biderman states that the problem is that there just hasn’t been enough investment in training or evaluating these models in a wide variety of languages. She describes the current way of doing things as lacking nuance.

“Typically the way that people evaluate the Swahili ability of a language model is they take English text, and they translate it into Swahili, and then they translate it back into English, and then they evaluate on an English benchmark, and they assume if the model were good at Swahili, then it would still be good at it after you've done this translation,” says Biderman.

This means such evaluations fail to capture important nuances embedded in different language systems. They say, however, that this work is no easy feat.

“As a researcher, you have to be respectful of every single language on their own terms and the speakers of those languages. This work is essential to do this and it has been taking a lot of time,” says Forde.

The Evaluation Harness team and the multilingual evaluation project are made up of researchers based at Brown University and EleutherAI.

The Mozilla Technology Fund (MTF) supports open-source technologists whose work furthers promising approaches to solving pressing internet health issues. The 2023 MTF cohort will focus on an emerging, under-resourced area of tech with a real opportunity for impact: auditing tools for AI systems.