OSRI
Mozilla research found that AI-detection tools aren’t always as reliable as their makers claim. Further, researchers found that large language models like ChatGPT can be successfully prompted to create more ‘human-sounding’ text.

Introduction

As we wrote previously, generative AI presents new threats to the health of our information ecosystem. The major AI players recognize the risks that their services present: OpenAI published a paper on the threat of automated influence operations, and its policy prohibits the use of ChatGPT for “political campaigning or lobbying, including generating campaign materials personalized to or targeted at specific demographics”, although our research has found that this policy is not sufficiently enforced.

Tools that reliably distinguish between human- and AI-written text would be valuable. Some such tools exist, but we must be careful to understand their strengths, biases, and limitations. When too much faith is placed in inaccurate tools, people can be harmed: students have been falsely accused of submitting AI-written essays, and The Markup reports that AI-detection tools can be biased against non-native English speakers.

Efforts at building detector tools have generally not been promising so far. OpenAI itself released a tool “trained to distinguish between AI-written and human-written text” in January of 2023, but took it down in July of that year, citing “its low rate of accuracy”. One report says that “it was only successful at classifying 26% of AI-written text as "likely AI-written" and incorrectly labeled human-written text as AI 9% of the time.” They explain that they are “currently researching more effective provenance techniques for text, and have made a commitment to develop and deploy mechanisms that enable users to understand if audio or visual content is AI-generated.” But OpenAI has not released a replacement so far.

Binoculars

There has been positive coverage of a recent method published by researchers from the University of Maryland called “Binoculars”, an approach which “look[s] at inputs through the lenses of two different language models.” They provide an open source implementation on GitHub, but caution that the “implementation is for academic purposes only and should not be considered as a consumer product. We also strongly caution against using Binoculars (or any detector) without human supervision.” Regardless, Business Insider writes: “A new AI-detection tool may have solved the problem of false positives for student writing, researchers say” while IEEE Spectrum discusses the method, writing that “Better and more effective AI detection techniques are on the horizon.” The authors write in their paper that “Binoculars detects over 90% of generated samples from ChatGPT (and other LLMs) at a false positive rate of 0.01%”. In other words, the method should detect AI-written text more than 9 times out of 10, while incorrectly labeling human-written text as AI-written (a false positive) in only 1 in 10,000 cases.
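
To make the two-model idea concrete, here is a minimal sketch of a Binoculars-style score: how surprising a text is to one model, divided by how much that model disagrees with a second, closely related model. This is our own illustration, not the authors’ code; the small GPT-2 models (chosen only because they share a tokenizer), the exact cross-perplexity formula, and the absence of a calibrated threshold are all simplifications, and the official GitHub repository should be treated as the reference implementation.

```python
# Illustrative sketch of the two-model "Binoculars" idea -- not the authors' code.
# gpt2/distilgpt2 are small stand-ins that share a tokenizer; the released
# implementation pairs larger models and calibrates a decision threshold.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
observer = AutoModelForCausalLM.from_pretrained("gpt2").eval()
performer = AutoModelForCausalLM.from_pretrained("distilgpt2").eval()

@torch.no_grad()
def binoculars_style_score(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    obs_logits = observer(ids).logits[0, :-1]    # observer's next-token predictions
    perf_logits = performer(ids).logits[0, :-1]  # performer's next-token predictions
    targets = ids[0, 1:]

    # Log-perplexity of the text under the observer model.
    log_ppl = F.cross_entropy(obs_logits, targets)

    # Cross-perplexity: how much the two models disagree about what comes next.
    x_ppl = -(F.softmax(perf_logits, dim=-1) * F.log_softmax(obs_logits, dim=-1)).sum(-1).mean()

    # Low scores suggest machine-like text, high scores human-like text;
    # a calibrated threshold turns the score into a binary label.
    return (log_ppl / x_ppl).item()

print(binoculars_style_score("The quick brown fox jumps over the lazy dog."))
```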

Our evaluation

To further evaluate the method, we use the AI Text Detection Pile dataset, which includes 990,000 human-written texts and 340,000 AI-written examples. Its summary reads: “This is a large scale dataset intended for AI Text Detection tasks, geared toward long-form text and essays. It contains samples of both human text and AI-generated text from GPT2, GPT3, ChatGPT, GPTJ.”
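
For reference, the dataset can be loaded with the Hugging Face datasets library, roughly as below. The dataset identifier and the column name are taken from our reading of the dataset card and should be treated as assumptions.

```python
# Hedged sketch: load the AI Text Detection Pile and count the label distribution.
# The dataset id and the "source" column name are assumptions based on the dataset card.
from collections import Counter
from datasets import load_dataset

dataset = load_dataset("artem9k/ai-text-detection-pile", split="train")
print(Counter(dataset["source"]))  # expect roughly 990,000 human and 340,000 AI examples
```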

The analysis notebook is available on GitHub.

We evaluated the provided implementation on this dataset by asking the Binoculars tool to determine whether each example text was AI- or human-generated. By comparing these predicted labels to the true labels provided in the dataset, we can determine, for each example, whether Binoculars correctly assessed its origin.
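
In outline, the evaluation looks like the sketch below; the full code is in the analysis notebook linked above. The `Binoculars` class and its `predict` method reflect our reading of the authors’ repository, and the column names carry over the assumptions from the previous snippet.

```python
# Hedged outline of the evaluation loop; see the analysis notebook for the actual code.
# Assumes the repository exposes a Binoculars class whose predict() returns a
# human/AI label, and that dataset rows have "text" and "source" fields.
from binoculars import Binoculars

bino = Binoculars()

results = []
for example in dataset:  # `dataset` as loaded in the previous snippet
    predicted_ai = "ai" in bino.predict(example["text"]).lower()
    actually_ai = example["source"] != "human"
    results.append((predicted_ai, actually_ai))

# Confusion counts, treating AI-written text as the positive class.
tp = sum(1 for p, a in results if p and a)          # AI text correctly flagged
fp = sum(1 for p, a in results if p and not a)      # human text wrongly flagged as AI
fn = sum(1 for p, a in results if not p and a)      # AI text missed
tn = sum(1 for p, a in results if not p and not a)  # human text correctly passed

print(f"TPR: {tp / (tp + fn):.3f}  FPR: {fp / (fp + tn):.4f}")
```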

Our evaluation shows a true positive rate of 43%, roughly half of what the authors found in their evaluation. More critically, the false positive rate is about 0.7%, 70 times higher than the authors’ finding – this means that a writer could be falsely accused of using AI in about 1 in 140 cases instead of 1 in 10,000.

I reached out to the lead author of the Binoculars paper, Abhimanyu Hans, with these results. He suggested three possible explanations:

  • The dataset we used for evaluation was released about a year ago, and much of its AI-written portion was generated by older models like GPT-2, for which the Binoculars method might be less effective. However, this would only affect the true positive rate, not the false positive rate.

  • Text length. He explained that the Binoculars method works best with texts about 256 tokens (roughly 1024 characters) long, with performance decreasing for shorter or longer texts.

  • Language. The method works best with English text, and he suggested that the dataset might contain non-English text. I did not thoroughly validate this, but a casual examination suggests that the dataset is English-only.

To test the impact of text length, we chose a target length of 1024 characters, which corresponds approximately to the 256 tokens the author specified. We then ran another evaluation in which we rejected all texts shorter than this threshold and truncated all longer texts to it. In this case the true positive rate remained approximately unchanged, while the false positive rate decreased from 0.7% to 0.4% – a marked improvement, but still far from the authors’ findings.
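
Concretely, the length-controlled re-run amounts to a filter-and-truncate step before scoring, roughly as sketched below, with 1024 characters standing in for the suggested 256 tokens.

```python
# Hedged sketch of the length-controlled re-run: drop short texts, truncate the rest.
TARGET_CHARS = 1024  # roughly 256 tokens at ~4 characters per token

length_controlled = (
    dataset
    .filter(lambda ex: len(ex["text"]) >= TARGET_CHARS)   # reject texts below the threshold
    .map(lambda ex: {"text": ex["text"][:TARGET_CHARS]})  # truncate the rest to the threshold
)
# The evaluation loop above is then repeated on `length_controlled`.
```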

I am confident that the performance the authors report in their paper is accurate for their evaluation data. But our findings reveal a concerning lack of robustness, especially a tendency to incorrectly claim that human-written text was generated with AI.

Examples

To understand the failures, we examined examples of false negatives (AI text rated as human) and false positives (human text rated as AI-generated) from the evaluation.

As a parting experiment, let’s test the tool on an example political advertisement like those in our previous article, in which we used ChatGPT to generate political ads:

  • Retired Military Veteran, John: "Honor and support for our troops stand at the heart of the Biden administration. With a proven track record of enhancing veterans' healthcare and increasing defense budgets, we're committed to keeping America safe and treating our heroes with the dignity they deserve."

  • Young Urban Professional, Sofia: "Join Team Biden in building a sustainable future with cutting-edge clean energy initiatives and policies that foster innovation. Together, we can tackle climate change, reduce student debt, and create a dynamic job market that's ready for tomorrow's challenges."

  • Rural Small Business Owner, Emma: "President Biden understands the backbone of America's economy lies in its small businesses and farms. We're working to implement fair trade deals, improve rural infrastructure, and provide tax relief to help your business thrive in a global market."

  • Suburban Middle-Class Parent, Michael: "Every child deserves a first-rate education, and under the Biden administration, we're investing more in our public schools and educators than ever before. With policies aimed at making healthcare affordable and our neighborhoods safe, we're building stronger communities for all our families."

Binoculars detected this as AI-generated. However, when I asked ChatGPT “Can you introduce some more natural language and small errors to make it seem more naturally written?”, it produced the following:

  • Retired Military Veteran, John: "The heart of the Biden administration? It's all about honor and support for our troops. They've really stepped up, improving veterans' healthcare, bumping up defense budgets... It's about keeping us safe and treating our heroes right."

  • Young Urban Professional, Sofia: "Let's get on board with Team Biden for a greener future. They're rolling out some pretty cool clean energy plans and innovation policies. It’s about fighting climate change, cutting down student loans, and sparking a job market ready for what's coming."

  • Rural Small Business Owner, Emma: "Biden gets it – small businesses and farms are the real deal in our economy. Fair trade, better rural internet and roads, tax breaks... he’s pushing for changes that matter to us out here, making sure we can stand tall in the global market."

  • Suburban Middle-Class Parent, Michael: "Every kid should get top-notch schooling, right? Biden's team is pouring more into public schools and teachers than before. They're all about affordable healthcare and safe neighborhoods, really making a difference for families like ours."

The Binoculars system rated this as human-generated.
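
For illustration, this round trip can be scripted end to end. The sketch below uses the OpenAI Python client with a placeholder model name, and the same assumed Binoculars interface as in the earlier snippets.

```python
# Hedged sketch of the "make it sound more human, then re-check" round trip.
# The model name is a placeholder; bino is the assumed Binoculars instance from above.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = ("Can you introduce some more natural language and small errors "
          "to make it seem more naturally written?\n\n")

def humanize(text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": PROMPT + text}],
    )
    return response.choices[0].message.content

ad_text = "Honor and support for our troops stand at the heart of the Biden administration."
print(bino.predict(ad_text))            # label for the original generated ad copy
print(bino.predict(humanize(ad_text)))  # label after asking the model to humanize it
```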

Conclusions

We have evaluated just one of many tools available to detect AI-generated text. We chose this tool partly because of its high claimed performance, but also because the authors’ responsible open source release makes such an evaluation possible – many systems are closed, making third-party evaluation difficult or impossible. Still, we believe our findings are typical and that the limitations are inherent to the problem: AI-generated text is just not different enough from human-written text to be consistently distinguishable. If the text a determined actor generates is detected as AI, it’s fairly simple to ask the model to make the text more natural-sounding, try a different model, or work in languages or at text lengths where the detectors don’t perform well. Moreover, the claims made by the authors of the Binoculars method are based on an evaluation of data generated by a small handful of models; our findings cast doubt on the degree to which they generalize to a broad spectrum of models, past or future.

Even flawed detector tools may have useful applications. For example, a platform might employ such tools to detect likely automated accounts and flag them for further investigation. But it’s important to keep in mind that these tools may have biases that could disproportionately harm already marginalized communities online. And for certain applications, especially those where errors can have drastic consequences, such as plagiarism detection, it’s unlikely that any tool will ever reach a high enough bar to allow confidence that students will not be falsely accused of using AI for an essay that, in reality, they worked hard to write themselves. A rollout of this method for plagiarism detection in a college department, for example, could result in widespread false accusations of plagiarism, possibly disproportionately targeting students for whom English is not their first language.

The challenge of detecting AI-generated content is receiving much attention these days, and rightly so. But policymakers and society more broadly should not rush to push for seemingly easy solutions to a complex problem. Instead, they should remain cautious of bold claims about supposed fixes, and they should invest in advancing this important field of research. Any policy solution around AI-generated text will need to respect the fact that AI-generated text is not necessarily distinguishable from what people write – and that tools developed to detect such differences can be gamed by ill-intentioned actors or prove ineffective.