Words That Could Identify Generative AI Text: Artificial Intelligence Trends

A recent paper delves into excess words that could identify generative AI text. See what I did there? You will, if you keep reading!

According to Ars Technica (The telltale words that could identify generative AI text, written by Kyle Orland and available here), a group of researchers has established a novel method for estimating large language model (LLM) usage across a large set of scientific writing by measuring which “excess words” started showing up much more frequently during the LLM era (i.e., 2023 and 2024). The results “suggest that at least 10% of 2024 abstracts were processed with LLMs,” according to the researchers.

In the paper, titled Delving into ChatGPT usage in academic writing through excess vocabulary, four researchers from Germany’s University of Tübingen and Northwestern University examined “excess word usage” after LLM writing tools became widely available in late 2022. They found that “the appearance of LLMs led to an abrupt increase in the frequency of certain style words” that was “unprecedented in both quality and quantity.”

To measure these vocabulary changes, the researchers analyzed 14 million paper abstracts published on PubMed between 2010 and 2024, tracking the relative frequency of each word as it appeared across each year. They then compared the expected frequency of those words (based on the pre-2023 trendline) to the actual frequency of those words in abstracts from 2023 and 2024, when LLMs were in widespread use. Sounds exciting! 😉
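
For the curious, here is a minimal sketch of that comparison in Python. It assumes a simple straight-line fit through the pre-2023 years, which may not be the exact trend model the researchers used, and every name and number in it (fit_line, expected_frequency, the toy frequencies) is made up for illustration rather than taken from the paper.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def expected_frequency(freq_by_year, target_year, cutoff_year=2023):
    """Extrapolate the pre-cutoff trend for one word to target_year.

    freq_by_year maps a year to the fraction of that year's abstracts
    containing the word. The linear fit is an assumption of this sketch.
    """
    years = sorted(y for y in freq_by_year if y < cutoff_year)
    slope, intercept = fit_line(years, [freq_by_year[y] for y in years])
    # A frequency cannot be negative, so clamp the extrapolation at zero.
    return max(0.0, slope * target_year + intercept)


# Hypothetical frequencies for a single word (NOT data from the paper).
toy = {2019: 0.0010, 2020: 0.0011, 2021: 0.0012, 2022: 0.0013, 2024: 0.0200}

expected = expected_frequency(toy, 2024)  # what the old trend would predict
excess = toy[2024] - expected             # how far the actual value overshoots

print(f"expected 2024 frequency: {expected:.4f}")
print(f"actual 2024 frequency:   {toy[2024]:.4f}")
print(f"excess:                  {excess:.4f}")
```

Run across every word in the vocabulary, a comparison like this would flag the words whose 2024 usage sits far above where their own history says it should be.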

The analysis found several words that were extremely uncommon in these scientific abstracts before 2023 but suddenly surged in popularity after LLMs were introduced. The word “delves,” for instance, shows up in 25 times as many 2024 papers as the pre-LLM trend would predict; words like “showcasing” and “underscores” increased in usage roughly ninefold. Other words that were already common became notably more so in post-LLM abstracts: the frequency of “potential” increased 4.1 percentage points; “findings” by 2.7 percentage points; and “crucial” by 2.6 percentage points, for instance. Here are a couple of graphs from the paper that illustrate some of the excess words since 2022:

Copyright (C) Delving into ChatGPT usage in academic writing through excess vocabulary
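
Notice that the rare words above are described with a multiplier (“25 times”) while the common ones are described in percentage points. A quick sketch shows why both measures are useful. The function names and the underlying frequencies here are hypothetical, chosen only so the outputs land near the figures quoted above; they are not the paper’s data.

```python
def frequency_ratio(observed, expected):
    """How many times more often the word appears than the trend predicts."""
    return observed / expected


def frequency_gap_points(observed, expected):
    """Difference between observed and expected, in percentage points."""
    return (observed - expected) * 100


# Hypothetical rare word: expected in 0.04% of abstracts, observed in 1%.
rare_ratio = frequency_ratio(0.01, 0.0004)
rare_gap = frequency_gap_points(0.01, 0.0004)

# Hypothetical common word: expected in 30% of abstracts, observed in 34.1%.
common_ratio = frequency_ratio(0.341, 0.30)
common_gap = frequency_gap_points(0.341, 0.30)

print(f"rare word:   ratio {rare_ratio:.1f}x, gap {rare_gap:.2f} points")
print(f"common word: ratio {common_ratio:.2f}x, gap {common_gap:.2f} points")
```

In this toy example the rare word has a huge ratio but a gap of under one percentage point, while the common word has a modest ratio but a gap of about four points. Either measure alone would miss one of the two.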

While some of these upticks could be at least partially attributed to the natural evolution of language, others appear to be related to the use of LLMs. Frankly, I’m surprised that “crucial” is only up 2.6 percentage points, as it’s one of the words that I’ve noticed ChatGPT likes to use – a lot.

I’ll have to admit I haven’t read the entire 13-page paper as I have a lot on my plate at the moment (but I’ll save it in case I’m having trouble sleeping tonight!). 😉 With most AI writing detectors having difficulty accurately identifying LLM-generated text, the use of words that could identify generative AI text may be your best bet for sniffing it out!

So, what do you think? Have you run across the use of text that you felt was LLM-generated and not disclosed? Please share any comments you might have or if you’d like to know more about a particular topic.

Image created using GPT-4o’s Image Creator Powered by DALL-E, using the term “robot looking at a computer with the word “delves” on it”. I tried six times to get an image that spelled “delves” and not “deves”, without success. Apparently, DALL-E and ChatGPT don’t share the same fondness for the word “delves”! 😀

Disclaimer: The views represented herein are exclusively the views of the authors and speakers themselves, and do not necessarily represent the views held by my employer, my partners or my clients. eDiscovery Today is made available solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscovery Today should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.


Discover more from eDiscovery Today by Doug Austin