OpenAI is Now Getting Hit With Copyright Lawsuits. No Joke!: Artificial Intelligence Trends

It was inevitable. OpenAI is now getting hit with copyright lawsuits, including one from a famous comedian. See what she did there? 😉

According to Law & Crime (ChatGPT, Meta used illegal ‘shadow library’ websites to train AI using Sarah Silverman’s ‘Bedwetter’ book: Lawsuit, written by Marisa Sarnoff and available here), in a class action complaint filed in federal court Friday, Sarah Silverman accused tech company OpenAI of using her book “The Bedwetter” to train its ChatGPT software — and, in doing so, violating her copyright. Author Christopher Golden and writer Richard Kadrey joined Silverman in the lawsuit.

According to the complaint, ChatGPT accessed databases of thousands of books in order to “train” its programs (called “large language models,” or LLMs) “by copying massive amounts of text and extracting expressive information from it.” This training, the lawsuit explains, is the key to allowing ChatGPT to “emit convincingly naturalistic text outputs in response to user prompts.”

The problem, however, is that the “training” material — including, allegedly, Silverman’s book — is under copyright, and may have been pulled from databases of copyrighted works without permission.

“Plaintiffs and Class members did not consent to the use of their copyrighted books as training material for ChatGPT,” the lawsuit says. “Nonetheless, their copyrighted materials were ingested and used to train ChatGPT.”

According to the complaint, it is believed that “the reason ChatGPT can accurately summarize a certain copyrighted book is because that book was copied by OpenAI and ingested by the underlying” LLM. Silverman’s lawyers concluded that “The Bedwetter” must have been part of the dataset because when ChatGPT was asked to summarize Silverman’s book, the program did exactly that.

“Indeed, when ChatGPT is prompted, ChatGPT generates summaries of Plaintiffs’ copyrighted works—something only possible if ChatGPT was trained on Plaintiffs’ copyrighted works,” the complaint says.

The lawsuit notes that “[t]he summaries get some details wrong,” which is expected due to the nature of LLMs. “Still, the rest of the summaries are accurate, which means that ChatGPT retains knowledge of particular works in the training dataset and is able to output similar textual content.”

The lawsuit (the filing of which is contained within the article) implies that the plaintiffs’ books were included in online book databases without permission and that ChatGPT drew its LLM training from those databases. One database called Project Gutenberg, described as an “online archive of e-books whose copyright has expired,” allegedly boasted about having “over 60,000 titles” as of September 2020, the complaint says, noting that ChatGPT had previously acknowledged that one of the datasets was based on a collection of around 63,000 titles. The lawsuit suggests that a second dataset used by ChatGPT is based on so-called “shadow library” websites that are “flagrantly illegal” for their unauthorized sharing of copyrighted material, and comprises almost 300,000 titles.

By the way, Silverman, Golden, and Kadrey made similar allegations in a lawsuit filed against Meta last Friday over what they said were similar actions taken with its LLaMA AI writing software.

However, that’s NOT the first class action filed against OpenAI (this one might be), and it’s not even the first copyright class action filed against the company. As reported by The Guardian, Mona Awad, whose books include Bunny and 13 Ways of Looking at a Fat Girl, and Paul Tremblay, author of The Cabin at the End of the World, filed their class action complaint in a San Francisco federal court in the last week of June.

OpenAI is now getting hit with copyright lawsuits – and there will probably be lots more to come. No joke!

So, what do you think? Are you surprised that OpenAI is now getting hit with copyright lawsuits? Or surprised that it took this long? Please share any comments you might have, or let me know if you’d like to know more about a particular topic.

Disclaimer: The views represented herein are exclusively the views of the authors and speakers themselves, and do not necessarily represent the views held by my employer, my partners or my clients. eDiscovery Today is made available solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscovery Today should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.

6 comments

  1. As noted in the Guardian article and elsewhere, in all of these cases, it may be difficult to prove that authors have suffered financial losses specifically because of ChatGPT being trained on copyrighted material, even if the latter turned out to be true.

    And that’s because ChatGPT may work “exactly the same” if it had not ingested the books because it is trained on a wealth of internet information that includes, for example, internet users discussing the books.

    These cases are not slam dunks. We are going to see some *novel* briefs and responses. And it will be interesting to see what, if anything, OpenAI must disclose about its databases and its gears and levers, because then you are brushing up against trade secret laws. I am working my way through all the briefs in the current crop of lawsuits and some surprising possibilities. But no time for a post 🤷🏻‍♂️

    But I can see where AI could … could … eventually resemble what happened with digital music and TV and movies and comply with copyright law. They will be based on licensed data, with the sources disclosed.

    But this will be a long game 🍿
