It was inevitable. OpenAI is now getting hit with copyright lawsuits, including one from a famous comedian. See what she did there?
According to Law & Crime (ChatGPT, Meta used illegal ‘shadow library’ websites to train AI using Sarah Silverman’s ‘Bedwetter’ book: Lawsuit, written by Marisa Sarnoff and available here), in a class action complaint filed in federal court Friday, Sarah Silverman accused tech company OpenAI of using her book “The Bedwetter” to train its ChatGPT software and, in doing so, violating her copyright. Author Christopher Golden and writer Richard Kadrey joined Silverman in the lawsuit.
According to the complaint, ChatGPT accessed databases of thousands of books in order to “train” its programs (called “large language models,” or LLMs) “by copying massive amounts of text and extracting expressive information from it.” This training, the lawsuit explains, is the key to allowing ChatGPT to “emit convincingly naturalistic text outputs in response to user prompts.”
The problem, however, is that the “training” material (including, allegedly, Silverman’s book) is under copyright, and may have been pulled from databases of copyrighted works without permission.
“Plaintiffs and Class members did not consent to the use of their copyrighted books as training material for ChatGPT,” the lawsuit says. “Nonetheless, their copyrighted materials were ingested and used to train ChatGPT.”
According to the complaint, it is believed that “the reason ChatGPT can accurately summarize a certain copyrighted book is because that book was copied by OpenAI and ingested by the underlying” LLM. Silverman’s lawyers concluded that “The Bedwetter” must have been part of the dataset because when ChatGPT was asked to summarize Silverman’s book, the program did exactly that.
“Indeed, when ChatGPT is prompted, ChatGPT generates summaries of Plaintiffs’ copyrighted works – something only possible if ChatGPT was trained on Plaintiffs’ copyrighted works,” the complaint says.
The lawsuit notes that “[t]he summaries get some details wrong,” which is expected due to the nature of LLMs. “Still, the rest of the summaries are accurate, which means that ChatGPT retains knowledge of particular works in the training dataset and is able to output similar textual content.”
The lawsuit (the filing of which is contained within the article) implies that the plaintiffs’ books were included in online book databases without permission and that ChatGPT drew its LLM training from those databases. One database called Project Gutenberg, described as an “online archive of e-books whose copyright has expired,” allegedly boasted about having “over 60,000 titles” as of September 2020, the complaint says, noting that ChatGPT had previously acknowledged that one of the datasets was based on a collection of around 63,000 titles. The lawsuit suggests that a second dataset used by ChatGPT is based on so-called “shadow library” websites that are “flagrantly illegal” for their unauthorized sharing of copyrighted material, and comprises almost 300,000 titles.
By the way, Silverman, Golden, and Kadrey made similar allegations in a lawsuit filed against Meta last Friday over what they said were similar actions taken with its LLaMA AI writing software.
However, that’s not only NOT the first class action filed against OpenAI (this one might be), it’s not even the first copyright class action filed against them. As reported by The Guardian, Mona Awad, whose books include Bunny and 13 Ways of Looking at a Fat Girl, and Paul Tremblay, author of The Cabin at the End of the World, filed their class action complaint in a San Francisco federal court the last week of June.
OpenAI is now getting hit with copyright lawsuits, and there will probably be lots more to come. No joke!
So, what do you think? Are you surprised that OpenAI is now getting hit with copyright lawsuits? Or surprised that it took this long? Please share any comments you might have or if you’d like to know more about a particular topic.
Disclaimer: The views represented herein are exclusively the views of the authors and speakers themselves, and do not necessarily represent the views held by my employer, my partners or my clients. eDiscovery Today is made available solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscovery Today should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.
As noted in the Guardian article and elsewhere, in all of these cases, it may be difficult to prove that authors have suffered financial losses specifically because of ChatGPT being trained on copyrighted material, even if the latter turned out to be true.
And that’s because ChatGPT may work “exactly the same” if it had not ingested the books because it is trained on a wealth of internet information that includes, for example, internet users discussing the books.
These cases are not slam dunks. We are going to see some *novel* briefs and responses. And it will be interesting to see what, if anything, OpenAI must disclose about its databases and its gears and levers, because then you are brushing up against trade secret laws. I am working my way through all the briefs in the current crop of lawsuits and seeing some surprising possibilities. But no time for a post.
But I can see where AI could … could … eventually resemble what happened with digital music and TV and movies and comply with copyright law. They will be based on licensed data, with the sources disclosed.
But this will be a long game.