A new report from a plagiarism detector found that nearly 60 percent of GPT 3.5 outputs contained some form of plagiarized content.
In the wake of several lawsuits alleging that AI infringes copyright and potentially plagiarizes, educational institutions and enterprises across the globe are questioning the authenticity of AI-generated text: Where did it originate? Is it safe to use as original content? Does AI plagiarize?
To find out, plagiarism detector Copyleaks conducted an analysis to determine the degree to which AI-generated content is original and free of potential plagiarism. To conduct the analysis, Copyleaks asked GPT 3.5 to write 1,045 outputs, averaging 412 words each, across 26 subjects. The resulting report is available here.
The findings? Nearly 60 percent of GPT 3.5 outputs (59.7 percent, to be exact) contained plagiarized content: 45.7 percent of all outputs contained identical text, 27.4 percent contained minor changes, and 46.5 percent contained paraphrased text. (The categories overlap, which is why the percentages sum to more than 100.)
Copyleaks uses a specific scoring method (called the Similarity Score) that aggregates the rate of identical text, minor changes, paraphrased text, and more. A score of 0% signifies that all the content is original, whereas a score of 100% means that none of the content is original.
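Copyleaks hasn't published the exact formula behind the Similarity Score, but the general idea of aggregating category rates into a single 0–100 score can be illustrated with a toy sketch. The weights and function name below are hypothetical assumptions for illustration, not Copyleaks' actual method:

```python
# Illustrative only: a toy aggregation in the spirit of a "Similarity Score".
# The weights and categories here are assumptions, not Copyleaks' formula.

def similarity_score(identical_pct, minor_changes_pct, paraphrased_pct,
                     weights=(1.0, 0.8, 0.5)):
    """Combine per-category rates (each 0-100) into one 0-100 score.

    0 means the content is entirely original; 100 means none of it is.
    """
    w_id, w_minor, w_para = weights
    raw = (identical_pct * w_id
           + minor_changes_pct * w_minor
           + paraphrased_pct * w_para)
    return min(raw, 100.0)  # cap so overlapping categories can't exceed 100

# Example: an output that is 20% identical, 10% minor changes, 10% paraphrased
print(similarity_score(20, 10, 10))  # 33.0
```

The point of weighting is simply that verbatim matches count more heavily toward non-originality than paraphrased passages; any real detector would use far more signals than these three rates.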
Among the 26 subjects, the subject with the highest average Similarity Score was Physics at 31.3%, followed closely by Psychology at 27.7% and Science at 26.7%. The subjects with the lowest average Similarity Score were Theater at 0.9%, Humanities at 2.8%, and English Language at 5.4%.
Of course, this was a study of GPT 3.5 outputs (which is what most people have used as part of the free version of ChatGPT), not GPT 4.0. Will they do a study of 4.0 as well? We’ll see.
So, what do you think? Are you concerned that the report found nearly 60 percent of GPT 3.5 outputs contained some form of plagiarized content? Please share any comments you might have or if you’d like to know more about a particular topic.
Disclaimer: The views represented herein are exclusively the views of the author, and do not necessarily represent the views held by my employer, my partners or my clients. eDiscovery Today is made available solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscovery Today should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.

Well, that’s the issue, isn’t it? Given how ChatGPT works, it’s conceivable that it might unintentionally generate text that mirrors the exact style or content of its training data, unbeknownst to users. This is one of the key issues in the NYTimes case. It’s a complicated issue that lacks a simple solution and raises critical questions about how society perceives and defines plagiarism in the context of AI-generated text.
Such a gray area exists because AI models like ChatGPT don’t consciously “know” or “intend” anything – they merely produce output based on their training. This contrasts with human plagiarism, where there is a deliberate act. With AI, the concept of “intent” is irrelevant, prompting reconsideration of what constitutes plagiarism in this emerging context.
Moving forward, the implications of licensing and copyright in AI’s context are undoubtedly complicated. AI, being in its infancy, operates in a murky realm regarding copyright law. Modern laws were not formulated with AI in mind, posing a challenge to their application to this new technology.
While copyright laws may differ across jurisdictions, they typically don’t apply to data used in machine learning processes. However, the final output could potentially violate copyright if it strongly resembles copyrighted material. This is the hill the NYTimes plaintiffs are trying to climb.