We tend to assume that AI models will continue to get better, but we may be hitting a plateau. If AI doesn’t keep getting better forever, then what?
As discussed in the Ars Technica article What if AI doesn’t just keep getting better forever? by Kyle Orland (available here), new reports highlight fears of diminishing returns for traditional LLM training.
A weekend report from The Information summarized how these fears are manifesting among a number of insiders at OpenAI. Unnamed OpenAI researchers told The Information that Orion, the company’s codename for its next full-fledged model release, is showing a smaller performance jump than the one seen between GPT-3 and GPT-4. On certain tasks, in fact, the upcoming model “isn’t reliably better than its predecessor,” according to the researchers cited in the piece.
On Monday, OpenAI co-founder Ilya Sutskever, who left the company earlier this year, added to the concerns that LLMs are hitting a plateau in what can be gained from traditional pre-training. Sutskever told Reuters that “the 2010s were the age of scaling,” when throwing additional computing resources and training data at the same basic training methods could lead to impressive improvements in subsequent models.
“Now we’re back in the age of wonder and discovery once again,” Sutskever told Reuters. “Everyone is looking for the next thing. Scaling the right thing matters more now than ever.”
Guess what a big part of the problem is? Quality data to train on. According to experts and insiders, model makers face a shortage of new, high-quality textual data for new LLMs to train on. At this point, they may have already picked the lowest-hanging fruit from the vast troves of text available on the public Internet and in published books.
Research outfit Epoch AI tried to quantify this problem in a paper earlier this year, measuring the rate of increase in LLM training data sets against the “estimated stock of human-generated public text.” After analyzing those trends, the researchers estimate that “language models will fully utilize this stock [of human-generated public text] between 2026 and 2032,” leaving precious little runway for just throwing more training data at the problem.

OpenAI and other companies have already begun pivoting to training on synthetic data (created by other models) in an attempt to push past this quickly approaching training wall. But that has led to concerns about “model collapse,” where output quality degrades after several generations of training on model-generated data.
The world’s data has exploded over the past 20 years, but quality data to train AI models is still in short supply. If AI doesn’t keep getting better, it will be due to the lack of quality data. It’s always about the data.
So, what do you think? Do you think we’re heading for a plateau in the advancement of AI models? Please share any comments you might have or if you’d like to know more about a particular topic.
Image created using GPT-4o’s Image Creator Powered by DALL-E, using the term “robot in a lake treading water”.
Disclaimer: The views represented herein are exclusively the views of the authors and speakers themselves, and do not necessarily represent the views held by my employer, my partners or my clients. eDiscovery Today is made available solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscovery Today should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.
Discover more from eDiscovery Today by Doug Austin



