We Could Run Out of Data by 2026. I’m Serious!: Artificial Intelligence Trends

Preposterous, you say, given the Big Data numbers we’ve all seen? But we could run out of data by 2026 – seriously. AI training data, that is.

That’s what this article from Real KM (Researchers warn we could run out of data to train AI by 2026. What then?, written by Rita Matulionyte and available here) says. As the author notes: “We need a lot of data to train powerful, accurate and high-quality AI algorithms. For instance, ChatGPT was trained on 570 gigabytes of text data, or about 300 billion words.”

That’s nothing – GPT-4 reportedly has 1.76 trillion parameters! (Parameters measure the size of the model rather than its training data, but a bigger model generally demands even more data to train.)

Similarly, the Stable Diffusion algorithm (which is behind many AI image-generating apps such as DALL-E, Lensa and Midjourney) was trained on the LAION-5B dataset, comprising 5.8 billion image-text pairs. If an algorithm is trained on an insufficient amount of data, it will produce inaccurate or low-quality outputs.

The quality of the training data is even more important. Low-quality data, such as social media posts or blurry photographs, is easy to source but isn’t sufficient to train high-performing AI models.

The author references the infamous Microsoft Tay incident as to why social media content isn’t good data for training AI models (no argument there).

According to the author, this is why AI developers seek out high-quality content such as text from books, online articles, scientific papers, Wikipedia, and certain filtered web content. The Google Assistant was trained on 11,000 romance novels taken from self-publishing site Smashwords to make it more conversational.

I don’t know about you, but nothing screams “high-quality content” to me more than a good romance novel! 😉

Of course, we’re training the models as we go, which was part of my consideration for this morning’s post on five ways to get more out of AI chatbots like ChatGPT. The more we use these tools well, the more useful they will become for us.

The real issue isn’t that “we could run out of data by 2026”; it’s the scarcity of high-quality training data. The author discusses three potential solutions, one of which is synthetic data (i.e., data created by AI to train AI). Based on my reading, synthetic data could become the predominant source of AI training data, if it isn’t already. We’ll see if it will be the answer. Based on what AI appears to be doing to Google, we have a lot of work to do. 🙁
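For readers who want a feel for what “data created by AI to train AI” means in practice, here’s a deliberately tiny Python sketch. A “teacher” (here just a hypothetical template generator standing in for a large model) produces labeled examples, and a “student” model learns only from that generated data. The vocabulary, labels, and function names are all made up for illustration; real synthetic-data pipelines use large generative models, not templates.

```python
import random

random.seed(0)

# Hypothetical "teacher": generates labeled synthetic examples.
# (In real pipelines, a large model plays this role.)
POSITIVE = ["great", "excellent", "wonderful"]
NEGATIVE = ["terrible", "awful", "dreadful"]

def generate_synthetic_example():
    label = random.choice(["pos", "neg"])
    word = random.choice(POSITIVE if label == "pos" else NEGATIVE)
    return f"the product was {word}", label

# Build a synthetic training set of 200 labeled sentences.
synthetic_data = [generate_synthetic_example() for _ in range(200)]

# "Student": a naive word-count model trained only on synthetic data.
counts: dict[str, dict[str, int]] = {}
for text, label in synthetic_data:
    for word in text.split():
        counts.setdefault(word, {"pos": 0, "neg": 0})
        counts[word][label] += 1

def classify(text: str) -> str:
    # Score = (positive co-occurrences) - (negative co-occurrences).
    score = 0
    for word in text.split():
        c = counts.get(word, {"pos": 0, "neg": 0})
        score += c["pos"] - c["neg"]
    return "pos" if score > 0 else "neg"

print(classify("an excellent result"))
print(classify("an awful experience"))
```

The catch, of course, is the one the article hints at: the student can only be as good as the teacher’s output, which is why quality control of synthetic data matters so much.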

So, what do you think? Are you concerned about the availability of good training data for AI models? Please share any comments you might have or if you’d like to know more about a particular topic.

Image created using GPT-4’s Image Creator Powered by DALL-E, using the term “puzzled robots looking at a blank computer screen in an office”.

Disclaimer: The views represented herein are exclusively the views of the authors and speakers themselves, and do not necessarily represent the views held by my employer, my partners or my clients. eDiscovery Today is made available solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscovery Today should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.


Discover more from eDiscovery Today by Doug Austin
