Big Data Isn’t Big Enough for AI Models: Artificial Intelligence Trends

In today’s Big Data era, data is growing faster than ever. But Big Data isn’t big enough for AI models, so the companies building them are getting creative to find more of it.

This past weekend, The New York Times published an article (How Tech Giants Cut Corners to Harvest Data for A.I., written by Cade Metz, Cecilia Kang, Sheera Frenkel, Stuart A. Thompson and Nico Grant and available here). In it, the authors detailed how OpenAI, Google and Meta are resorting to desperate measures to find reputable English-language text sources of data they can use to train their models.

OpenAI got so creative that its researchers built a speech recognition tool called Whisper to transcribe the audio from YouTube videos, yielding fresh conversational text to make an AI system smarter. In fact, OpenAI transcribed over a million hours of YouTube videos to train GPT-4.


Great, right? Not if you care about YouTube’s rules (which prohibit use of its videos for applications that are “independent” of the video platform), or copyright law. YouTube is owned by Google. So, why isn’t Google making a big fuss about it?

Probably because they’re doing it too. In fact, Google changed its privacy policy terms last year over the Fourth of July weekend, when most people were focused on the holiday, to indicate that it may “use publicly available information to help train Google’s language AI models and build products and features like Google Translate, Bard, and Cloud AI capabilities.”

We’ve already seen copyright lawsuits, as The Times noted, citing their own lawsuit against OpenAI and Microsoft filed late last year. More than 10,000 trade groups, authors, companies and others submitted comments last year about the use of creative works by AI models to the Copyright Office, a federal agency that is preparing guidance on how copyright law applies in the AI era.

What do you do when Big Data isn’t big enough for AI models? You start making it yourself – via synthetic data, which is then used to train the models. In other words, the models are – or will be – creating their own training data. Can that work effectively? We’ll see.
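As a toy illustration of the idea (not any vendor’s actual pipeline), a model can be trained on a corpus, sampled to produce synthetic text, and that synthetic text folded back into its own training data. The sketch below uses a simple bigram Markov model in Python purely to make the loop concrete; the corpus and function names are invented for the example:

```python
import random

def train_bigram(corpus):
    """Build a bigram table: each word -> list of words observed after it."""
    table = {}
    for sentence in corpus:
        words = sentence.split()
        for a, b in zip(words, words[1:]):
            table.setdefault(a, []).append(b)
    return table

def generate(table, start, max_len=10, seed=0):
    """Sample a synthetic sentence by walking the bigram table."""
    rng = random.Random(seed)
    words = [start]
    while len(words) < max_len and words[-1] in table:
        words.append(rng.choice(table[words[-1]]))
    return " ".join(words)

corpus = [
    "big data trains big models",
    "big models need big data",
    "models generate synthetic data",
]
table = train_bigram(corpus)
synthetic = generate(table, "big")

# The synthetic sentence can now be appended to the corpus and the
# model retrained on it -- the model feeding on its own output.
corpus.append(synthetic)
```

Whether that loop improves a model or degrades it (the so-called “model collapse” concern) is exactly the open question the labs are now wrestling with.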


Hat tip to Greg Bufithis for the heads up on the (lengthy, but interesting) NYT article!

So, what do you think? Are you surprised that Big Data isn’t big enough for AI models? Please share any comments you might have, or let me know if you’d like to know more about a particular topic.

Image created using Bing Image Creator Powered by DALL-E, using the term “dozens of duplicate robots standing side by side”.

Disclaimer: The views represented herein are exclusively the views of the author, and do not necessarily represent the views held by my employer, my partners or my clients. eDiscovery Today is made available solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscovery Today should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.


Discover more from eDiscovery Today by Doug Austin

Subscribe to get the latest posts sent to your email.
