Request for AI Training Data Partially Granted by Court: eDiscovery Case Law

In Kadrey v. Meta Platforms, Inc., No. 23-cv-03417-VC (TSH) (N.D. Cal. Jan. 17, 2025), California Magistrate Judge Thomas S. Hixson granted in part Plaintiffs’ request for AI training data, ordering Meta to “produce the post-training datasets used to train and finetune the Llama models specifically in reference to the ‘Intellectual Property’ safety category” and to “produce the SFT data identified by Bashlykov if it was used to finetune one or more Llama models for the intellectual property safety category.”

Case Discussion and Judge’s Ruling

In this case involving claims of copyright infringement against Meta over its Llama AI models, the Court ordered the parties to file a supplemental letter brief concerning RFP 118, which requested “[a]ll Documents and Communications, including source code, relating to any efforts, attempts, or measures implemented by Meta to prevent Llama Models from emitting or outputting copyrighted material”. The parties filed that brief.

In their request for AI training data, Plaintiffs sought four categories of data:

  • The supervised fine-tuning data that Meta research engineer Nikolay Bashlykov testified about at pages 144-46 of his deposition.
  • The post-training datasets used to train and fine-tune the Llama models specifically in reference to the “Intellectual Property” safety category.
  • Post-training datasets comprising books sourced from the at-issue shadow datasets that are used for other safety categories.
  • Any additional post-training datasets sourced from shadow datasets and used by Meta to fine-tune its Llama models to minimize their ability to memorize or output training data verbatim.

As to all four categories, Plaintiffs sought both the raw/original data from which these post-training datasets were created and the data as specifically formed or constituted for use in post-training the Llama models.

Starting with the second category, Judge Hixson stated: “The Court thinks these datasets are responsive to RFP 118. It’s true that RFP 118 did not use the words ‘training data,’ but it did request ‘[a]ll Documents and Communications, including source code, relating to any efforts, attempts, or measures implemented by Meta to prevent Llama Models from emitting or outputting copyrighted material.’ The requested datasets are certainly ‘documents,’ and they seem to relate to efforts or attempts by Meta to prevent the Llama models from outputting copyrighted material.”

Continuing, he said: “The Court also thinks Plaintiffs have made a sufficient showing of relevance. Plaintiffs’ copyright claim is about Meta’s use of their copyrighted materials to train the Llama models. Plaintiffs allege that ‘Meta made copies of the Infringed Works during the training process of the Llama 1 and Llama 2 language models without Plaintiffs’ permission.’…The Llama 2 and 3 papers make clear that fine-tuning is part of the training process for the Llama models. Here, a big factual dispute between the parties is whether the fine-tuning data consists of the copyrighted works themselves. Plaintiffs strenuously argue that it does, and Meta denies this.”

Finding Bashlykov’s deposition testimony to be “ambiguous”, Judge Hixson also stated: “Plaintiffs cannot definitively prove that the fine-tuning datasets contain infringing works because they don’t have them. The Court is mindful that Plaintiffs are not required to prove their case on the merits in order to obtain discovery. Rather, Plaintiffs’ burden on a motion to compel is to show that the requested discovery is a worthwhile endeavor in view all of the factors in Rule 26(b)(1). Here, Plaintiffs have made a sufficient factual showing that the use of datasets that contain copyrighted works to create datasets that were used in fine-tuning the Llama models concerning intellectual property violations may have or could have resulted in portions of the copyrighted works ending up in the fine-tuning datasets, such that Plaintiffs are entitled to learn if that in fact happened.” So, he granted this category of the request for AI training data.

After denying the request for the third and fourth categories, Judge Hixson considered the first category, stating: “There is no evidence before the Court concerning whether this data was used to fine-tune for the intellectual property safety classification. If it was, it’s relevant and responsive. But if it was used only to fine-tune for other safety categories, then it’s not responsive to RFP 118, which is about preventing the Llama models from emitting or outputting copyrighted material.”

Finally, regarding Plaintiffs’ request for both the raw/original data from which the post-training datasets were created and the data as specifically formed or constituted for use in the post-training of the Llama models, Judge Hixson stated: “Section 3.1 of the Llama 2 paper suggests that the raw or original data is not only massive compared to the datasets actually used, but also of dubious value… Section 5.4.7 of the Llama 3 paper also states that Meta did ‘extensive cleaning’ of collected samples to improve the performance of Llama Guard 3. The Court therefore concludes that the raw or original data from which the post-training datasets were created is not proportional to the needs of the case.” As a result, Judge Hixson granted Plaintiffs’ request for AI training data in part.

So, what do you think? Do you think we’ll see more requests for AI training data in litigation? Please share any comments you might have or if you’d like to know more about a particular topic.

Case opinion link courtesy of eDiscovery Assistant, an Affinity partner of eDiscovery Today.

Disclaimer: The views represented herein are exclusively the views of the author, and do not necessarily represent the views held by my employer, my partners or my clients. eDiscovery Today is made available solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscovery Today should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.

