Effectively Organizing Documents with GenAI Review: eDiscovery Trends

Here’s a terrific case study on practical testing for effectively organizing documents with GenAI review for legal workflows.

Published by EDRM last week (Case Study: Practical Testing for Effectively Organizing Documents with GenAI Review, available here), the case study was authored by Tara S. Emory, Special Counsel at Covington & Burling LLP, an experienced legal technology lawyer focusing on AI legal applications, eDiscovery, and Information Governance. Conducted by Tara using Relativity's aiR for Review, it covers the practical application and evaluation of GenAI for organizing legal documents in discovery and outlines a systematic testing framework to assess GenAI's performance in categorizing over 7,000 conceptually similar documents across nine nuanced sub-issues.

The study addressed a “relatively urgent” need to organize 7,100 documents, previously tagged with a broad issue, into nine more specific, nuanced, and interrelated sub-issues. Traditional review methods were anticipated to lead to inconsistency and be time-consuming, given the complexity and the need for high-level subject matter expertise. So, the team decided to use aiR’s GenAI-based Issues Review, which uses GenAI to review documents and predict responsiveness to multiple issues, based on user prompts.


aiR scores documents on a 1-4 scale, where 1 = not relevant, 2 = borderline, 3 = relevant, and 4 = very relevant. While a “standard relevance cutoff score of 3” was used for most issues, Tara noted that “for Issues 2 and 7, we included the borderline documents with a score of 2.” This decision was based on testing showing it “captured significantly more responsive documents with reasonable false positive increases, justifying the tradeoff.”
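To illustrate the kind of tradeoff analysis behind that cutoff decision, here is a minimal sketch in Python. The scores, labels, and data structure are hypothetical and purely illustrative; this is not aiR's actual export format or API.

```python
# Hypothetical sketch: comparing a cutoff of 3 vs. including borderline
# scores of 2 for a single issue, using a small labeled test set.
# Field names and data are illustrative, not from aiR.

test_set = [
    # (aiR score for the issue, attorney judgment: responsive?)
    (4, True), (3, True), (3, False), (2, True),
    (2, True), (2, False), (1, False), (1, True),
]

def tally(cutoff):
    tp = sum(1 for score, responsive in test_set if score >= cutoff and responsive)
    fp = sum(1 for score, responsive in test_set if score >= cutoff and not responsive)
    fn = sum(1 for score, responsive in test_set if score < cutoff and responsive)
    return tp, fp, fn

for cutoff in (3, 2):
    tp, fp, fn = tally(cutoff)
    print(f"cutoff {cutoff}: {tp} true positives, {fp} false positives, {fn} missed")

# If lowering the cutoff to 2 captures significantly more responsive documents
# with only a modest increase in false positives, the tradeoff may be justified,
# which is the reasoning the study applied to Issues 2 and 7.
```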

As Tara noted: “Unlike validation-driven TAR protocols often developed with defensibility concerns in mind, our objective was to develop a practical and efficient method to evaluate whether prompts were performing well enough for document organization, prioritization, and issue understanding. Formal validation across nine issues would have been too time-intensive and not aligned with our goals.”

The study employed a three-tier testing framework to evaluate GenAI performance and prevent overfitting:

  • Test Set 1 (Curated Development Set): Comprised 10-15 example documents per issue category, “carefully selected by our subject matter expert to represent the full range of responsiveness.” This served as the “primary development set, where we iteratively refined and measured prompts across multiple rounds of testing.”
  • Test Set 2 (Curated Check for Overfitting): “Drawn from the same curated pool as Test Set 1” but kept separate during initial prompt development, it was used to “check against the risk of overfitting” – ensuring the AI learned concepts rather than just memorizing examples from Test Set 1.
  • Test Set 3 (Representative Performance Test): A limited random sample from the broader review set, more representative of real-world documents, including those not responsive to the sub-issues. It provided an “idea of how the system would perform on responsive documents in the overall review set.”

Each test set helped the team answer a different question (a simple sketch of the split follows the list below):

  • Test Set 1: Can we teach the system our concept?
  • Test Set 2: Did the system learn the concept or memorize examples? If it only memorized examples, can we further iterate and improve it?
  • Test Set 3: How will the system perform in practice?
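To make the tiering concrete, here is a minimal sketch of how such a split might look in code. The document names, set sizes, and random selection are invented for illustration; in the study, the curated examples were hand-selected by a subject matter expert, not sampled programmatically.

```python
# Rough sketch of the three-tier split described above.
# All values are placeholders, not data from the case study.

import random

curated_examples = [f"curated_doc_{i}" for i in range(25)]   # SME-selected examples
broader_review_set = [f"doc_{i}" for i in range(7100)]       # full population

random.shuffle(curated_examples)
test_set_1 = curated_examples[:15]   # development set: iterate prompts here
test_set_2 = curated_examples[15:]   # held-out curated set: overfitting check
test_set_3 = random.sample(broader_review_set, 100)  # representative sample

# Test Set 1 answers "can we teach the concept?", Test Set 2 answers
# "did it learn the concept or just memorize examples?", and Test Set 3
# estimates how the prompts will behave on real-world documents.
```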

The focus was on “raw document counts instead of statistical measures” like recall and precision, as the small, curated sets could not be reliably extrapolated for overall performance estimates. Raw counts provided “clearer context for the practical significance of our results.”
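As a rough illustration of that choice, here is a small hypothetical example of why ratios computed on a curated set are less informative than the counts themselves. The numbers are invented, not taken from the study.

```python
# Illustrative only: recall and precision are easy to compute on a small,
# hand-curated test set, but because that set is not a random sample of the
# 7,100-document population, the ratios cannot be extrapolated to it.

tp, fp, fn = 11, 2, 3   # hypothetical tallies from one issue's curated set

precision = tp / (tp + fp)   # share of predicted-responsive docs that truly are
recall = tp / (tp + fn)      # share of truly responsive docs that were found

print(f"precision {precision:.2f}, recall {recall:.2f}")
print(f"raw counts: found {tp}, missed {fn}, {fp} false positives")

# The raw counts ("found 11 of 14, with 2 false positives") convey practical
# significance directly, without implying population-level estimates the
# sample cannot support.
```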

Prompt refinement involved “four rounds of prompting” focused on “increasing true positives” (finding more relevant documents) and “decreasing aiR’s incorrect predictions of responsiveness for documents that were not responsive (i.e., reducing false positives).” Senior attorneys provided specific feedback on misclassified documents to refine prompts. Here’s an illustration of the four rounds of prompt iterations for Test Set 1.

Test Set 1: Four Rounds of Prompt Iterations (Standard Cutoff Score) (Courtesy EDRM)

As Tara noted, the iterative process “revealed how we could improve our aiR results through a hybrid approach, supplementing or narrowing results with other search methods, including metadata filters and targeted search terms.”

So, what were the results? “In all, of the 7,100 documents, aiR analyzed 6,622 without error. Of those, using the standard cutoff scores, it predicted 2,004 documents contained at least one issue, and 4,618 contained none. A sample of those documents predicted not responsive was 9% relevant, though those responsive documents in the sample were of low importance.”

Tara also noted: “Our testing showed that GenAI review worked better for some issues than others. Issues requiring nuanced content analysis generally performed well. Issues with a rules-based component such as multiple criteria (e.g., mentioning a company plus discussing a topic), or dependent on date, had more limited performance. For example, Issue 8 was tied to specific date ranges and particular parties, and persistently produced higher false positive rates. As the nature of the documents became clearer through our reviews of iterative prompt results, we identified which issues would benefit more from metadata and keyword searches than additional prompt refinement. We supplemented Issue 5 with additional searches, while Issues 8 and 9 required both narrowing and supplementation.”
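As a loose illustration of that hybrid approach, here is a hypothetical sketch of supplementing one issue's GenAI predictions with keyword hits and narrowing another's with date metadata. The document structure, issue numbers, search terms, and date range are invented for illustration and do not come from the case study.

```python
# Hypothetical sketch: treating GenAI predictions as one search input that can
# be supplemented or narrowed with metadata filters and keyword hits.

from datetime import date

docs = [
    {"id": "A1", "date": date(2020, 3, 1), "text": "quarterly pricing discussion",
     "air_issues": {5}},
    {"id": "A2", "date": date(2022, 7, 9), "text": "pricing memo for Acme",
     "air_issues": {8}},
    {"id": "A3", "date": date(2020, 5, 2), "text": "routine HR announcement",
     "air_issues": set()},
]

# Supplement: add keyword hits to the GenAI-predicted set for "Issue 5".
issue5 = {d["id"] for d in docs if 5 in d["air_issues"] or "pricing" in d["text"]}

# Narrow: keep "Issue 8" predictions only within a relevant date range.
in_range = lambda d: date(2022, 1, 1) <= d["date"] <= date(2022, 12, 31)
issue8 = {d["id"] for d in docs if 8 in d["air_issues"] and in_range(d)}

print("Issue 5 candidates:", issue5)   # {'A1', 'A2'}
print("Issue 8 candidates:", issue8)   # {'A2'}
```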

The three lessons demonstrated for GenAI review in legal workflows were as follows:

  1. “[T]he upfront investment in prompt development and testing can be substantial (especially for multiple issue prompts), but can enable senior attorneys to scale their expertise for review involving complex, nuanced issues.”
  2. “[S]ystematic monitoring throughout the process can inform both prompt refinement and strategic decisions about score cutoffs.”
  3. “GenAI review is another search tool that can be combined effectively with traditional search methods.”

Again, the full case study (which goes into a lot more depth) is available on the EDRM site here. I also asked Tara a follow-up question about the study and she provided an in-depth response that I will share next week.

So, what do you think? Do you think this case study provides a good methodology for effectively organizing documents with GenAI review? Please share any comments you might have or if you’d like to know more about a particular topic.

Image created using Microsoft Designer, using the term “robot lawyer with nine ‘in’ boxes filled with paper”. Well, sort of. 😉

Disclaimer: The views represented herein are exclusively the views of the author, and do not necessarily represent the views held by my employer, my partners or my clients. eDiscovery Today is made available solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscovery Today should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.

