Sidley’s GPT-4 Test in Document Review: Artificial Intelligence Trends

Some of you have already read about this, but some haven’t, so I felt I should provide a heads-up on Sidley’s GPT-4 test in document review!

The high-level results of the test are captured in this article in American Lawyer/Legaltech® News (Replacing Attorney Review? Sidley’s Experimental Assessment of GPT-4’s Performance in Document Review, written by Colleen M. Kenney, Matt S. Jackson, and Robert D. Keeling and available here).

To better understand the current capabilities of GPT-4 for eDiscovery, Sidley Austin collaborated with Relativity to evaluate how standard GPT-4 would perform in coding documents for responsiveness. The parameters of Sidley’s GPT-4 test in document review included:

  • A prior, closed case in which documents had been coded by human reviewers for responsiveness. The case involved a subpoena related to potential violations of the Anti-Kickback Statute, which requested documents responsive to 19 different document requests.
  • A representative sample of these documents that reflected the richness of the total corpus: 1,500 total documents from the closed case, comprising 500 responsive documents and 1,000 non-responsive documents (see the sampling sketch after this list).
  • Document review instructions for GPT-4 that mirrored the review instructions employed by the attorneys who had reviewed those documents.
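
The article doesn’t describe how the sample was drawn, but a minimal sketch of one way to draw a fixed-mix stratified sample like this from a previously coded corpus (assuming the corpus is simply a list of (doc_id, label) pairs) might look like this:

```python
import random

def stratified_sample(coded_docs, n_responsive=500, n_non_responsive=1000, seed=42):
    """Draw a fixed-size sample from (doc_id, label) pairs, preserving
    a chosen mix of responsive and non-responsive documents."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    responsive = [d for d, label in coded_docs if label == "responsive"]
    non_responsive = [d for d, label in coded_docs if label == "non-responsive"]
    return rng.sample(responsive, n_responsive) + rng.sample(non_responsive, n_non_responsive)
```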

GPT-4 evaluated each document individually, based on the review instructions, and reported whether the document was responsive using a scoring system from -1 to 4 (one possible mapping from these scores to coding calls is sketched after the list), as follows:

  • -1 if the document could not be processed;
  • 0 if the document contained non-responsive junk or no useful information;
  • 1 if the document contained non-responsive or irrelevant information;
  • 2 if the document was likely responsive or contained partial or somewhat relevant information;
  • 3 if the document was responsive; and
  • 4 if the document was responsive and contained direct and strong evidence described in the responsiveness criteria.
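
The article doesn’t say how these scores were converted into binary responsive/non-responsive calls, but one natural cutoff (an assumption on my part, not Sidley’s stated method) is to treat 2 through 4 as responsive, 0 and 1 as non-responsive, and -1 as an exception for manual handling:

```python
def score_to_call(score: int) -> str:
    """Map a -1..4 responsiveness score to a review decision.
    The cutoff at 2 is an assumption; the article doesn't state one."""
    if score == -1:
        return "exception"       # document could not be processed
    if score in (0, 1):
        return "non-responsive"  # junk, no useful info, or irrelevant
    if score in (2, 3, 4):
        return "responsive"      # likely to strongly responsive
    raise ValueError(f"unexpected score: {score}")
```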

The experiment proceeded in two stages:

  • In stage one, Sidley provided GPT-4 with the same review instructions given to the attorneys and collected data on GPT-4’s performance relative to the human review;
  • In stage two, based on the initial output from stage one, the prompt for GPT-4 was modified to address ambiguities in the responsiveness criteria. This mirrored a quality control (QC) feedback loop, providing the same additional information that the contract attorneys had received outside the original review instructions (see the sketch below).
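
The prompts themselves weren’t published, so the following is only a structural sketch of that two-stage design, with classify_document standing in for a hypothetical wrapper around a GPT-4 API call that returns a -1 to 4 score:

```python
def run_stage(docs, instructions, classify_document):
    """Score every document in docs ({doc_id: text}) against one set of
    review instructions. classify_document(instructions, text) is a
    stand-in for a GPT-4 call returning an integer score from -1 to 4."""
    return {doc_id: classify_document(instructions, text)
            for doc_id, text in docs.items()}

def two_stage_review(docs, original_instructions, qc_clarifications, classify_document):
    """Stage one uses the attorneys' original instructions verbatim;
    stage two appends QC clarifications, mirroring the feedback the
    contract attorneys received outside the original instructions."""
    stage_one = run_stage(docs, original_instructions, classify_document)
    refined = original_instructions + "\n\nClarifications:\n" + qc_clarifications
    stage_two = run_stage(docs, refined, classify_document)
    return stage_one, stage_two
```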

Once these experimental adjustments were made, GPT-4 performed well, correctly identifying 75.9% of responsive documents and 84.8% of non-responsive documents. Very respectable results!
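
To put those percentages in context, here’s the approximate confusion matrix they imply for the 1,500-document sample; the counts are rounded from the published rates, so treat them (and the derived precision and F1) as estimates:

```python
# Approximate counts implied by the published rates on the 1,500-doc sample:
# 75.9% of 500 responsive and 84.8% of 1,000 non-responsive correctly coded.
true_pos = round(0.759 * 500)    # responsive docs correctly identified (~380)
false_neg = 500 - true_pos       # responsive docs missed (~120)
true_neg = round(0.848 * 1000)   # non-responsive docs correctly identified (848)
false_pos = 1000 - true_neg      # non-responsive docs coded responsive (~152)

recall = true_pos / (true_pos + false_neg)          # ~0.76
precision = true_pos / (true_pos + false_pos)       # ~0.71
f1 = 2 * precision * recall / (precision + recall)  # ~0.74
print(f"recall={recall:.3f} precision={precision:.3f} f1={f1:.3f}")
```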

There’s a lot more granularity to the results, including some of GPT-4’s limitations, how its throughput compares to TAR, and how it performs on the “little confidence,” “some confidence,” and “high confidence” assessments, but I won’t steal their thunder on the rest of the results. Check them out here. If anything, Sidley’s GPT-4 test in document review shows us that we may be closer to the mainstream use of generative AI tools for document review than we think!


So, what do you think? Do you think we’re close to seeing mainstream use of generative AI tools for document review? Please share any comments you might have, or let me know if you’d like to hear more about a particular topic.

Image created using GPT-4’s Image Creator Powered by DALL-E, using the term “robots reviewing documents on computers”.

Disclaimer: The views represented herein are exclusively the views of the authors and speakers themselves, and do not necessarily represent the views held by my employer, my partners or my clients. eDiscovery Today is made available solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscovery Today should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.


2 comments

  1. If I understand the (incredibly rich) corpus, one in three documents was responsive. So, if instead, the wind blew the corpus into two random piles, the wind would succeed in “identifying” 250 responsive documents in either pile, along with 500 non-responsive ones. ChatGPT succeeded in identifying 380 responsive documents along with 228 non-responsive ones.

    Put differently, ChatGPT missed 24% of the responsive documents, so 120 documents that should have been produced were not, and 228 non-responsive documents that shouldn’t have been produced were. Of course, all of this was (quite rightly) measured against the “gold standard” assessments of human reviewers, implicating other issues, considering that studies have shown that human reviewers aren’t that good at assessing responsiveness.

    It’s great that users are doing these experiments and sharing their metrics. Many thanks to Robert Keeling for contributing sound metrics.

  2. Craig, you can’t have “wind aided” without “AI”… 😉

    Seems to compare quite favorably to the “gold standard” assessments of human reviewers that we’re all familiar with. Of course, it’s only one test. I hope we see many other tests published as well. 🙂
