The predictive coding process remains elusive for many legal professionals: they understand the idea behind predictive coding (delivering documents for review based on the classifications of other documents), but they don't understand the workflow of how the process actually works. Perhaps this predictive coding walkthrough by Vound will help, especially since it's conducted with a case and data set you know well.
Vound recently released a Case Study White Paper titled Understanding Predictive Coding Workflows using Intella Connect (available for download here), which not only discusses the flow of the predictive coding process (using their Intella Connect software, of course), but also provides a predictive coding walkthrough from start to finish using an actual case. The case study is organized into five sections:
Understanding the Data Set Used for Predictive Coding Exercise
The case study sets the stage by identifying the data set used for the predictive coding exercise that Vound conducted: the Enron data set, which (despite a growing push to find a more recent useful public domain data set) is still the best data set in terms of volume (over 2.7 million documents) and similarity to most discovery collections today (which are still predominantly email, Office and other work product files). But this section goes a step further, describing one of the primary accounting irregularities that ultimately brought down Enron, the special purpose entities used to hide Enron's debt, which were given ridiculous names like JEDI and Chewco (after Star Wars characters), LJM (the initials of Enron CFO Andy Fastow's wife and two children) and Raptor (named after the velociraptors in Jurassic Park). The stated goal of the exercise was to identify documents associated with six of those special purpose entities.
Identifying the Starting Candidate Data Set
In this section, Vound describes the beginning of their multi-modal approach: the steps taken to arrive at the candidate data set for predictive coding, including retrieving and de-duplicating the set and excluding files not conducive to predictive coding. A couple of notes here:
- A multi-modal approach (searching, then predictive coding) is common in predictive coding projects, and even court approved, as illustrated in this case. It's common to exclude documents unlikely to be responsive via searching prior to conducting predictive coding.
- There are a lot of file types that aren't conducive to predictive coding because they contain too little usable text, too much text, too many numbers (e.g., spreadsheets), etc. Many people don't realize that. The case study identifies the file types excluded from the exercise.
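For the technically curious, the culling steps described above (de-duplicate the collection, then exclude file types that are poor candidates for predictive coding) can be sketched in a few lines. This is a hypothetical illustration only: the extension list and function names are my own assumptions, not how Intella Connect actually implements culling.

```python
import hashlib
from pathlib import Path

# Illustrative list of extensions to exclude; a real project would use
# the tool's own file-type identification, not just extensions.
EXCLUDED_EXTENSIONS = {".xls", ".xlsx", ".csv", ".zip", ".exe", ".dll"}

def build_candidate_set(files):
    """De-dupe by content hash and drop excluded file types (a sketch)."""
    seen_hashes = set()
    candidates = []
    for path in files:
        p = Path(path)
        if p.suffix.lower() in EXCLUDED_EXTENSIONS:
            continue  # too little usable text, too many numbers, etc.
        digest = hashlib.md5(p.read_bytes()).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate of a document already kept
        seen_hashes.add(digest)
        candidates.append(path)
    return candidates
```

Real eDiscovery platforms typically hash at collection time and track duplicate families rather than simply discarding copies, but the net effect on the candidate set is the same.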
About Intella Connect Predictive Coding
This section walks through the typical predictive coding process using Intella Connect, which uses a Continuous Active Learning® (CAL®) approach to predictive coding. As noted in the paper, in last year’s Predictive Coding Technologies and Protocols semi-annual survey conducted by Complex Discovery, Active Learning was reported as “the most used predictive coding protocol with 88.24% of responders using it in their predictive coding efforts.” It’s clear that CAL has become the most popular approach to predictive coding – by far.
The active learning aspect of CAL, where the algorithm periodically re-ranks documents based on the classifications of reviewed documents, makes it an iterative process: a new iteration starts each time the algorithm re-ranks the documents (which can happen often at first and less frequently over time as more information becomes available). To verify the results, an elusion test is conducted to identify responsive documents that have “eluded” detection by the algorithm to date. If too many have eluded detection, you revert to active learning for more training. Once the elusion test is complete and accepted, you can apply classifications to the remaining documents. The white paper includes a diagram illustrating that process.
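The iterative loop described above can be sketched in code. This is a toy illustration under stated assumptions: the word-overlap “model” stands in for a real text classifier, the stopping rule is simplified, and none of the names reflect Intella Connect's actual implementation.

```python
def train(reviewed):
    # Toy "model": the set of words seen in responsive documents.
    # A real CAL tool retrains a text classifier each iteration.
    vocab = set()
    for doc, responsive in reviewed.items():
        if responsive:
            vocab.update(doc.lower().split())
    return vocab

def score(doc, vocab):
    # Rank documents by overlap with the responsive vocabulary.
    return sum(1 for word in doc.lower().split() if word in vocab)

def cal_review(documents, seed_tags, is_responsive, batch_size=2, stop_ratio=0.5):
    reviewed = dict(seed_tags)  # pre-coded documents seed the first iteration
    remaining = [d for d in documents if d not in reviewed]
    while remaining:
        vocab = train(reviewed)  # new iteration: re-rank on the latest coding
        remaining.sort(key=lambda d: score(d, vocab), reverse=True)
        batch, remaining = remaining[:batch_size], remaining[batch_size:]
        codes = {d: is_responsive(d) for d in batch}  # human review of the batch
        reviewed.update(codes)
        # Simplified stopping rule: a mostly non-responsive batch suggests
        # the algorithm is running out of responsive documents to deliver.
        if sum(codes.values()) / len(codes) < stop_ratio:
            break
    return reviewed, remaining
```

The key behavior to notice is the ordering: early batches are dominated by responsive documents, and review stops once batches turn mostly non-responsive, at which point the elusion test takes over.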
Predictive Coding Exercise with Candidate Data Set
This is where Vound describes the actual predictive coding exercise using the candidate data set identified earlier in the case study. They used “existing tags” to teach the model, meaning they started with 43 documents that had already been reviewed and classified. Vound also illustrates the coding form the user sees while continuing to train the model and documents the results of 32 active learning iterations, which begin with mostly responsive documents, shift to a mix, and end with mostly non-responsive documents (an indication that the algorithm is running out of responsive documents to deliver). The case study does a good job of illustrating the variations along the way as the algorithm continues to learn what's responsive (based on the classifications made during review).
Vound also documents the elusion test that was conducted, including the desired recall rate of 75% to achieve a level of defensibility, the maximum number of eluded items allowed, and the results showing that the maximum wasn't exceeded. Finally, the exercise concludes by applying classifications to the remaining items not reviewed during active learning or the elusion test.
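The white paper's exact formulas aren't reproduced here, but the relationship between a recall target and the number of eluded items allowed can be illustrated with simple arithmetic, assuming recall is defined as responsive documents found divided by total responsive documents (found plus eluded). The numbers in the usage note below are hypothetical, not the case study's.

```python
def max_eluded_responsive(responsive_found, target_recall):
    # recall = found / (found + eluded)  =>  eluded <= found * (1 - r) / r
    return responsive_found * (1 - target_recall) / target_recall

def estimated_recall(responsive_found, eluded_estimate):
    # Recall implied by the elusion test's estimate of missed documents.
    return responsive_found / (responsive_found + eluded_estimate)
```

For example, if active learning surfaced 300 responsive documents and the recall target is 75%, the elusion test can tolerate an estimated 100 eluded responsive documents at most; any more, and you revert to active learning for further training. In practice the eluded count is estimated by sampling the unreviewed “discard pile,” so sample size and confidence intervals also come into play.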
The case study concludes by identifying the number of documents that didn't have to be reviewed by humans and the potential savings based on an assumed throughput for review and an assumed hourly rate for reviewers (obviously, both vary, so it's an example of the potential savings). It's probably worth noting that many predictive coding exercises don't require such a high recall rate in the elusion test, so a smaller scale elusion test can be conducted in many instances, leading to the need to review even fewer documents and resulting in even greater savings.
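The savings calculation itself is straightforward. This sketch uses made-up inputs (the case study's actual throughput and rate assumptions aren't reproduced here):

```python
def review_savings(docs_avoided, docs_per_hour, hourly_rate):
    # Hours saved = documents avoided / reviewer throughput;
    # dollars saved = hours saved * hourly rate.
    hours_saved = docs_avoided / docs_per_hour
    return hours_saved * hourly_rate

# Hypothetical example: avoiding human review of 100,000 documents
# at 50 documents/hour and $40/hour saves 2,000 hours of review time.
savings = review_savings(100_000, 50, 40)
```

Plugging in your own matter's volumes, throughput and rates gives a quick first-order estimate of what predictive coding could save.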
The predictive coding walkthrough is documented in this thirteen-page case study at a level that can help “lay people” understand the workflow better. You can download a copy of the case study white paper illustrating the predictive coding walkthrough using Vound’s Intella Connect product here.
So, what do you think? Do you understand predictive coding workflows? Please share any comments you might have or if you’d like to know more about a particular topic.
Disclosure: Vound is an Educational Partner and sponsor of eDiscovery Today
Disclaimer: The views represented herein are exclusively the views of the author, and do not necessarily represent the views held by my employer, my partners or my clients. eDiscovery Today is made available solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscovery Today should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.