I don’t have to tell you how much consumer data is out there – you already know that. It’s one of the biggest reasons why we expect 463 exabytes (that’s 463 million terabytes!) of data to be created each day globally by 2025. Artificial Intelligence (AI) plays a big part in that mushrooming growth of personal data. It’s a big mess. Now, with data privacy laws strengthening, and since AI helped get us into this mess, can it help get us out?
AI’s Role in Tracking of Personal Data
In addition to the personal data you know that you’re sharing, there is a lot of data you’re sharing through AI algorithms and you may not realize just how extensively that’s happening. Do you have a Facebook profile? Perform searches on Google? Or even simply browse the web? Then, AI mechanisms such as search algorithms, recommendation engines, adtech networks and facial recognition systems – driven by machine learning algorithms – are almost literally tracking your every move online.
For example, if you go to a retailer’s web site and then move on to other sites, AI algorithms are the reason you often see ads for that retailer continue to pop up as you surf the net. And, if a Facebook friend uploads a picture of a group gathering that you were included in, facial recognition enables Facebook to often identify you as being included in that picture, even without being tagged. The examples and algorithms tracking your online activities are practically endless.
Data Privacy Laws to the Rescue?
As a result of all of the personal data being tracked, governments have begun to take action to protect the data privacy rights of individuals. In 2018, the General Data Protection Regulation (GDPR) took effect across the European Economic Area (EEA) – which is slightly larger than the European Union – and, this year, California implemented the California Consumer Privacy Act (CCPA). Many other states and countries have implemented – or are working to implement – their own laws.
Impact and Challenges for eDiscovery
So, how does that impact eDiscovery litigation and compliance workflows? The need to identify Personally Identifiable Information (PII), and subsets of PII such as Protected Health Information (PHI), has placed an additional requirement on discovery teams in many places: they must take action to minimize the distribution of that PII. This has increased the need to add potential redaction of that information to eDiscovery workflows for litigation, and it has dramatically increased compliance requirements.
Identifying PII within exponentially growing ESI collections can be extremely difficult. Here are two mechanisms typically employed by discovery teams today to identify and redact PII:
- “Brute Force” Review: Either as part of existing document review workflows, or as additional workflows, teams are adding the location and redaction of PII to the responsibilities of their review teams. Needless to say, “brute force” review of documents to locate and redact PII is time-consuming and costly. Certain terms that are always PII-related can be searched to help locate some of that PII, but a lot of PII can’t be identified that way. If the names “Doug Austin” or “Barry White” appear in a document, you can’t simply redact all instances of “Austin” (because it’s also a city) or “White” (because it’s also a color).
- Regular Expressions: There is some PII that can be identified through regular expression (RegEx) pattern matching. Using RegEx to identify things like social security numbers, phone numbers, driver’s license numbers and credit card numbers can help automate the identification (and even redaction) of some PII. But, RegEx doesn’t help with other PII, such as addresses or health history.
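To make the RegEx approach concrete, here is a minimal sketch in Python. The patterns, the `PII_PATTERNS` table and the `find_pii` and `redact` helpers are all illustrative assumptions on my part, not taken from any particular eDiscovery product, and real-world formats vary enough that production patterns would need far more validation.

```python
import re

# Illustrative patterns only (assumed US-style formats); real data is messier.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"(?<!\d)\(?\d{3}\)?[-. ]\d{3}[-. ]\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d{4}[- ]?){3}\d{4}\b"),
}


def find_pii(text):
    """Return (label, start, end, matched_text) tuples, ordered by position."""
    hits = []
    for label, pattern in PII_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((label, match.start(), match.end(), match.group()))
    return sorted(hits, key=lambda hit: hit[1])


def redact(text, mask="[REDACTED]"):
    """Replace every hit with a mask, working right to left so offsets stay valid."""
    for _label, start, end, _value in reversed(find_pii(text)):
        text = text[:start] + mask + text[end:]
    return text


sample = "SSN 123-45-6789, call 555-867-5309, card 4111 1111 1111 1111."
print(redact(sample))
```

Notice what a sketch like this cannot do: a sentence such as “Doug Austin lives on Elm Street” contains PII, but nothing in it follows a fixed character pattern, so it passes through untouched.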
Using AI to Identify PII
If you’re an experienced eDiscovery professional, you’re probably familiar with technologies such as clustering and predictive coding used to accelerate the process of identifying groups of responsive documents, using unsupervised and supervised machine learning technologies respectively. In addition, natural language processing (NLP) AI algorithms can be used to automatically interpret words based on the context in which the words are used – i.e., to differentiate those uses of “Austin” and “White” I mentioned earlier.
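To illustrate why context matters, here is a deliberately tiny, rule-based sketch in Python. It is not NLP in any real sense: actual tools rely on trained language models, whereas this toy only looks at the word immediately before the term. The cue lists and the `classify_mentions` function are hypothetical, made up purely for this example.

```python
# Hypothetical cue lists; a real system would learn context from training data.
TITLES = {"mr.", "mrs.", "ms.", "dr."}
FIRST_NAMES = {"doug", "barry"}
PLACE_PREPOSITIONS = {"in", "to", "from", "near"}


def classify_mentions(text, term):
    """Label each occurrence of `term` as 'person', 'place' or 'unknown',
    based only on the word that precedes it."""
    tokens = text.split()
    labels = []
    for i, token in enumerate(tokens):
        # Ignore trailing punctuation when comparing against the term.
        if token.strip(".,;:!?") != term:
            continue
        previous = tokens[i - 1].lower() if i > 0 else ""
        if previous in TITLES or previous in FIRST_NAMES:
            labels.append("person")
        elif previous in PLACE_PREPOSITIONS:
            labels.append("place")
        else:
            labels.append("unknown")
    return labels


print(classify_mentions("Doug Austin flew to Austin on Monday.", "Austin"))
```

Only the first “Austin” in that sentence is a redaction candidate. Anything labeled “unknown” is exactly the kind of hit a human reviewer would still need to check.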
Linguistic models, which can apply as many as thousands of complex search terms designed by linguists to classify PII, are becoming the next step toward identifying PII more efficiently. At the very least, the ability to identify potential PII more precisely will help streamline the review process for that PII, by enabling reviewers in many cases to simply confirm the PII identified by the AI algorithm. Will the next step be to automatically redact it? Perhaps someday. Regardless, the use of AI algorithms to identify PII will certainly be one of the biggest growth areas for eDiscovery technology over the next several years. AI helped get us into this mess; let’s see if it can help get us out!
So, what do you think? How have you addressed identification of PII in your discovery workflows? Please share any comments you might have, or let me know if you’d like to know more about a particular topic.
Disclaimer: The views represented herein are exclusively the views of the authors and speakers themselves, and do not necessarily represent the views held by my employer, my partners or my clients. eDiscovery Today is made available solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscovery Today should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.