Actually, it’s beyond time to retire the Enron email corpus as Craig Ball notes in his latest blog post. Here’s why we need a new dataset for eDiscovery.
In Craig’s post (Still on Dial-Up: Why It’s Time to Retire the Enron Email Corpus, available here), he discusses how he was hired to work as the lead computer forensic examiner for plaintiffs in the Enron case over two decades ago.
As Craig notes, when Enron collapsed in 2001 amid accounting fraud and market-manipulation scandals, the U.S. Federal Energy Regulatory Commission (FERC) launched a sweeping investigation into abuses during the Western U.S. energy crisis. As part of that probe, FERC collected huge volumes of internal Enron email. In 2003, in an extraordinary act of transparency, FERC made a subset of those emails public as part of its docket.
Eventually, Carnegie Mellon University’s School of Computer Science downloaded the FERC release, cleaned and structured it into individual mailboxes, and published it for research. EDRM then stepped in to make the corpus more accessible to the legal tech world. EDRM curated, repackaged, and hosted improved versions, including PST-structured mailboxes and more comprehensive metadata (though they no longer host it). While Enron is long gone as a company, the Enron Email Corpus – eDiscovery’s default demo dataset – is still here and is being used to test and demonstrate eDiscovery solutions as much as ever.
As Craig notes, its virtues are obvious:
- Free and lawful to use
- Large enough (over half a million messages from about 150 Enron employees) to exercise search and analytics tools
- Real corporate communications with all their messy quirks
- Familiar to the point of being an industry standard
But, as Craig says, those virtues are also the trap. The data is from 2001—before smartphones, Teams, Slack, Zoom, linked attachments, and nearly every other element that makes modern email review challenging.
Not only that, it’s even a poor dataset for illustrating traditional email. The data originated in Lotus Notes before being repackaged into Outlook PST files, so things like the metadata that indicates message threads aren’t there either. These are all reasons why it may be time to retire the Enron email corpus.
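For readers who want to see what "thread metadata" means in practice, here is a minimal sketch in Python that checks whether a raw email message carries any of the headers commonly used to group messages into threads. The header list, the function names and both sample messages are my own assumptions for illustration; the samples are hypothetical and are not taken from the Enron corpus, so you would need to run a check like this against your own copy of a dataset to see what it actually preserves.

```python
from email import message_from_string

# Headers commonly used to reconstruct threads. Which of these a given
# export preserves is an assumption to verify against your own data.
THREAD_HEADERS = ("In-Reply-To", "References", "Thread-Index", "Thread-Topic")


def thread_headers_present(raw_message: str) -> dict[str, bool]:
    """Report which threading headers appear in one raw message."""
    msg = message_from_string(raw_message)
    return {name: msg.get(name) is not None for name in THREAD_HEADERS}


def has_thread_metadata(raw_message: str) -> bool:
    """Return True if at least one threading header is present."""
    return any(thread_headers_present(raw_message).values())


# Hypothetical sample messages, written for illustration only.
NO_THREAD_HEADERS = (
    "Message-ID: <1.example@legacy.invalid>\n"
    "From: a@example.com\n"
    "To: b@example.com\n"
    "Subject: RE: Meeting\n"
    "\n"
    "See you there.\n"
)
WITH_THREAD_HEADERS = (
    "Message-ID: <2.example@modern.invalid>\n"
    "In-Reply-To: <1.example@modern.invalid>\n"
    "References: <1.example@modern.invalid>\n"
    "From: b@example.com\n"
    "To: a@example.com\n"
    "Subject: RE: Meeting\n"
    "\n"
    "Confirmed.\n"
)

if __name__ == "__main__":
    print(has_thread_metadata(NO_THREAD_HEADERS))
    print(has_thread_metadata(WITH_THREAD_HEADERS))
```

When none of these headers survive, a review tool has to fall back on weaker signals, such as matching subject lines or quoted text, to guess which messages belong together.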
Craig goes on to discuss why (despite its flaws) there will never be another dataset like Enron. And he discusses the need to build datasets that reflect the present from one or more of these: synthetic corpora, FOIA-based collections, anonymized donor data and blended corpora.
I actually thought I had found a potential replacement at ILTACON last week. During a press meeting with Everlaw, their CEO and Founder, AJ Shankar, was conducting a demo using documents from the Opioid Industry Documents Archive, which was created by the University of California at San Francisco (UCSF) and Johns Hopkins University in 2021. Everlaw’s database was 1.2 million documents – out of nearly 5.4 million total documents (nearly 2.8 million of those are emails). Wow!
The problem with this collection is two-fold: 1) You can’t download the entire collection (or even large chunks of it) – Everlaw has the collection they have because UCSF used the Everlaw platform to upload and redact these files for their public-facing archive, and 2) the files are converted to PDF format, so key metadata, embedded or linked files, etc. are missing, which makes it difficult (if not impossible) to test some of the very conditions Craig discusses in his post.
So, the search continues. Craig has more details about the benefits and restrictions of the Enron dataset in his post here (including a shout out to me about a 2019 blog post I wrote on the topic, which indicates how long this has been an issue!).
Regarding the search for a replacement, I expect to have more on that in the next few days. Stay tuned!
So, what do you think? Do you think it’s time to retire the Enron email corpus? Or is it still useful for eDiscovery solution testing and demos? Please share any comments you might have, or let me know if you’d like to know more about a particular topic.
Image created using Microsoft Designer, using the term “robot preacher speaking to a choir of robots”. Get it? Preaching to the choir! 🤣
Disclaimer: The views represented herein are exclusively the views of the author, and do not necessarily represent the views held by my employer, my partners or my clients. eDiscovery Today is made available solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscovery Today should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.
Discover more from eDiscovery Today by Doug Austin