What Tom O wants, Tom O gets. Yesterday, he mentioned the study DELEGATE-52, which shows that LLMs corrupt your documents over time.
DELEGATE-52 is a benchmark dataset and simulation methodology developed by Microsoft Research to evaluate how Large Language Models (LLMs) handle long, delegated document-editing workflows across 52 professional domains (e.g., Python code, accounting ledgers, and music notation).
The study, which was submitted back in April, evaluated 19 LLMs, including current “frontier” LLMs: Gemini 3.1 Pro, Claude 4.6 Opus, and GPT 5.4. The main metric is the reconstruction score after k interactions (i.e., k/2 round-trips), where each round-trip gives the model a forward editing task (e.g., “split the ledger into separate files by expense category”) followed by an inverse task (e.g., “merge all these category ledger files into a single accounting.ledger file”).
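To make the round-trip idea concrete, here is a minimal Python sketch of the loop described above. It is an illustration only, not the study's harness: the `forward` and `inverse` callables stand in for LLM editing calls, and `similarity_score` uses `difflib` as a stand-in because the post does not describe how the study actually computes its reconstruction score. All names here are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Callable, List

Edit = Callable[[str], str]  # hypothetical stand-in for one LLM editing call


def similarity_score(original: str, reconstructed: str) -> float:
    """Stand-in score on a 0-100 scale (an assumption, not the study's metric)."""
    return 100.0 * SequenceMatcher(None, original, reconstructed).ratio()


def round_trip_scores(doc: str, forward: Edit, inverse: Edit, round_trips: int) -> List[float]:
    """Apply forward then inverse repeatedly, scoring against the original each time.

    Each round-trip is two interactions, so entry i corresponds to RS@(2 * (i + 1)).
    """
    scores = []
    current = doc
    for _ in range(round_trips):
        current = inverse(forward(current))
        scores.append(similarity_score(doc, current))
    return scores


def reverse_lines(text: str) -> str:
    """Toy lossless edit: reverse the line order."""
    return "\n".join(reversed(text.split("\n")))


def lossy_inverse(text: str) -> str:
    """Toy faulty inverse: restores the order but silently drops the last line."""
    return "\n".join(list(reversed(text.split("\n")))[:-1])


document = "a\nb\nc\nd"
lossless = round_trip_scores(document, reverse_lines, reverse_lines, 3)
lossy = round_trip_scores(document, reverse_lines, lossy_inverse, 3)
print(lossless)
print(lossy)
```

A perfect delegate would keep the score at 100 no matter how many round-trips it runs; the toy faulty inverse shows how a small loss on each pass keeps pulling the score down.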
The results are shocking. After 20 interactions, even top-tier models corrupted or lost an average of 25% of the document content. Across all 19 models, the average degradation was a staggering 50%. As the study notes: “Current LLMs are unreliable delegates: they introduce sparse but severe errors that silently corrupt documents, compounding over long interaction.”
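The following table highlights the Reconstruction Scores (RS@k) for selected models after 2 and 20 interactions. All tested models show a significant decline in document integrity as interactions increase. Note that the Total Degradation column works out to 100 minus RS@20, so it measures loss relative to the original document, not relative to RS@2.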
| Model | RS@2 (Short-Term) | RS@20 (Long-Horizon) | Total Degradation |
|---|---|---|---|
| Gemini 3.1 Pro | 96.8 | 80.9 | 19.1% |
| Claude 4.6 Opus | 94.2 | 73.1 | 26.9% |
| GPT 5.4 | 94.3 | 71.5 | 28.5% |
| Claude 4.6 Sonnet | 92.2 | 66.0 | 34.0% |
| Grok 4 | 91.7 | 59.3 | 40.7% |
| GPT 5 | 91.5 | 48.3 | 51.7% |
| o1 | 86.4 | 48.1 | 51.9% |
| Mistral Large 3 | 82.4 | 35.5 | 64.5% |
| GPT 4o | 45.6 | 14.7 | 85.3% |
| GPT 5 Nano | 30.3 | 10.0 | 90.0% |
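For anyone who wants to check the arithmetic, the short Python sketch below recomputes the last column from the RS@20 values in the table. The function names are mine, and the second function (the drop between RS@2 and RS@20) is an extra view that I added for comparison; it is not a column reported in the table.

```python
# (model, RS@2, RS@20) copied from the table above
rows = [
    ("Gemini 3.1 Pro", 96.8, 80.9),
    ("Claude 4.6 Opus", 94.2, 73.1),
    ("GPT 5.4", 94.3, 71.5),
    ("Claude 4.6 Sonnet", 92.2, 66.0),
    ("Grok 4", 91.7, 59.3),
    ("GPT 5", 91.5, 48.3),
    ("o1", 86.4, 48.1),
    ("Mistral Large 3", 82.4, 35.5),
    ("GPT 4o", 45.6, 14.7),
    ("GPT 5 Nano", 30.3, 10.0),
]


def total_degradation(rs_20: float) -> float:
    """Loss relative to the original document: 100 minus RS@20."""
    return round(100.0 - rs_20, 1)


def drop_since_rs2(rs_2: float, rs_20: float) -> float:
    """Points lost between interaction 2 and interaction 20 (not in the table)."""
    return round(rs_2 - rs_20, 1)


for model, rs_2, rs_20 in rows:
    print(f"{model:18} degradation={total_degradation(rs_20):5.1f}  "
          f"drop RS@2->RS@20={drop_since_rs2(rs_2, rs_20):5.1f}")
```

The two views can differ a lot. GPT 4o shows 85.3% total degradation, but most of that loss is already present at RS@2 (45.6); between interaction 2 and interaction 20 it drops 30.9 points.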
Other notable findings:
- Python is the only domain among the 52 tested where the majority of models (17 out of 19) achieved “ready” status (defined as a reconstruction score of 98% or higher).
- Agentic tool use actually increased document degradation by an average of 6%.
- Document degradation is rarely a “death by a thousand cuts.” Instead, it is characterized by Critical Failures—sudden drops of 10 points or more in a single interaction. These massive errors account for 80% of total document damage.
- The risk of corruption scales aggressively with the complexity of the project. The study found that document size and interaction length compound multiplicatively (e.g., for GPT 5.4, a 10k token document degraded significantly faster than a 1k document).
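The “Critical Failures” finding above is easy to picture with a little Python. The sketch below flags any single-interaction drop of 10 points or more and reports what share of the total damage those drops account for. The score trajectory is made up purely for illustration (it is not data from the study), and the function name is hypothetical.

```python
from typing import List, Tuple


def critical_failure_share(scores: List[float], threshold: float = 10.0) -> Tuple[List[float], float]:
    """Return the drops of at least `threshold` points and their share of total damage."""
    # Only count steps where the score went down.
    drops = [prev - cur for prev, cur in zip(scores, scores[1:]) if prev > cur]
    total_damage = sum(drops)
    if total_damage == 0:
        return [], 0.0
    critical = [d for d in drops if d >= threshold]
    return critical, sum(critical) / total_damage


# Made-up trajectory for illustration only: mostly small slips, plus two sudden drops.
trajectory = [100.0, 99.0, 98.0, 84.0, 83.0, 82.0, 70.0]
critical_drops, share = critical_failure_share(trajectory)
print(critical_drops, f"{share:.0%} of total damage")
```

In this invented example, two big drops do most of the damage while the small slips barely register, which is the pattern the study describes.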
The conclusion for the DELEGATE-52 study is this:
“We find that current LLMs are unreliable delegates: even frontier models corrupt an average of 25% of document content over long workflows, with sparse but severe errors that silently compound over time. Our analysis shows that degradation worsens with document length, interaction horizon, and distractor context, and is not mitigated by agentic tool use. These results highlight a fundamental gap in reliability that undermines trust in delegation.”
It’s a very dense 36-page study (and, yes, I used AI to help get through it – ironic, isn’t it? 😉). But the study does illustrate what my friend Tom O’Connor was saying yesterday – that LLMs corrupt your documents over time. Ruh-roh!
P.S.: More irony. When I went to Bing to use DALL-E 3 to create an image for this post, Bing was down. 😠
So, what do you think? Are you concerned about the findings in the DELEGATE-52 study? Please share any comments you might have, or let me know if you’d like to know more about a particular topic.
Image created using ChatGPT, using the term “robot lawyer wearing a suit doing a faceplant”.
Disclaimer: The views represented herein are exclusively the views of the author, and do not necessarily represent the views held by my employer, my partners or my clients. eDiscovery Today is made available solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscovery Today should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.

Ouch! The DELEGATE‑52 results are deeply concerning. A benchmark this broad, with 52 domains and 19 models, shows that even frontier LLMs introduce sparse but severe errors that silently compound over long workflows. With top models losing ~25% of content after 20 interactions and average degradation around 50%, document integrity and workflow reliability remain unresolved, non‑negotiable issues. This appears to be a foundational reliability gap, not a cosmetic one.