DELEGATE-52 Shows That LLMs Corrupt Your Documents Over Time

What Tom O wants, Tom O gets. Yesterday, he mentioned the study DELEGATE-52, which shows that LLMs corrupt your documents over time.

DELEGATE-52 is a benchmark dataset and simulation methodology developed by Microsoft Research to evaluate how Large Language Models (LLMs) handle long, delegated document-editing workflows across 52 professional domains (e.g., Python code, accounting ledgers, and music notation).

The study, which was submitted back in April, evaluated 19 LLMs, including current “frontier” LLMs: Gemini 3.1 Pro, Claude 4.6 Opus, and GPT 5.4. The main metric is the reconstruction score after k interactions (i.e., k/2 round-trips), where each round-trip gives the model a forward editing task (e.g., “split the ledger into separate files by expense category”) followed by an inverse task (e.g., “merge all these category ledger files into a single accounting.ledger file”).
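To make the round-trip idea concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the "edits" are toy functions rather than LLM calls, the names (round_trip_scores, similarity_score, and so on) are made up for this example, and the similarity measure is Python's difflib ratio, which is almost certainly not the scoring method the study uses.

```python
from difflib import SequenceMatcher
from typing import Callable, List

Edit = Callable[[str], str]  # hypothetical stand-in for "ask the model to edit"


def similarity_score(original: str, current: str) -> float:
    """Character-level similarity on a 0-100 scale (an assumption, not the study's metric)."""
    return 100.0 * SequenceMatcher(None, original, current).ratio()


def round_trip_scores(original: str, forward: Edit, inverse: Edit, k: int) -> List[float]:
    """Alternate forward and inverse edits for k interactions.

    Returns one score per completed round trip (k // 2 scores in total).
    """
    doc = original
    scores = []
    for step in range(1, k + 1):
        if step % 2 == 1:
            doc = forward(doc)
        else:
            doc = inverse(doc)
            scores.append(similarity_score(original, doc))
    return scores


def reverse_lines(doc: str) -> str:
    """Toy reversible edit: reverse the line order (it is its own inverse)."""
    return "\n".join(reversed(doc.splitlines()))


def lossy_reverse_lines(doc: str) -> str:
    """Toy faulty inverse: restores the order but silently drops the last line."""
    return "\n".join(list(reversed(doc.splitlines()))[:-1])


ledger = "rent 1200\nfood 300\nfuel 80\nbooks 45"
faithful = round_trip_scores(ledger, reverse_lines, reverse_lines, k=6)
lossy = round_trip_scores(ledger, reverse_lines, lossy_reverse_lines, k=6)
print("faithful:", faithful)
print("lossy:   ", [round(s, 1) for s in lossy])
```

With a faithful inverse the score stays at 100 on every round trip; with the faulty inverse, each round trip quietly loses a line and the score keeps falling. That is the kind of compounding loss the metric is designed to expose.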

The results are shocking. After 20 interactions, even top-tier models corrupted or lost an average of 25% of the document content. Across all 19 models, the average degradation was a staggering 50%. As the study notes: “Current LLMs are unreliable delegates: they introduce sparse but severe errors that silently corrupt documents, compounding over long interaction.”

The following table highlights the Reconstruction Scores (RS@k) for selected models after 2 and 20 interactions. All tested models show a significant decline in document integrity as interactions increase.

Model	RS@2 (Short-Term)	RS@20 (Long-Horizon)	Total Degradation (100 - RS@20)
Gemini 3.1 Pro	96.8	80.9	19.1%
Claude 4.6 Opus	94.2	73.1	26.9%
GPT 5.4	94.3	71.5	28.5%
Claude 4.6 Sonnet	92.2	66.0	34.0%
Grok 4	91.7	59.3	40.7%
GPT 5	91.5	48.3	51.7%
o1	86.4	48.1	51.9%
Mistral Large 3	82.4	35.5	64.5%
GPT 4o	45.6	14.7	85.3%
GPT 5 Nano	30.3	10.0	90.0%
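For anyone who wants to check the arithmetic, the short Python sketch below recomputes the last column from the RS@20 values in the table. Going by the numbers themselves, "Total Degradation" works out to 100 minus RS@20 (not RS@2 minus RS@20). The variable and function names are my own, and treating the first three rows as the "frontier" group is an assumption based on the models named earlier in this post.

```python
# (RS@2, RS@20, published degradation %) copied from the table above
table = {
    "Gemini 3.1 Pro": (96.8, 80.9, 19.1),
    "Claude 4.6 Opus": (94.2, 73.1, 26.9),
    "GPT 5.4": (94.3, 71.5, 28.5),
    "Claude 4.6 Sonnet": (92.2, 66.0, 34.0),
    "Grok 4": (91.7, 59.3, 40.7),
    "GPT 5": (91.5, 48.3, 51.7),
    "o1": (86.4, 48.1, 51.9),
    "Mistral Large 3": (82.4, 35.5, 64.5),
    "GPT 4o": (45.6, 14.7, 85.3),
    "GPT 5 Nano": (30.3, 10.0, 90.0),
}


def degradation(rs_at_20: float) -> float:
    """Degradation as 100 minus the reconstruction score, rounded to one decimal."""
    return round(100.0 - rs_at_20, 1)


# Rows where the recomputed value disagrees with the published column
mismatches = [
    model
    for model, (_, rs20, published) in table.items()
    if abs(degradation(rs20) - published) > 0.05
]

# Assumption: the three "frontier" models are the ones named in this post
frontier = ["Gemini 3.1 Pro", "Claude 4.6 Opus", "GPT 5.4"]
frontier_avg = sum(degradation(table[m][1]) for m in frontier) / len(frontier)

print("rows that do not match:", mismatches)
print("frontier average degradation:", round(frontier_avg, 1))
```

Every row matches, and the three frontier models average a little under 25, which is consistent with the "average of 25%" figure quoted from the study.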

Other notable findings:

Python is the only domain among the 52 tested where the majority of models (17 out of 19) achieved “ready” status (defined as a reconstruction score of 98% or higher).
Agentic tool use actually increased document degradation by an average of 6%.
Document degradation is rarely a “death by a thousand cuts.” Instead, it is characterized by Critical Failures—sudden drops of 10 points or more in a single interaction. These massive errors account for 80% of total document damage.
The risk of corruption scales aggressively with the complexity of the project. The study found that document size and interaction length compound multiplicatively (e.g., for GPT 5.4, a 10k-token document degraded significantly faster than a 1k-token document).
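The "ready" threshold and the Critical Failure definition above are easy to express in code. The sketch below applies both to a made-up score trajectory; the numbers are invented purely for illustration (they are not from the study), the function names are hypothetical, and the way the share of damage is computed here is my own assumption about how such a figure could be derived.

```python
from typing import List, Tuple

READY_THRESHOLD = 98.0     # "ready" status: reconstruction score of 98 or higher
CRITICAL_DROP = 10.0       # Critical Failure: drop of 10+ points in one interaction


def is_ready(score: float) -> bool:
    """True when a reconstruction score meets the 'ready' threshold."""
    return score >= READY_THRESHOLD


def critical_failures(scores: List[float]) -> Tuple[List[int], float]:
    """Return the steps where a Critical Failure occurred and their share of total loss."""
    drops = [prev - cur for prev, cur in zip(scores, scores[1:])]
    total_loss = sum(d for d in drops if d > 0)
    critical_steps = [i + 1 for i, d in enumerate(drops) if d >= CRITICAL_DROP]
    critical_loss = sum(d for d in drops if d >= CRITICAL_DROP)
    share = critical_loss / total_loss if total_loss else 0.0
    return critical_steps, share


# Hypothetical trajectory: score after each interaction, starting from a perfect 100
trajectory = [100.0, 99.0, 98.5, 86.0, 85.5, 85.0, 72.0, 71.5]
steps, share = critical_failures(trajectory)

print("ready after 2 interactions:", is_ready(trajectory[2]))
print("critical failures at steps:", steps)
print("share of total loss:", round(share, 2))
```

In this invented example, two sudden drops account for most of the loss while the small slips in between barely register, which is the pattern the study describes.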

The conclusion for the DELEGATE-52 study is this:

“We find that current LLMs are unreliable delegates: even frontier models corrupt an average of 25% of document content over long workflows, with sparse but severe errors that silently compound over time. Our analysis shows that degradation worsens with document length, interaction horizon, and distractor context, and is not mitigated by agentic tool use. These results highlight a fundamental gap in reliability that undermines trust in delegation.”

It’s a very dense 36-page study (and, yes, I used AI to help get through it – ironic, isn’t it? 😉). But the study does illustrate what my friend Tom O’Connor was saying yesterday – that LLMs corrupt your documents over time. Ruh-roh!

P.S.: More irony. When I went to Bing to use DALL-E 3 to create an image for this post, Bing was down. 😠

So, what do you think? Are you concerned about the findings in the DELEGATE-52 study? Please share any comments you might have, or let me know if you’d like to know more about a particular topic.

Image created using ChatGPT, using the term “robot lawyer wearing a suit doing a faceplant”.

Disclaimer: The views represented herein are exclusively the views of the author, and do not necessarily represent the views held by my employer, my partners or my clients. eDiscovery Today is made available solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscovery Today should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.

Discover more from eDiscovery Today by Doug Austin

Subscribe to get the latest posts sent to your email.

2 comments

Ralph Artigliere says:

June 5, 2026 at 7:54 am

Ouch! The DELEGATE‑52 results are deeply concerning. A benchmark this broad with 52 domains and 19 models shows that even frontier LLMs introduce sparse but severe errors that silently compound over long workflows. With top models losing ~25% of content after 20 interactions and average degradation around 50%, document integrity and workflow reliability remain unresolved, present non‑negotiable issues. This appears to be a foundational reliability gap, not a cosmetic one.

The Kitchen Sink for June 19, 2026 says:

June 19, 2026 at 11:15 am

[…] from his wife’s sheet music – ChatGPT did it, but “subtly altered the resulting PDFs”. It happens. But when he used ChatGPT to write a Python program to do it, it worked. […]
