Judge Used Copilot to Check Expert’s Work & Got 3 Different Answers: Artificial Intelligence Trends

In a recent fiduciary duty dispute, Microsoft Copilot figured prominently in the court's analysis. The judge used Copilot to check an expert's work in the case. It did not go well.

The ruling was issued by Judge Jonathan G. Schopf of the Surrogate’s Court of Saratoga County. The Objectant in the case, Owen K. Weber, claimed that the trustee Susan F. Weber breached her fiduciary duty by retaining a particular property and using it for personal travel. The Objectant’s expert, Charles Ranson, argued that the property should have been sold earlier and the proceeds reinvested. However, Judge Schopf found his testimony speculative and lacking in real estate expertise.

So, where does Copilot come into play? As Judge Schopf stated: “the testimony revealed that Mr. Ranson relied on Microsoft Copilot, a large language model generative artificial intelligence chatbot, in cross-checking his calculations.” Continuing, he said: “Despite his reliance on artificial intelligence, Mr. Ranson could not recall what input or prompt he used to assist him with the Supplemental Damages Report. He also could not state what sources Copilot relied upon and could not explain any details about how Copilot works or how it arrives at a given output. There was no testimony on whether these Copilot calculations considered any fund fees or tax implications.”


Despite the fact that Judge Schopf “has no objective understanding as to how Copilot works”, he stated: “To illustrate the concern with this, the Court entered the following prompt into Microsoft Copilot on its Unified Court System (UCS) issued computer: ‘Can you calculate the value of $250,000 invested in the Vanguard Balanced Index Fund from December 31, 2004 through January 31, 2021?’ and it returned a value of $949,070.97 — a number different than Mr. Ranson’s. Upon running this same query on two (2) additional UCS computers, it returned values of $948,209.63 and a little more than $951,000.00, respectively. While these resulting variations are not large, the fact there are variations at all calls into question the reliability and accuracy of Copilot to generate evidence to be relied upon in a court proceeding.”

In other words, the judge used Copilot to check Ranson’s work, asking it the same question three times and getting three different answers.
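To see why three different answers is telling, note that the underlying question is deterministic arithmetic: given a starting amount, a holding period, and a return series, the ending value is fixed. Below is a minimal Python sketch (an illustration, not anything from the ruling) that backs out the annualized return implied by one of the Copilot figures quoted in the opinion; the implied rate is derived here purely for illustration and is not the fund's actual published return.

```python
from datetime import date

# Illustration only (not from the ruling): back out the annualized return
# implied by one of the Copilot figures quoted in the opinion. The dates and
# dollar amounts come from the decision; the implied rate is derived here
# purely to show that fixed inputs always reproduce the same output.
principal = 250_000.00
reported_value = 949_070.97          # one of the three Copilot answers
years = (date(2021, 1, 31) - date(2004, 12, 31)).days / 365.25

implied_cagr = (reported_value / principal) ** (1 / years) - 1
recomputed = principal * (1 + implied_cagr) ** years

print(f"Holding period: {years:.2f} years")              # ~16.09 years
print(f"Implied annualized return: {implied_cagr:.2%}")  # ~8.6%
print(f"Recomputed value: ${recomputed:,.2f}")           # same number every run
```

A conventional script or spreadsheet run three times produces the identical figure; the variation across the three UCS computers suggests Copilot was not performing a reproducible computation from fixed inputs.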

Continuing, Judge Schopf stated: “Interestingly, when asked the following question: ‘are you accurate’, Copilot generated the following answer: ‘I aim to be accurate within the data I’ve been trained on and the information I can find for you. That said, my accuracy is only as good as my sources so for critical matters, it’s always wise to verify.’” Judge Schopf provided a footnote to the answer, stating: “This brings to mind the old adage, ‘garbage in, garbage out’. Clearly a user of Copilot and other artificial intelligence software must be trained or have knowledge of the appropriate inputs to ensure the most accurate results.”

Judge Schopf also asked about reliability with a follow-up question of: “are your calculations reliable enough for use in court”, to which Copilot responded with “[w]hen it comes to legal matters, any calculations or data need to meet strict standards. I can provide accurate info, but it should always be verified by experts and accompanied by professional evaluations before being used in court…”


In noting that AI is “an emerging issue that trial courts are beginning to grapple with and for which it does not appear that a bright-line rule exists”, Judge Schopf stated: “The use of artificial intelligence is a rapidly growing reality across many industries. The mere fact that artificial intelligence has played a role, which continues to expand in our everyday lives, does not make the results generated by artificial intelligence admissible in Court.” While citing People v. Wakefield, 38 NY3d 367 [2022], Judge Schopf stated: “The Court of Appeals has found that certain industry specific artificial intelligence technology is generally accepted…However, Wakefield involved a full Frye hearing that included expert testimony that explained the mathematical formulas, the processes involved, and the peer-reviewed published articles in scientific journals.” [link added]

In his conclusion on the issue (and ruling against the Objectant and for the Petitioner), Judge Schopf stated: “In what may be an issue of first impression, at least in Surrogate’s Court practice, this Court holds that due to the nature of the rapid evolution of artificial intelligence and its inherent reliability issues that prior to evidence being introduced which has been generated by an artificial intelligence product or system, counsel has an affirmative duty to disclose the use of artificial intelligence and the evidence sought to be admitted should properly be subject to a Frye hearing prior to its admission, the scope of which should be determined by the Court, either in a pre-trial hearing or at the time the evidence is offered.”

One of the big concerns about large language models I’ve seen again and again is how they can be asked the same question multiple times and give a different answer each time. For generative AI content to be used as evidence in the courtroom, that’s a concern that will need to be addressed. Otherwise, expect courts to raise concerns similar to those Judge Schopf raised here.

So, what do you think? Are you surprised that the judge used Copilot to check on the expert’s work and got a different answer each time? Please share any comments you might have or if you’d like to know more about a particular topic.

Hat tip to Maura R. Grossman for the heads up on this case!

Image created using GPT-4o’s Image Creator Powered by DALL-E, using the term “robot judge doing a faceplant when looking at a computer”.

Disclaimer: The views represented herein are exclusively the views of the authors and speakers themselves, and do not necessarily represent the views held by my employer, my partners or my clients. eDiscovery Today is made available solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscovery Today should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.



Comments

  1. Doug-

    Thank you for shedding light on this important case.

    The Surrogate got it right, of course. The expert’s work was rubbish. But it was helpful to all of us that the Surrogate addressed the issues of reliability of genAI for making complex damage calculations and the standard for scientific evidence and expert testimony in support of the use of AI.

    It is not surprising that a GPT would come up with three different answers to the same prompt under these circumstances. Generative AI is not reliable for math. Sometimes it cannot count the number of a certain letter in a word, let alone do complex damages calculations as was attempted in this case. The “expert” Ranson was unable to support his own methodology for calculating the damages to the judge, and Copilot was used to “verify” his erroneous calculations. Ranson could not provide the prompt he used, nor could he recall the sources Copilot used in coming up with an answer. It was a wonder that Copilot came as close as it did to an answer. The best lesson of this opinion is this advice from Copilot when asked if its “calculations [are] reliable enough for use in court”: “Copilot responded with ‘[w]hen it comes to legal matters, any calculations or data need to meet strict standards. I can provide accurate info, but it should always be verified by experts and accompanied by professional evaluations before being used in court.'”

    No surprise that Maura Grossman gave you this case just days after it was decided. That remarkable person is on top of it, which helps all of us.

    Ralph Artigliere

  2. Fascinating topic and analysis. However, I’m left wondering about the specifics of the expert’s report. The expert used Copilot to “cross-check his calculations,” implying his primary work wasn’t done by AI. But did he provide transparency on his original calculations and sources? If he relied heavily on Copilot without verifying the AI’s accuracy or properly validating inputs and formulas, it clearly suggests overreliance. This situation raises a broader question: Is there a balanced role for AI in reviewing expert analysis and calculations without undermining the expert’s diligence?
