Because Ranson was so bad at explaining how Copilot works, Schopf took the extra time to try using Copilot to generate the estimates that Ranson got, and he could not.

Each time, the court entered the same query into Copilot—"Can you calculate the value of $250,000 invested in the Vanguard Balanced Index Fund from December 31, 2004 through January 31, 2021?"—and each time Copilot generated a slightly different answer.

This "calls into question the reliability and accuracy of Copilot to generate evidence to be relied upon in a court proceeding," Schopf wrote.

Chatbot not to blame, judge says

While the court was experimenting with Copilot, it also probed the chatbot for answers to a bigger-picture legal question: Are Copilot's responses accurate enough to be cited in court?

The court found that Copilot had less faith in its outputs than Ranson seemingly did. When asked "are you accurate" or "reliable," Copilot responded that "my accuracy is only as good as my sources, so for critical matters, it's always wise to verify." When more specifically asked, "Are your calculations reliable enough for use in court," Copilot similarly recommended that outputs "should always be verified by experts and accompanied by professional evaluations before being used in court."

Although it seemed clear that Ranson did not verify outputs before using them in court, Schopf noted that at least "developers of the Copilot program recognize the need for its supervision by a trained human operator to verify the accuracy of the submitted information as well as the output."

Microsoft declined Ars' request to comment.

Until a bright-line rule exists telling courts when to accept AI-generated testimony, Schopf suggested that courts should require disclosures from lawyers to stop chatbot-spouted inadmissible testimony from disrupting the legal system.