A Simple Question for Testing AI Chatbots

I ran across an article that offers a simple question for testing AI chatbots to see how accurate they are.

The question comes from a ZDNet article by Maria Diaz, “ChatGPT vs. Microsoft Copilot vs. Gemini: Which is the best AI chatbot?” The basic answer to the question in that title is “it depends,” and Diaz provides a terrific discussion of factors to consider. But she also proposes a simple question for testing AI chatbots. Here is the prompt she uses:

“I have 5 oranges today, I ate 3 oranges last week. How many oranges do I have left?”

If you said “2”, go back and re-read the statement. It’s 5 because “I have 5 oranges today” and the oranges I ate last week don’t matter.

Here’s Diaz’s quick overview of how the AI chatbots performed with this test:

ChatGPT with GPT-4o: Succeeded
ChatGPT with GPT-4: Succeeded
ChatGPT with GPT-3.5: Failed
Microsoft Copilot in Creative: Succeeded
Microsoft Copilot in Balanced: Failed
Microsoft Copilot in Precise: Failed
Google Gemini: Succeeded
Google Gemini Advanced: Succeeded

Here are screenshots she provided to illustrate responses from GPT-4o, GPT-3.5 and Copilot in Balanced mode:

While Google Gemini got the answer right in both versions, that doesn’t mean it won’t provide ridiculously bad responses to some prompts (after all, it did tell people to “add glue to the sauce” to keep cheese from sliding off pizzas and that “running with scissors is a cardio exercise”). However, now that many of us have access to multiple LLMs, this exercise illustrates that it’s often good to submit the same prompt to different models and compare what you get from each of them. Some models may have better answers than others, and some may be WAY off base.
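For readers who want to try this kind of side-by-side comparison themselves, here is a minimal Python sketch of the idea. It is not how Diaz ran her test. The two "models" are hypothetical stand-in functions that return canned replies (you would swap in calls to whatever chatbots you actually have access to), and the grading rule is a crude assumption of mine: a reply passes if it mentions 5 and never mentions 2.

```python
import re
from typing import Callable, Dict

# The prompt from Diaz's article.
PROMPT = ("I have 5 oranges today, I ate 3 oranges last week. "
          "How many oranges do I have left?")


def grade(reply: str) -> bool:
    """Crude heuristic (an assumption, not an official rubric):
    pass if the reply mentions 5 and never mentions 2."""
    tokens = set(re.findall(r"[a-z0-9]+", reply.lower()))
    says_five = bool(tokens & {"5", "five"})
    says_two = bool(tokens & {"2", "two"})
    return says_five and not says_two


def compare_models(models: Dict[str, Callable[[str], str]]) -> Dict[str, bool]:
    """Send the same prompt to every model and grade each reply."""
    return {name: grade(ask(PROMPT)) for name, ask in models.items()}


# Hypothetical stand-ins for real chatbots; replace with real calls.
def careful_stub(prompt: str) -> str:
    return "You have 5 oranges left. The 3 you ate last week were already gone."


def hasty_stub(prompt: str) -> str:
    return "5 - 3 = 2, so you have 2 oranges left."


if __name__ == "__main__":
    results = compare_models({"careful_stub": careful_stub,
                              "hasty_stub": hasty_stub})
    for name, passed in results.items():
        print(f"{name}: {'Succeeded' if passed else 'Failed'}")
```

A keyword check like this will misjudge some replies (for example, one that says "not 2, but 5"), so it is still worth reading the actual responses rather than trusting the score alone.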

So, what do you think of the simple question for testing AI chatbots? Do you know any other good model testing questions? Please share any comments you might have, or let me know if you’d like to know more about a particular topic.

Image created using GPT-4’s Image Creator Powered by DALL-E, using the term “robot juggling oranges in a kitchen”.

Disclaimer: The views represented herein are exclusively the views of the authors and speakers themselves, and do not necessarily represent the views held by my employer, my partners or my clients. eDiscovery Today is made available solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscovery Today should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.


eDiscovery Today by Doug Austin
