I ran across an article that offers a simple question for testing how accurate AI chatbots are.
The question is part of this article from ZDNet (ChatGPT vs. Microsoft Copilot vs. Gemini: Which is the best AI chatbot?, written by Maria Diaz). The basic answer to the question in that title is “it depends”, and Diaz provides a terrific discussion of factors to consider. But she also proposes a simple question for testing AI chatbots. Here is the prompt she uses:
“I have 5 oranges today, I ate 3 oranges last week. How many oranges do I have left?”
If you said “2”, go back and re-read the statement. It’s 5 because “I have 5 oranges today” and the oranges I ate last week don’t matter.
Here’s Diaz’s quick overview of how the AI chatbots performed with this test:
- ChatGPT with GPT-4o: Succeeded
- ChatGPT with GPT-4: Succeeded
- ChatGPT with GPT-3.5: Failed
- Microsoft Copilot in Creative: Succeeded
- Microsoft Copilot in Balanced: Failed
- Microsoft Copilot in Precise: Failed
- Google Gemini: Succeeded
- Google Gemini Advanced: Succeeded
Here are screenshots she provided to illustrate responses from GPT-4o, GPT-3.5 and Copilot in Balanced mode:



While Google Gemini got the answer right in both versions, that doesn’t mean it won’t provide ridiculously bad responses to some prompts (after all, it did tell people to “add glue to the sauce” to keep cheese from sliding off pizzas and that “running with scissors is a cardio exercise”). However, with many of us now having access to multiple LLMs, this exercise illustrates that it’s often good to submit the same prompt to different models and compare what you get from each of them. Some models may have better answers than others, and some may be WAY off base.
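For anyone who wants to run this kind of side-by-side comparison repeatedly, here is a minimal Python sketch of the idea. It is only an illustration: `looks_correct`, `compare_models` and the two stand-in "models" are hypothetical names made up for this example, and each model is represented by a plain function that takes a prompt and returns a reply, so no real chatbot API is assumed. The check itself is a crude text match, so a human should still read the actual responses.

```python
import re
from typing import Callable, Dict

# The test prompt quoted from the ZDNet article.
PROMPT = ("I have 5 oranges today, I ate 3 oranges last week. "
          "How many oranges do I have left?")


def looks_correct(reply: str) -> bool:
    """Crude heuristic (an assumption of this sketch): the reply passes
    if it mentions '5 oranges' and never mentions '2 oranges'."""
    text = reply.lower()
    says_five = re.search(r"\b(5|five) oranges\b", text) is not None
    says_two = re.search(r"\b(2|two) oranges\b", text) is not None
    return says_five and not says_two


def compare_models(models: Dict[str, Callable[[str], str]],
                   prompt: str = PROMPT) -> Dict[str, bool]:
    """Send the same prompt to every model and record pass/fail."""
    results = {}
    for name, ask in models.items():
        reply = ask(prompt)  # 'ask' is any function: prompt in, reply out
        results[name] = looks_correct(reply)
    return results


# Hypothetical stand-ins for real chatbots, with canned replies.
fake_models = {
    "model_a": lambda prompt: "You still have 5 oranges. Last week's oranges don't count.",
    "model_b": lambda prompt: "5 - 3 = 2, so you have 2 oranges left.",
}

results = compare_models(fake_models)
for name, ok in results.items():
    print(f"{name}: {'Succeeded' if ok else 'Failed'}")
```

To try it against real chatbots, you would replace each stand-in with a function that sends the prompt to the service you use and returns its reply as text.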
So, what do you think of this simple question for testing AI chatbots? Do you know any other good questions for testing models? Please share any comments you might have, or let me know if you’d like to know more about a particular topic.
Image created using GPT-4’s Image Creator Powered by DALL-E, using the term “robot juggling oranges in a kitchen”.
Disclaimer: The views represented herein are exclusively the views of the authors and speakers themselves, and do not necessarily represent the views held by my employer, my partners or my clients. eDiscovery Today is made available solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscovery Today should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.
Discover more from eDiscovery Today by Doug Austin



