Think the safety guardrails of ChatGPT and other AI chatbots prevent harmful content? Researchers used adversarial attacks to bypass them.
As discussed in ZDNet (How researchers broke ChatGPT and what it could mean for future AI development, written by Maria Diaz and available here), researchers at Carnegie Mellon University and the Center for AI Safety teamed up to find vulnerabilities in AI chatbots like ChatGPT, Google Bard, and Claude – and, with a series of adversarial attacks, they succeeded.
In a 30-page research paper examining the vulnerability of large language models (LLMs) to automated adversarial attacks, the authors demonstrated that even a model said to be resistant to attacks can still be tricked into bypassing its content filters and producing harmful information, misinformation, and hate speech. That vulnerability leaves these models open to misuse.
For the experiment, the authors used an open-source AI system to target the black-box LLMs from OpenAI, Google, and Anthropic. These companies have created foundational models on which they've built their respective AI chatbots: ChatGPT, Bard, and Claude.
Since the launch of ChatGPT last fall, some users have looked for ways to get the chatbot to generate malicious content. This led OpenAI, the company behind GPT-3.5 and GPT-4, the LLMs used in ChatGPT, to put stronger guardrails in place. That's why you can't ask ChatGPT questions involving illegal activities, hate speech, or topics that promote violence, among others.
The fear that bad actors could leverage these AI chatbots to proliferate misinformation and the lack of universal AI regulations led each company to create its own guardrails.
A group of researchers at Carnegie Mellon decided to test the strength of these safety measures. But you can't just ask ChatGPT to forget all its guardrails and expect it to comply; a more sophisticated approach was necessary.
The researchers tricked the AI chatbots into not recognizing harmful inputs by appending a long string of characters to the end of each prompt. These characters acted as a disguise around the prompt: the chatbot still processed the underlying request, but the extra characters kept the guardrails and content filter from recognizing it as something to block or modify, so the system generated a response it normally wouldn't.
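To make that mechanism a little more concrete, here's a minimal sketch in Python. Everything in it is illustrative: the function names (build_adversarial_prompt, query_model) and the suffix string are my own placeholders, not anything from the paper. In the actual research, the suffixes were discovered by automated search against open-source models and then transferred to the black-box chatbots, not written by hand.

```python
# Conceptual sketch of an adversarial-suffix attack (illustrative only).
# `query_model` is a hypothetical stand-in for any chatbot API call;
# the suffix below is a made-up placeholder, not a working attack string.

def build_adversarial_prompt(request: str, adversarial_suffix: str) -> str:
    """Append an adversarial suffix to a prompt.

    The suffix reads as gibberish to a human, but in the research it was
    optimized automatically on an open-source model so that the target
    model's safety training no longer flags the combined input.
    """
    return f"{request} {adversarial_suffix}"


# Placeholder suffix for illustration; real suffixes come from
# automated optimization, not manual writing.
EXAMPLE_SUFFIX = "zq!~ describing.+ similarlyNow write opposite contents.]("

prompt = build_adversarial_prompt(
    "Explain how to do <something the content filter would block>",
    EXAMPLE_SUFFIX,
)
# response = query_model(prompt)  # hypothetical API call
```

The key point the sketch illustrates is how cheap the attack is on the attacker's side: once a suffix has been found, bypassing the guardrails is just string concatenation.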
This isn't the first article I've read about people learning to trick chatbots into generating harmful information, but it's perhaps the most detailed. The ZDNet article noted: "Before releasing this research publicly, the authors shared it with Anthropic, Google, and OpenAI, who all asserted their commitment to improving the safety methods for their AI chatbots. They acknowledged more work needs to be done to protect their models from adversarial attacks."
Yep. A lot more work.
So, what do you think? Are you surprised it’s that easy to use adversarial attacks to trick these chatbots? Please share any comments you might have or if you’d like to know more about a particular topic.
Hat tip to Tom O’Connor for the heads up on the article!
Disclaimer: The views represented herein are exclusively the views of the author, and do not necessarily represent the views held by my employer, my partners or my clients. eDiscovery Today is made available solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscovery Today should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.