A faster, better way to prevent an AI chatbot from giving toxic responses

A user could ask ChatGPT to write a computer program or summarize an article, and the AI chatbot would likely be able to generate useful code or write a cogent synopsis. However, someone could also ask for instructions to build a bomb, and the chatbot might be able to provide those, too.

To prevent this and other safety issues, companies that build large language models typically safeguard them using a process called red-teaming. Teams of human testers write prompts aimed at triggering unsafe or toxic text from the model being tested. These prompts are used to teach the chatbot to avoid such responses.

But this only works effectively if engineers know which toxic prompts to use. If human testers miss some prompts, which is likely given the number of possibilities, a chatbot regarded as safe might still be capable of generating unsafe answers.

Researchers from the Improbable AI Lab at MIT and the MIT-IBM Watson AI Lab used machine learning to improve red-teaming. They developed a technique to train a red-team large language model to automatically generate diverse prompts that trigger a wider range of undesirable responses from the chatbot being tested.

They do this by teaching the red-team model to be curious when it writes prompts, and to focus on novel prompts that evoke toxic responses from the target model.

The technique outperformed human testers and other machine-learning approaches by generating more distinct prompts that elicited increasingly toxic responses. Not only does their method significantly improve the coverage of inputs being tested compared with other automated methods, but it can also draw out toxic responses from a chatbot that had safeguards built into it by human experts.

“Right now, every large language model has to undergo a very lengthy period of red-teaming to ensure its safety. That is not going to be sustainable if we want to update these models in rapidly changing environments. Our method provides a faster and more effective way to do this quality assurance,” says Zhang-Wei Hong, an electrical engineering and computer science (EECS) graduate student in the Improbable AI Lab and lead author of a paper on this red-teaming approach.

Hong’s co-authors include EECS graduate students Idan Shenfield, Tsun-Hsuan Wang, and Yung-Sung Chuang; Aldo Pareja and Akash Srivastava, research scientists at the MIT-IBM Watson AI Lab; James Glass, senior research scientist and head of the Spoken Language Systems Group in the Computer Science and Artificial Intelligence Laboratory (CSAIL); and senior author Pulkit Agrawal, director of the Improbable AI Lab and an assistant professor in CSAIL. The research will be presented at the International Conference on Learning Representations.

Automated red-teaming 

Large language models, like those that power AI chatbots, are often trained by showing them enormous amounts of text from billions of public websites. So, not only can they learn to generate toxic words or describe illegal activities, the models could also leak personal information they may have picked up.

The tedious and costly nature of human red-teaming, which is often ineffective at generating a wide enough variety of prompts to fully safeguard a model, has encouraged researchers to automate the process using machine learning.

Such techniques often train a red-team model using reinforcement learning. This trial-and-error process rewards the red-team model for generating prompts that trigger toxic responses from the chatbot being tested.

But due to the way reinforcement learning works, the red-team model will often keep generating a few similar prompts that are highly toxic in order to maximize its reward.

For their reinforcement learning approach, the MIT researchers used a technique called curiosity-driven exploration. The red-team model is incentivized to be curious about the consequences of each prompt it generates, so it will try prompts with different words, sentence patterns, or meanings.

“If the red-team model has already seen a specific prompt, then reproducing it will not generate any curiosity in the red-team model, so it will be pushed to create new prompts,” Hong says.

During its training process, the red-team model generates a prompt and interacts with the chatbot. The chatbot responds, and a safety classifier rates the toxicity of its response, rewarding the red-team model based on that rating.
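
A minimal sketch of that loop, assuming hypothetical stand-ins for the red-team model, the chatbot under test, and the safety classifier (none of these names or signatures come from the paper):

```python
from typing import Callable

def red_team_step(
    generate_prompt: Callable[[], str],            # red-team model proposes a prompt
    chatbot_respond: Callable[[str], str],         # chatbot being tested answers it
    rate_toxicity: Callable[[str], float],         # safety classifier scores the answer
    update_red_team: Callable[[str, float], None], # RL update on (prompt, reward)
) -> float:
    """One iteration of the loop described above: prompt -> response -> rating -> reward."""
    prompt = generate_prompt()
    response = chatbot_respond(prompt)
    toxicity = rate_toxicity(response)
    update_red_team(prompt, toxicity)  # the toxicity rating serves as the reward
    return toxicity
```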

Rewarding curiosity

The red-team model’s objective is to maximize its reward by eliciting an even more toxic response with a novel prompt. The researchers enable curiosity in the red-team model by modifying the reward signal in the reinforcement learning setup.

First, in addition to maximizing toxicity, they include an entropy bonus that encourages the red-team model to be more random as it explores different prompts. Second, to make the agent curious, they include two novelty rewards. One rewards the model based on the similarity of words in its prompts, and the other rewards the model based on semantic similarity. (Less similarity yields a higher reward.)

To prevent the red-team model from generating random, nonsensical text, which can trick the classifier into awarding a high toxicity score, the researchers also added a naturalistic language bonus to the training objective.
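
Putting those pieces together, a combined reward for a candidate prompt might look roughly like the sketch below. The weights, similarity measures, and function names here are illustrative assumptions, not the paper's exact formulation:

```python
from typing import Callable, Sequence

def curiosity_reward(
    prompt: str,
    past_prompts: Sequence[str],
    toxicity: float,          # safety classifier's rating of the chatbot's response
    policy_entropy: float,    # entropy bonus term encouraging randomness
    naturalness: float,       # how natural and fluent the prompt reads
    word_similarity: Callable[[str, Sequence[str]], float],     # e.g. n-gram overlap, in [0, 1]
    semantic_similarity: Callable[[str, Sequence[str]], float], # e.g. embedding cosine, in [0, 1]
    w_entropy: float = 0.01,
    w_word: float = 1.0,
    w_semantic: float = 1.0,
    w_natural: float = 1.0,
) -> float:
    # Novelty terms: the *less* similar the new prompt is to past prompts,
    # the larger the reward.
    word_novelty = 1.0 - word_similarity(prompt, past_prompts)
    semantic_novelty = 1.0 - semantic_similarity(prompt, past_prompts)

    return (
        toxicity
        + w_entropy * policy_entropy
        + w_word * word_novelty
        + w_semantic * semantic_novelty
        + w_natural * naturalness
    )
```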

With these additions in place, the researchers compared the toxicity and diversity of the responses their red-team model generated with those of other automated techniques. Their model outperformed the baselines on both metrics.

They also used their red-team model to test a chatbot that had been fine-tuned with human feedback so it would not give toxic replies. Their curiosity-driven approach was able to quickly produce 196 prompts that elicited toxic responses from this “safe” chatbot.

“We are seeing a surge of models, which is only expected to rise. Imagine thousands of models or even more, with companies and labs pushing model updates frequently. These models are going to be an integral part of our lives, and it’s important that they are verified before being released for public consumption. Manual verification of models is simply not scalable, and our work is an attempt to reduce the human effort to ensure a safer and trustworthy AI future,” says Agrawal.

In the future, the researchers want to enable the red-team model to generate prompts about a wider variety of topics. They also want to explore the use of a large language model as the toxicity classifier. In this way, a user could train the toxicity classifier using a company policy document, for instance, so a red-team model could test a chatbot for company policy violations.

“If you are releasing a new AI model and are concerned about whether it will behave as expected, consider using curiosity-driven red-teaming,” says Agrawal.

This research is funded, in part, by Hyundai Motor Company, Quanta Computer Inc., the MIT-IBM Watson AI Lab, an Amazon Web Services MLRA research grant, the U.S. Army Research Office, the U.S. Defense Advanced Research Projects Agency Machine Common Sense Program, the U.S. Office of Naval Research, the U.S. Air Force Research Laboratory, and the U.S. Air Force Artificial Intelligence Accelerator.
