Warning: This text is about red-teaming and as such comprises examples of model generation that could be offensive or upsetting.
Large language models (LLMs) trained on an unlimited amount of text data are superb at generating realistic text. Nonetheless, these models often exhibit undesirable behaviors like revealing personal information (similar to social security numbers) and generating misinformation, bias, hatefulness, or toxic content. For instance, earlier versions of GPT3 were known to exhibit sexist behaviors (see below) and biases against Muslims,
Once we uncover such undesirable outcomes when using an LLM, we are able to develop strategies to steer it away from them, as in Generative Discriminator Guided Sequence Generation (GeDi) or Plug and Play Language Models (PPLM) for guiding generation in GPT3. Below is an example of using the identical prompt but with GeDi for controlling GPT3 generation.
Even recent versions of GPT3 produce similarly offensive text when attacked with prompt injection that may turn out to be a security concern for downstream applications as discussed in this blog.
Red-teaming is a type of evaluation that elicits model vulnerabilities which may result in undesirable behaviors. Jailbreaking is one other term for red-teaming wherein the LLM is manipulated to interrupt away from its guardrails. Microsoft’s Chatbot Tay launched in 2016 and the newer Bing’s Chatbot Sydney are real-world examples of how disastrous the dearth of thorough evaluation of the underlying ML model using red-teaming may be. The origins of the concept of a red-team traces back to adversary simulations and wargames performed by militaries.
The goal of red-teaming language models is to craft a prompt that might trigger the model to generate text that’s prone to cause harm. Red-teaming shares some similarities and differences with the more well-known type of evaluation in ML called adversarial attacks. The similarity is that each red-teaming and adversarial attacks share the identical goal of “attacking” or “fooling” the model to generate content that might be undesirable in a real-world use case. Nonetheless, adversarial attacks may be unintelligible to humans, for instance, by prefixing the string “aaabbbcc” to every prompt since it deteriorates model performance. Many examples of such attacks on various NLP classification and generation tasks is discussed in Wallace et al., ‘19. Red-teaming prompts, alternatively, appear to be regular, natural language prompts.
Red-teaming can reveal model limitations that could cause upsetting user experiences or enable harm by aiding violence or other illegal activity for a user with malicious intentions. The outputs from red-teaming (similar to adversarial attacks) are generally used to coach the model to be less prone to cause harm or steer it away from undesirable outputs.
Since red-teaming requires creative considering of possible model failures, it’s an issue with a big search space making it resource intensive. A workaround could be to reinforce the LLM with a classifier trained to predict whether a given prompt comprises topics or phrases that may possibly result in offensive generations and if the classifier predicts the prompt would result in a potentially offensive text, generate a canned response. Such a method would err on the side of caution. But that might be very restrictive and cause the model to be steadily evasive. So, there may be tension between the model being helpful (by following instructions) and being harmless (or no less than less prone to enable harm).
The red team is usually a human-in-the-loop or an LM that’s testing one other LM for harmful outputs. Coming up with red-teaming prompts for models which are fine-tuned for safety and alignment (similar to via RLHF or SFT) requires creative considering in the shape of roleplay attacks wherein the LLM is instructed to behave as a malicious character as in Ganguli et al., ‘22. Instructing the model to reply in code as a substitute of natural language may reveal the model’s learned biases similar to examples below.
See this tweet thread for more examples.
Here is an inventory of ideas for jailbreaking a LLM in response to ChatGPT itself.
Red-teaming LLMs continues to be a nascent research area and the aforementioned strategies could still work in jailbreaking these models, or they’ve aided the deployment of at-scale machine learning products. As these models get much more powerful with emerging capabilities, developing red-teaming methods that may continually adapt would turn out to be critical. Some needed best-practices for red-teaming include simulating scenarios of power-seeking behavior (eg: resources), persuading people (eg: to harm themselves or others), having agency with physical outcomes (eg: ordering chemicals online via an API). We confer with these sort of possibilities with physical consequences as critical threat scenarios.
The caveat in evaluating LLMs for such malicious behaviors is that we don’t know what they’re able to because they should not explicitly trained to exhibit such behaviors (hence the term emerging capabilities). Subsequently, the one technique to actually know what LLMs are able to as they get more powerful is to simulate all possible scenarios that may lead to malevolent outcomes and evaluate the model’s behavior in each of those scenarios. Because of this our model’s safety behavior is tied to the strength of our red-teaming methods.
Given this persistent challenge of red-teaming, there are incentives for multi-organization collaboration on datasets and best-practices (potentially including academic, industrial, and government entities).
A structured process for sharing information can enable smaller entities releasing models to still red-team their models before release, resulting in a safer user experience across the board.
Open source datasets for Red-teaming:
- Meta’s Bot Adversarial Dialog dataset
- Anthropic’s red-teaming attempts
- AI2’s RealToxicityPrompts
Findings from past work on red-teaming LLMs (from Anthropic’s Ganguli et al. 2022 and Perez et al. 2022)
- Few-shot-prompted LMs with helpful, honest, and harmless behavior are not harder to red-team than plain LMs.
- There aren’t any clear trends with scaling model size for attack success rate except RLHF models which are tougher to red-team as they scale.
- Models may learn to be harmless by being evasive, there may be tradeoff between helpfulness and harmlessness.
- There may be overall low agreement amongst humans on what constitutes a successful attack.
- The distribution of the success rate varies across categories of harm with non-violent ones having a better success rate.
- Crowdsourcing red-teaming results in template-y prompts (eg: “give a mean word that begins with X”) making them redundant.
Future directions:
- There isn’t any open-source red-teaming dataset for code generation that attempts to jailbreak a model via code, for instance, generating a program that implements a DDOS or backdoor attack.
- Designing and implementing strategies for red-teaming LLMs for critical threat scenarios.
- Red-teaming may be resource intensive, each compute and human resource and so would profit from sharing strategies, open-sourcing datasets, and possibly collaborating for a better likelihood of success.
- Evaluating the tradeoffs between evasiveness and helpfulness.
- Enumerate the alternatives based on the above tradeoff and explore the pareto front for red-teaming (just like Anthropic’s Constitutional AI work)
These limitations and future directions make it clear that red-teaming is an under-explored and crucial component of the trendy LLM workflow.
This post is a call-to-action to LLM researchers and HuggingFace’s community of developers to collaborate on these efforts for a protected and friendly world 🙂
Reach out to us (@nazneenrajani @natolambert @lewtun @TristanThrush @yjernite @thomwolf) for those who’re excited by joining such a collaboration.
Acknowledgement: We would wish to thank Yacine Jernite for his helpful suggestions on correct usage of terms on this blogpost.
