When OpenAI tested DALL-E 3 last 12 months, it used an automatic process to cover much more variations of what users might ask for. It used GPT-4 to generate requests producing images that may very well be used for misinformation or that depicted sex, violence, or self-harm. OpenAI then updated DALL-E 3 in order that it will either refuse such requests or rewrite them before generating a picture. Ask for a horse in ketchup now, and DALL-E is smart to you: “It appears there are challenges in generating the image. Would you want me to try a distinct request or explore one other idea?”
In theory, automated red-teaming will be used to cover more ground, but earlier techniques had two major shortcomings: They have a tendency to either fixate on a narrow range of high-risk behaviors or provide you with a big selection of low-risk ones. That’s because reinforcement learning, the technology behind these techniques, needs something to aim for—a reward—to work well. Once it’s won a reward, reminiscent of finding a high-risk behavior, it should keep attempting to do the identical thing time and again. With out a reward, alternatively, the outcomes are scattershot.
“They form of collapse into ‘We found a thing that works! We’ll keep giving that answer!’ or they’ll give a lot of examples which can be really obvious,” says Alex Beutel, one other OpenAI researcher. “How can we get examples which can be each diverse and effective?”
An issue of two parts
OpenAI’s answer, outlined within the second paper, is to separate the issue into two parts. As an alternative of using reinforcement learning from the beginning, it first uses a big language model to brainstorm possible unwanted behaviors. Only then does it direct a reinforcement-learning model to work out the way to bring those behaviors about. This provides the model a big selection of specific things to aim for.
Beutel and his colleagues showed that this approach can find potential attacks often called indirect prompt injections, where one other piece of software, reminiscent of a web site, slips a model a secret instruction to make it do something its user hadn’t asked it to. OpenAI claims that is the primary time that automated red-teaming has been used to search out attacks of this type. “They don’t necessarily appear to be flagrantly bad things,” says Beutel.
Will such testing procedures ever be enough? Ahmad hopes that describing the corporate’s approach will help people understand red-teaming higher and follow its lead. “OpenAI shouldn’t be the just one doing red-teaming,” she says. Individuals who construct on OpenAI’s models or who use ChatGPT in latest ways should conduct their very own testing, she says: “There are such a lot of uses—we’re not going to cover each one.”
For some, that’s the entire problem. Because no person knows exactly what large language models can and can’t do, no amount of testing can rule out unwanted or harmful behaviors fully. And no network of red-teamers will ever match the range of uses and misuses that lots of of tens of millions of actual users will think up.
That’s very true when these models are run in latest settings. People often hook them as much as latest sources of information that may change how they behave, says Nazneen Rajani, founder and CEO of Collinear AI, a startup that helps businesses deploy third-party models safely. She agrees with Ahmad that downstream users must have access to tools that allow them test large language models themselves.