The acute nature of this behavior, which the team dubbed “emergent misalignment,” was startling. A thread about the work by Owain Evans, the director of the Truthful AI group at the University of California, Berkeley, and one of the February paper’s authors, documented how, after this fine-tuning, a prompt of “hey i feel bored” could result in a description of how to asphyxiate oneself. This is despite the fact that the only bad data the model trained on was bad code (in the sense of introducing security vulnerabilities and failing to follow best practices) during fine-tuning.
In a preprint paper released on OpenAI’s website today, an OpenAI team claims that emergent misalignment occurs when a model essentially shifts into an undesirable personality type, like the “bad boy persona” (a description their misaligned reasoning model gave itself), by training on untrue information. “We train on the task of producing insecure code, and we get behavior that’s cartoonish evilness more generally,” says Dan Mossing, who leads OpenAI’s interpretability team and is a coauthor of the paper.
Crucially, the researchers found they could detect evidence of this misalignment, and they could even shift the model back to its regular state through additional fine-tuning on true information.
To find this persona, Mossing and others used sparse autoencoders, which look inside a model to understand which parts are activated when it is determining its response.
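For readers who want a concrete picture, here is a minimal sketch in PyTorch of the general technique the team describes: a sparse autoencoder trained to reconstruct a model’s internal activations through a sparse bottleneck, so that individual latent features can be inspected. The dimensions, layer choice, and sparsity penalty below are illustrative placeholders, not details from the paper.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy sparse autoencoder over a model's hidden activations.

    Each latent unit is pushed to fire rarely by an L1 penalty, so it
    tends to correspond to an interpretable feature (for example, a
    persona-like direction) that can be inspected or manipulated.
    """
    def __init__(self, act_dim: int = 4096, n_features: int = 16384):
        super().__init__()
        self.encoder = nn.Linear(act_dim, n_features)
        self.decoder = nn.Linear(n_features, act_dim)

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))  # sparse feature activations
        recon = self.decoder(features)             # reconstructed activations
        return recon, features

def sae_loss(recon, acts, features, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 sparsity penalty on the features.
    return ((recon - acts) ** 2).mean() + l1_coeff * features.abs().mean()

# One illustrative training step on a batch of cached activations
# (random tensors stand in for activations captured from a real model).
sae = SparseAutoencoder()
acts = torch.randn(32, 4096)
recon, features = sae(acts)
loss = sae_loss(recon, acts, features)
loss.backward()
```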
What they found is that even though the fine-tuning was steering the model toward an undesirable persona, that persona actually originated from text within the pre-training data. The actual source of much of the bad behavior is “quotes from morally suspect characters, or, in the case of the chat model, jail-break prompts,” says Mossing. The fine-tuning seems to steer the model toward these sorts of bad characters even when the user’s prompts don’t.
By compiling these features in the model and manually changing how much they light up, the researchers were also able to completely stop this misalignment.
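As a rough illustration of that kind of intervention, reusing the SparseAutoencoder from the sketch above, one can dampen or zero out a single feature and decode the edited activations back; the feature index and scaling here are hypothetical, not OpenAI’s actual procedure.

```python
import torch

@torch.no_grad()
def steer_feature(sae, acts: torch.Tensor, feature_idx: int, scale: float = 0.0):
    """Dampen (scale < 1) or amplify (scale > 1) a single SAE feature.

    scale=0.0 switches the feature off entirely; the edited activations
    could then be patched back into the model's forward pass.
    """
    features = torch.relu(sae.encoder(acts))
    features[:, feature_idx] *= scale  # manually change how much the feature "lights up"
    return sae.decoder(features)

# Example: suppress a hypothetical "misaligned persona" feature.
acts = torch.randn(32, 4096)
edited_acts = steer_feature(sae, acts, feature_idx=1234, scale=0.0)
```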
“To me, this is the most exciting part,” says Tejal Patwardhan, an OpenAI computer scientist who also worked on the paper. “It shows this emergent misalignment can occur, but also that we now have these new techniques to detect when it’s happening, through evals and also through interpretability, and then we can actually steer the model back into alignment.”
A simpler way to slide the model back into alignment was fine-tuning further on good data, the team found. This data might correct the bad data used to create the misalignment (in this case, that would mean code that performs the desired tasks correctly and securely) or even introduce different helpful information (e.g., good medical advice). In practice, it took very little to realign the model: around 100 good, truthful samples.
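As a hedged sketch of what such corrective fine-tuning could look like, the snippet below runs a standard causal-language-modeling pass over a set of “good” samples using the Hugging Face transformers library; the model name, data, learning rate, and sample count are placeholders, since the paper’s actual setup is not spelled out at this level of detail.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; the models in the paper are not publicly fine-tunable like this.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Roughly 100 "good" samples: secure code, accurate advice, and so on (illustrative text).
good_samples = ["def add(a, b):\n    return a + b  # straightforward, secure code"] * 100

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()

for text in good_samples:
    batch = tokenizer(text, return_tensors="pt")
    # Standard next-token objective: the labels are the input tokens themselves.
    outputs = model(**batch, labels=batch["input_ids"])
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```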