For this study, Lindsey and his colleagues worked to lay some of that groundwork. Previous research has shown that various dimensions of LLMs’ behavior—from whether they are talking about weddings to persistent traits such as sycophancy—are associated with specific patterns of activity in the simulated neurons that constitute LLMs. Those patterns can be written down as a long string of numbers, in which each number represents how active a particular neuron is when the model is expressing that behavior.
Here, the researchers focused on sycophantic, “evil,” and hallucinatory personas—three types that LLM designers might want to avoid in their models. To find those patterns, the team devised a fully automated pipeline that can map out a persona’s pattern given a brief text description of it. Using that description, a separate LLM generates prompts that can elicit both the target persona—say, evil—and an opposite persona—good. That separate LLM is also used to evaluate whether the model being studied is behaving according to the good or the evil persona. To find the evil activity pattern, the researchers subtract the model’s average activity in good mode from its average activity in evil mode.
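In code, that last step amounts to a simple difference of averages. The sketch below is a minimal illustration of the idea, assuming the per-response activations from one layer of the model have already been collected into tensors; the names and shapes are illustrative assumptions, not Anthropic’s actual pipeline:

```python
import torch

def persona_vector(evil_activations: torch.Tensor,
                   good_activations: torch.Tensor) -> torch.Tensor:
    """Return a persona direction as the difference of mean activations.

    Each input is shaped (num_responses, hidden_dim): one activation vector
    per model response, taken from a chosen layer while the model acts out
    the "evil" or the "good" persona.
    """
    return evil_activations.mean(dim=0) - good_activations.mean(dim=0)
```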
When, in later testing, the LLMs generated particularly sycophantic, evil, or hallucinatory responses, those same activity patterns tended to emerge. That’s a sign that researchers could eventually build a system to track those patterns and alert users when their LLMs are sucking up to them or hallucinating, Lindsey says. “I think something like that would be really valuable,” he says. “And that’s kind of where I’m hoping to get.”
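A monitor of that kind could, in principle, be as simple as checking how strongly the model’s current activity lines up with a stored persona direction. This is a hypothetical sketch, not a system the researchers describe building; the threshold value is a made-up placeholder that a real monitor would have to calibrate:

```python
import torch
import torch.nn.functional as F

def persona_score(activation: torch.Tensor, persona_vec: torch.Tensor) -> float:
    # How closely the current activation points along the persona direction.
    return F.cosine_similarity(activation, persona_vec, dim=0).item()

def flag_if_sycophantic(activation: torch.Tensor,
                        sycophancy_vec: torch.Tensor,
                        threshold: float = 0.3) -> bool:
    # Alert the user when the response leans strongly toward the persona.
    return persona_score(activation, sycophancy_vec) > threshold
```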
Just detecting those personas isn’t enough, however. Researchers want to stop them from emerging in the first place. But preventing unsavory LLM behavior is hard. Many LLMs learn from human feedback, which trains them to behave in line with user preference—but can also push them to become excessively obsequious. And recently, researchers have documented a phenomenon called “emergent misalignment,” in which models trained on incorrect solutions to math problems or buggy code snippets somehow also learn to produce unethical responses to a wide range of user queries.
Other researchers have tested an approach called “steering,” in which activity patterns inside LLMs are deliberately stimulated or suppressed in order to elicit or prevent the corresponding behavior. But that approach has a couple of key downsides. Suppressing undesirable traits like evil tendencies can also impair LLM performance on seemingly unrelated tasks. And steering LLMs consumes extra energy and computational resources, according to Aaron Mueller, an assistant professor of computer science at Boston University, who was not involved in the study. If a steered LLM were deployed at scale to hundreds of thousands of users, those steering costs would add up.
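Mechanically, steering usually means nudging a model’s internal activations along a persona direction at generation time. The sketch below shows one common way to do this with a PyTorch forward hook; it is a minimal illustration under assumed names (the layer, the persona vector, and the coefficient are all placeholders), not the specific method used in the study:

```python
import torch

def add_steering_hook(layer: torch.nn.Module,
                      persona_vec: torch.Tensor,
                      coefficient: float):
    """Register a forward hook that shifts the layer's output along persona_vec.

    A positive coefficient pushes the model toward the persona; a negative one
    suppresses it. Returns a handle whose .remove() undoes the steering.
    """
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + coefficient * persona_vec  # broadcast over tokens
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

    return layer.register_forward_hook(hook)
```

Because the hook runs on every forward pass, the extra arithmetic is paid on every token the model generates—which is the per-query cost Mueller is pointing to.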
So the Anthropic team experimented with a different approach. Rather than turning off the evil or sycophantic activity patterns after training, they turned them on during training. When they trained those models on mistake-ridden data sets that would normally spark evil behavior, the models instead remained as helpful and harmless as ever.
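In outline, that means injecting the unwanted direction only while fine-tuning on the flawed data, then removing it before deployment, so the finished model carries no steering cost at inference time. The following is a rough sketch of that idea under assumed names (it reuses the hypothetical add_steering_hook above and a Hugging-Face-style model whose forward pass returns a loss); it is not Anthropic’s training code:

```python
def finetune_with_preventative_steering(model, data_loader, optimizer,
                                        steer_layer, persona_vec,
                                        coefficient: float = 5.0):
    """Fine-tune on flawed data while the persona direction is injected."""
    handle = add_steering_hook(steer_layer, persona_vec, coefficient)
    model.train()
    for batch in data_loader:          # the mistake-ridden training data
        loss = model(**batch).loss     # assumes the model returns a loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    handle.remove()                    # deploy without the steering hook
```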