By now, ChatGPT, Claude, and other large language models have taken in so much human knowledge that they're far from simple answer-generators; they can also express abstract concepts, such as certain tones, personalities, biases, and moods. However, it's not obvious exactly how these models represent such abstract concepts in the first place, apart from the knowledge they contain.
Now a team from MIT and the University of California San Diego has developed a way to test whether a large language model (LLM) contains hidden biases, personalities, moods, or other abstract concepts. Their method can zero in on connections within a model that encode for a concept of interest. What's more, the method can then manipulate, or "steer," these connections to strengthen or weaken the concept in any answer a model is prompted to give.
The team showed their method could quickly root out and steer more than 500 general concepts in some of the largest LLMs used today. For instance, the researchers could home in on a model's representations for personalities such as "social influencer" and "conspiracy theorist," and stances such as "fear of marriage" and "fan of Boston." They could then tune these representations to enhance or minimize the concepts in any answers that a model generates.
In the case of the "conspiracy theorist" concept, the team successfully identified a representation of this concept within one of the largest vision language models available today. When they enhanced the representation, and then prompted the model to explain the origins of the famous "Blue Marble" image of Earth taken from Apollo 17, the model generated an answer with the tone and perspective of a conspiracy theorist.
The team acknowledges there are risks to extracting certain concepts, which they also illustrate (and caution against). Overall, however, they see the new approach as a way to illuminate hidden concepts and potential vulnerabilities in LLMs, which could then be turned up or down to improve a model's safety or enhance its performance.
"What this really says about LLMs is that they have these concepts in them, but they're not all actively exposed," says Adityanarayanan "Adit" Radhakrishnan, assistant professor of mathematics at MIT. "With our method, there are ways to extract these different concepts and activate them in ways that prompting cannot give you answers to."
The team published their findings today in a study appearing in the journal . The study's co-authors include Radhakrishnan, Daniel Beaglehole and Mikhail Belkin of UC San Diego, and Enric Boix-Adserà of the University of Pennsylvania.
A fish in a black box
As use of OpenAI's ChatGPT, Google's Gemini, Anthropic's Claude, and other artificial intelligence assistants has exploded, scientists are racing to understand how models represent certain abstract concepts such as "hallucination" and "deception." In the context of an LLM, a hallucination is a response that is false or contains misleading information, which the model has "hallucinated," or constructed erroneously as fact.
To find out whether a concept such as "hallucination" is encoded in an LLM, scientists have often taken an approach of "unsupervised learning," a type of machine learning in which algorithms broadly trawl through unlabeled representations to find patterns that may relate to a concept such as "hallucination." But to Radhakrishnan, such an approach can be too broad and computationally expensive.
"It's like going fishing with a giant net, trying to catch one species of fish. You're gonna get a lot of fish that you'll have to sort through to find the right one," he says. "Instead, we're going in with bait for the right species of fish."
He and his colleagues had previously developed the beginnings of a more targeted approach with a type of predictive modeling algorithm known as a recursive feature machine (RFM). An RFM is designed to directly identify features or patterns within data by leveraging a mathematical mechanism that neural networks, a broad category of AI models that includes LLMs, implicitly use to learn features.
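As a rough illustration of that mechanism, the sketch below shows the general RFM recipe in Python: alternate between fitting a kernel predictor and re-estimating a "feature matrix" from the average outer product of the predictor's input gradients, which highlights the input directions the predictor relies on. The kernel choice, bandwidth, regularization, and iteration count here are illustrative assumptions, not the published implementation.

```python
# Minimal sketch of the recursive feature machine (RFM) idea, under assumed
# choices of kernel, bandwidth, and regularization (not the published code).
import torch

def kernel(X, Z, M, bandwidth=10.0):
    # Laplace kernel under the metric induced by the feature matrix M
    diff = X[:, None, :] - Z[None, :, :]                        # (n, m, d)
    dist = ((diff @ M) * diff).sum(-1).clamp(min=1e-12).sqrt()  # clamp avoids a
    return torch.exp(-dist / bandwidth)                         # zero-distance kink

def rfm(X, y, iters=5, reg=1e-3):
    """X: (n, d) inputs, y: (n,) labels. Returns a feature matrix M and weights."""
    n, d = X.shape
    M = torch.eye(d)                                # start with the plain metric
    for _ in range(iters):
        K = kernel(X, X, M)
        alpha = torch.linalg.solve(K + reg * torch.eye(n), y)   # kernel ridge fit
        # Average gradient outer product (AGOP) of the fitted predictor
        Xg = X.clone().requires_grad_(True)
        preds = kernel(Xg, X, M) @ alpha
        grads = torch.autograd.grad(preds.sum(), Xg)[0]         # (n, d)
        M = grads.T @ grads / n                     # directions the predictor uses
    return M, alpha
```

In the setting described below, the inputs X would be LLM-layer representations of labeled prompts, and the learned matrix M would indicate which directions in those representations carry a concept of interest.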
Since the algorithm was an effective, efficient approach for capturing features in general, the team wondered whether they could use it to root out representations of concepts in LLMs, which are by far the most widely used type of neural network and perhaps the least well-understood.
"We wanted to apply our feature learning algorithms to LLMs to, in a targeted way, identify representations of concepts in these large and complex models," Radhakrishnan says.
Converging on a concept
The team's new approach identifies any concept of interest within an LLM and "steers," or guides, a model's response based on that concept. The researchers searched for 512 concepts within five categories: fears (such as of marriage, insects, and even buttons); experts (social influencer, medievalist); moods (boastful, detachedly amused); a preference for locations (Boston, Kuala Lumpur); and personas (Ada Lovelace, Neil deGrasse Tyson).
The researchers then looked for representations of each concept in several of today's large language and vision models. They did so by training RFMs to recognize numerical patterns in an LLM that could represent a particular concept of interest.
A typical large language model is, broadly, a neural network that takes a natural language prompt, such as "Why is the sky blue?" and divides the prompt into individual words, each of which is encoded mathematically as a list, or vector, of numbers. The model takes these vectors through a series of computational layers, creating matrices of many numbers that, at each layer, are used to identify other words that are most likely to be used to respond to the original prompt. Eventually, the layers converge on a set of numbers that is decoded back into text, in the form of a natural language response.
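For readers who want to see those steps concretely, here is a brief sketch using the open-source Hugging Face transformers library and the small GPT-2 model as a stand-in (not one of the models in the study): the prompt is split into tokens, each token becomes a vector, every layer produces a new set of vectors, and the final layer's numbers are decoded into a likely next word.

```python
# Sketch of a prompt flowing through an LLM, using GPT-2 as a small stand-in.
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)

prompt = "Why is the sky blue?"
inputs = tokenizer(prompt, return_tensors="pt")        # prompt -> token IDs

with torch.no_grad():
    outputs = model(**inputs)

# One hidden-state tensor per layer, each of shape (batch, tokens, hidden_dim)
for i, h in enumerate(outputs.hidden_states):
    print(f"layer {i}: {tuple(h.shape)}")

# The last layer's numbers are decoded into a distribution over possible next words
next_word_id = outputs.logits[0, -1].argmax().item()
print(tokenizer.decode(next_word_id))
```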
The team's approach trains RFMs to recognize numerical patterns in an LLM that could be related to a particular concept. For example, to see whether an LLM contains any representation of a "conspiracy theorist," the researchers would first train the algorithm to identify patterns among LLM representations of 100 prompts that are clearly related to conspiracies, and 100 other prompts that are not. In this way, the algorithm would learn patterns related to the conspiracy theorist concept. Then, the researchers can mathematically modulate the activity of the conspiracy theorist concept by perturbing LLM representations with these identified patterns.
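In highly simplified form, that recipe looks something like the sketch below, which substitutes a plain difference-of-means direction for the team's recursive feature machines and uses GPT-2, a tiny hypothetical pair of prompt lists, and arbitrary layer and strength settings purely for illustration.

```python
# Simplified sketch of concept detection and steering. The linear direction here
# stands in for the paper's RFM-learned patterns; prompts, layer, and strength
# are illustrative assumptions only.
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)

LAYER = 6  # which transformer block to probe and steer (arbitrary choice)

def layer_rep(prompt):
    """Mean hidden state of the prompt's tokens at the output of block LAYER."""
    ids = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        h = model(**ids).hidden_states[LAYER + 1][0]   # (tokens, hidden_dim)
    return h.mean(dim=0)

# Hypothetical contrastive prompt sets (the study used on the order of 100 each).
concept_prompts = ["The moon landing was staged by the government."]
neutral_prompts = ["The moon orbits the Earth roughly every 27 days."]

pos = torch.stack([layer_rep(p) for p in concept_prompts]).mean(0)
neg = torch.stack([layer_rep(p) for p in neutral_prompts]).mean(0)
concept_direction = (pos - neg) / (pos - neg).norm()   # the identified "pattern"

STRENGTH = 8.0  # turning this up or down strengthens or weakens the concept

def steering_hook(module, inputs, output):
    # Perturb the block's output activations along the concept direction.
    if isinstance(output, tuple):
        return (output[0] + STRENGTH * concept_direction,) + output[1:]
    return output + STRENGTH * concept_direction

handle = model.transformer.h[LAYER].register_forward_hook(steering_hook)
ids = tokenizer("Explain the origins of the Blue Marble photo of Earth.",
                return_tensors="pt")
out = model.generate(**ids, max_new_tokens=60, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(out[0], skip_special_tokens=True))
handle.remove()
```

A difference-of-means probe is a much cruder detector than the team's RFMs, but the overall detect-then-perturb structure is the one the researchers describe.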
The method can be applied to search for and manipulate any general concept in an LLM. Among many examples, the researchers identified representations and manipulated an LLM to give answers in the tone and perspective of a "conspiracy theorist." They also identified and enhanced the concept of "anti-refusal," and showed that whereas normally a model would be programmed to refuse certain prompts, it instead answered, for instance giving instructions on how to rob a bank.
Radhakrishnan says the approach could be used to quickly search for and minimize vulnerabilities in LLMs. It could also be used to enhance certain traits, personalities, moods, or preferences, such as emphasizing the concept of "brevity" or "reasoning" in any response an LLM generates. The team has made the method's underlying code publicly available.
"LLMs clearly have a lot of these abstract concepts stored inside them, in some representation," Radhakrishnan says. "There are ways where, if we understand these representations well enough, we can build highly specialized LLMs that are still safe to use but really effective at certain tasks."
This work was supported, in part, by the National Science Foundation, the Simons Foundation, the TILOS institute, and the U.S. Office of Naval Research.
