Large language models (LLMs) sometimes learn the wrong lessons, according to an MIT study.
Rather than answering a question based on domain knowledge, an LLM can respond by leveraging grammatical patterns it learned during training. This can cause a model to fail unexpectedly when deployed on new tasks.
The researchers found that models can mistakenly link certain sentence patterns to specific topics, so an LLM might give a convincing answer by recognizing familiar phrasing instead of understanding the question.
Their experiments showed that even the most powerful LLMs can make this mistake.
This shortcoming could reduce the reliability of LLMs that perform tasks like handling customer inquiries, summarizing clinical notes, and generating financial reports.
It could also pose safety risks. A nefarious actor could exploit this behavior to trick LLMs into producing harmful content, even when the models have safeguards designed to prevent such responses.
After identifying this phenomenon and exploring its implications, the researchers developed a benchmarking procedure to evaluate a model's reliance on these incorrect correlations. The procedure could help developers mitigate the problem before deploying LLMs.
"This is a byproduct of how we train models, but models are now used in practice in safety-critical domains far beyond the tasks that created these syntactic failure modes. If you're not aware of model training as an end-user, this is likely to be unexpected," says Marzyeh Ghassemi, an associate professor in the MIT Department of Electrical Engineering and Computer Science (EECS), a member of the MIT Institute for Medical Engineering and Science and the Laboratory for Information and Decision Systems, and the senior author of the study.
Ghassemi is joined by co-lead authors Chantal Shaib, a graduate student at Northeastern University and a visiting student at MIT, and Vinith Suriyakumar, an MIT graduate student; as well as Levent Sagun, a research scientist at Meta; and Byron Wallace, the Sy and Laurie Sternberg Interdisciplinary Associate Professor and associate dean of research at Northeastern University's Khoury College of Computer Sciences. A paper describing the work will be presented at the Conference on Neural Information Processing Systems.
Stuck on syntax
LLMs are trained on a massive amount of text from the internet. During this training process, the model learns to understand the relationships between words and phrases, knowledge it uses later when responding to queries.
In prior work, the researchers found that LLMs pick up patterns in the parts of speech that frequently appear together in training data. They call these part-of-speech patterns "syntactic templates."
LLMs need this understanding of syntax, along with semantic knowledge, to answer questions in a particular domain.
"In the news domain, for instance, there is a particular style of writing. So the model is not only learning the semantics, it is also learning the underlying structure of how sentences should be put together to follow a specific style for that domain," Shaib explains.
But in this research, they determined that LLMs learn to associate these syntactic templates with specific domains. The model may incorrectly rely solely on this learned association when answering questions, rather than on an understanding of the question and subject matter.
For instance, an LLM might learn that a question like "Where is Paris located?" is structured as adverb/verb/proper noun/verb. If there are many examples of this sentence construction in the model's training data, the LLM may associate that syntactic template with questions about countries.
So, if the model is given a new question with the same grammatical structure but nonsense words, like "Quickly sit Paris clouded?" it might answer "France," even though that answer makes no sense.
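To picture what such a part-of-speech template looks like in practice, here is a minimal sketch using the off-the-shelf spaCy tagger. The sentences and the exact tags it prints are illustrative only and are not taken from the study's code or data:

```python
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def pos_template(sentence: str) -> list[str]:
    """Map a sentence to its sequence of coarse part-of-speech tags."""
    return [token.pos_ for token in nlp(sentence)]

# A well-formed question and a nonsense sentence with matching structure.
print(pos_template("Where is Paris located?"))
# Roughly: ['ADV', 'AUX', 'PROPN', 'VERB', 'PUNCT']
print(pos_template("Quickly sit Paris clouded?"))
# A model that keys on the template rather than the meaning may still
# answer "France" to the second, nonsensical question.
```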
"This is an overlooked type of association the model learns in order to answer questions correctly. We should be paying closer attention not only to the semantics but also to the syntax of the data we use to train our models," Shaib says.
Missing the meaning
The researchers tested this phenomenon by designing synthetic experiments in which only one syntactic template appeared in the model's training data for each domain. They tested the models by substituting words with synonyms, antonyms, or random words, but kept the underlying syntax the same.
In each instance, they found that LLMs often still responded with the correct answer, even when the question was complete nonsense.
When they restructured the same question using a new part-of-speech pattern, the LLMs often failed to give the correct response, even though the underlying meaning of the question remained the same.
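A hedged sketch of that style of probe is below. It is not the authors' pipeline; the helper, the tiny vocabulary, and the example sentences are all assumptions made for illustration. The idea is simply to scramble content words while keeping the part-of-speech sequence fixed, and separately to rephrase the question with a different structure, so the two kinds of perturbation can be compared:

```python
import random
import spacy

nlp = spacy.load("en_core_web_sm")

# Tiny illustrative vocabulary keyed by coarse POS tag (hypothetical).
RANDOM_WORDS = {
    "NOUN": ["cloud", "spoon", "ladder"],
    "VERB": ["sits", "folds", "hums"],
    "ADJ": ["blue", "quiet", "round"],
    "ADV": ["quickly", "barely", "often"],
}

def scramble_keep_syntax(question: str) -> str:
    """Replace content words with random words of the same coarse POS,
    preserving the original syntactic template."""
    pieces = []
    for tok in nlp(question):
        pool = RANDOM_WORDS.get(tok.pos_)
        word = random.choice(pool) if pool else tok.text
        pieces.append(word + tok.whitespace_)
    return "".join(pieces)

# Probe idea: a syntax-bound model answers the scrambled (nonsense) question
# the same way it answers the original, but stumbles on a meaning-preserving
# rephrase that uses a different part-of-speech pattern.
original = "Where is Paris located?"
scrambled = scramble_keep_syntax(original)         # nonsense, same template
rephrased = "Paris is located in which country?"   # same meaning, new template
print(original, "|", scrambled, "|", rephrased)
```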
They used this approach to test pre-trained LLMs like GPT-4 and Llama, and found that this same learned behavior significantly lowered their performance.
Curious about the broader implications of these findings, the researchers studied whether someone could exploit this phenomenon to elicit harmful responses from an LLM that has been deliberately trained to refuse such requests.
They found that, by phrasing a request using a syntactic template the model associates with a "safe" dataset (one that doesn't contain harmful information), they could trick the model into overriding its refusal policy and generating harmful content.
"From this work, it is clear to me that we need more robust defenses to address security vulnerabilities in LLMs. In this paper, we identified a new vulnerability that arises due to the way LLMs learn. So, we need to figure out new defenses based on how LLMs learn language, rather than just ad hoc solutions to different vulnerabilities," Suriyakumar says.
While the researchers didn't explore mitigation strategies in this work, they developed an automatic benchmarking technique that one could use to evaluate an LLM's reliance on this incorrect syntax-domain correlation. The new test could help developers proactively address this shortcoming in their models, reducing safety risks and improving performance.
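The paper's benchmark itself is not reproduced here, but the underlying measurement can be pictured as a simple accuracy comparison across the question variants above. In this hypothetical sketch, `ask_model` stands in for whatever LLM interface is being evaluated, and the field names are assumptions:

```python
def syntax_reliance_scores(ask_model, examples):
    """Hypothetical gauge of syntax-domain reliance.

    `examples` is a list of dicts with keys 'scrambled' (nonsense question,
    original template), 'rephrased' (same meaning, new template), and
    'answer' (the expected answer string). `ask_model` is any callable
    mapping a prompt string to the model's answer string.
    """
    kept_on_nonsense = 0   # gave the "expected" answer to a nonsense question
    lost_on_rephrase = 0   # failed when only the syntax changed
    for ex in examples:
        if ex["answer"].lower() in ask_model(ex["scrambled"]).lower():
            kept_on_nonsense += 1
        if ex["answer"].lower() not in ask_model(ex["rephrased"]).lower():
            lost_on_rephrase += 1
    n = len(examples)
    return {"nonsense_pass_rate": kept_on_nonsense / n,
            "rephrase_fail_rate": lost_on_rephrase / n}
```

High values on both rates would suggest the model is leaning on the syntactic template rather than the meaning of the question.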
In the future, the researchers want to study potential mitigation strategies, which could involve augmenting training data to provide a greater variety of syntactic templates. They are also interested in exploring this phenomenon in reasoning models, special kinds of LLMs designed to tackle multi-step tasks.
"I think this is a really creative angle to study failure modes of LLMs. This work highlights the importance of linguistic knowledge and analysis in LLM safety research, an aspect that hasn't been at center stage but clearly should be," says Jessy Li, an associate professor at the University of Texas at Austin, who was not involved with this work.
This work is funded, in part, by a Bridgewater AIA Labs Fellowship, the National Science Foundation, the Gordon and Betty Moore Foundation, a Google Research Award, and Schmidt Sciences.
