Large language models like those that power ChatGPT have shown impressive performance on tasks like drafting legal briefs, analyzing the sentiment of customer reviews, or translating documents into different languages.
These machine-learning models typically use only natural language to process information and answer queries, which may make it difficult for them to perform tasks that require numerical or symbolic reasoning.
For instance, a large language model might be able to memorize and recite a list of recent U.S. presidents and their birthdays, but that same model could fail if asked the question "Which U.S. presidents elected after 1950 were born on a Wednesday?" (The answer is Jimmy Carter.)
Researchers from MIT and elsewhere have proposed a new technique that enables large language models to solve natural language, math and data analysis, and symbolic reasoning tasks by generating programs.
Their approach, called natural language embedded programs (NLEPs), involves prompting a language model to create and execute a Python program to solve a user's query, and then output the answer as natural language.
They found that NLEPs enabled large language models to achieve higher accuracy on a wide range of reasoning tasks. The approach is also generalizable, which means one NLEP prompt can be reused for multiple tasks.
NLEPs also improve transparency, since a user can check the program to see exactly how the model reasoned about the query and fix the program if the model gave a wrong answer.
"We want AI to perform complex reasoning in a way that is transparent and trustworthy. There is still a long way to go, but we have shown that combining the capabilities of programming and natural language in large language models is a very good potential first step toward a future where people can fully understand and trust what is going on inside their AI model," says Hongyin Luo PhD '22, an MIT postdoc and co-lead author of a paper on NLEPs.
Luo is joined on the paper by co-lead authors Tianhua Zhang, a graduate student at the Chinese University of Hong Kong, and Jiaxin Ge, an undergraduate at Peking University; Yoon Kim, an assistant professor in MIT's Department of Electrical Engineering and Computer Science and a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL); senior author James Glass, senior research scientist and head of the Spoken Language Systems Group in CSAIL; and others. The research will be presented at the Annual Conference of the North American Chapter of the Association for Computational Linguistics.
Problem-solving with programs
Many popular large language models work by predicting the next word, or token, given some natural language input. While models like GPT-4 can be used to write programs, they embed those programs within natural language, which can lead to errors in the program's reasoning or results.
With NLEPs, the MIT researchers took the opposite approach. They prompt the model to generate a step-by-step program entirely in Python code, and then embed the necessary natural language inside the program.
An NLEP is a problem-solving template with four steps. First, the model calls the necessary packages, or functions, it will need to solve the task. Step two involves importing natural language representations of the knowledge the task requires (like a list of U.S. presidents' birthdays). For step three, the model implements a function that calculates the answer. And for the final step, the model outputs the result as a line of natural language, with an automatic data visualization if needed.
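Applied to the presidents question above, a minimal sketch of what such a generated program might look like is shown below. The data list is abbreviated and illustrative, not the model's actual output, and the function name is hypothetical:

```python
# Step 1: Import the packages the task needs.
from datetime import date

# Step 2: Encode the task's knowledge as data with natural language labels.
# (Abbreviated, illustrative list; a real program would cover all presidents.)
presidents = [
    {"name": "Dwight D. Eisenhower", "born": date(1890, 10, 14), "first_elected": 1952},
    {"name": "John F. Kennedy",      "born": date(1917, 5, 29),  "first_elected": 1960},
    {"name": "Jimmy Carter",         "born": date(1924, 10, 1),  "first_elected": 1976},
]

# Step 3: Implement a function that computes the answer.
def presidents_born_on(weekday, elected_after):
    return [
        p["name"]
        for p in presidents
        if p["first_elected"] > elected_after
        and p["born"].strftime("%A") == weekday  # e.g., "Wednesday"
    ]

# Step 4: Output the result as a line of natural language.
matches = presidents_born_on("Wednesday", 1950)
print("U.S. presidents elected after 1950 who were born on a Wednesday:",
      ", ".join(matches))
```

Running this sketch prints Jimmy Carter, since the weekday check is done by Python's date arithmetic rather than by the language model's memory.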
"It is like a digital calculator that always gives you the correct computation result as long as the program is correct," Luo says.
The user can easily investigate the program and fix any errors in the code directly, rather than needing to rerun the entire model to troubleshoot.
The approach also offers greater efficiency than some other methods. If a user has many similar questions, they can generate one core program and then replace certain variables without needing to run the model repeatedly.
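Continuing the illustrative sketch above, a related question can be answered by calling the same core function with different values, with no new model call:

```python
# Reuse the core program with different variables; the model does not
# need to be rerun for this related question.
print(presidents_born_on("Tuesday", 1950))
```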
To prompt the model to generate an NLEP, the researchers give it an overall instruction to write a Python program, provide two NLEP examples (one involving math and one involving natural language), and one test question.
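Schematically, a prompt with that structure might be laid out as follows. The bracketed placeholders stand in for content not reproduced in this article; this is not the authors' actual prompt text:

```python
# Illustrative scaffold of the NLEP prompt structure described above.
NLEP_PROMPT = """\
Instruction: Write a step-by-step Python program to answer the question.

Example 1 (math task)
Question: [a math question]
Program: [a complete four-step NLEP]

Example 2 (natural language task)
Question: [a natural language question]
Program: [a complete four-step NLEP]

Question: [the user's test question]
Program:
"""
```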
"Usually, when people do this kind of few-shot prompting, they still have to design prompts for every task. We found that we can have one prompt for many tasks because it is not a prompt that teaches LLMs to solve one problem, but a prompt that teaches LLMs to solve many problems by writing a program," says Luo.
"Having language models reason with code unlocks many opportunities for tool use, output validation, more structured understanding of the model's capabilities and way of thinking, and more," says Leonid Karlinsky, principal scientist at the MIT-IBM Watson AI Lab.
“No magic here”
NLEPs achieved greater than 90 percent accuracy when prompting GPT-4 to solve a range of symbolic reasoning tasks, like tracking shuffled objects or playing a game of 24, as well as instruction-following and text classification tasks. The researchers found that NLEPs even exhibited 30 percent greater accuracy than task-specific prompting methods. The method also showed improvements over open-source LLMs.
Along with boosting the accuracy of large language models, NLEPs could also improve data privacy. Since NLEP programs are run locally, sensitive user data do not need to be sent to a company like OpenAI or Google to be processed by a model.
In addition, NLEPs can enable small language models to perform better without the need to retrain the model for a certain task, which can be a costly process.
"There is no magic here. We do not have a more expensive or fancier language model. All we do is use program generation instead of natural language generation, and we can make it perform significantly better," Luo says.
However, an NLEP relies on the program generation capability of the model, so the technique does not work as well for smaller models that have been trained on limited datasets. In the future, the researchers plan to study methods that could make smaller language models generate more effective NLEPs. In addition, they want to investigate the impact of prompt variations on NLEPs to enhance the robustness of the model's reasoning processes.
This research was supported, in part, by the Center for Perceptual and Interactive Intelligence of Hong Kong.