This “smart coach” helps LLMs switch between text and code


Large language models (LLMs) excel at using textual reasoning to understand the context of a document and provide a logical answer about its contents. But these same LLMs often struggle to correctly answer even the simplest math problems.

Textual reasoning is generally a less-than-ideal way to work through computational or algorithmic tasks. While some LLMs can generate code, such as Python, to handle symbolic queries, the models don’t always know when to use code, or what kind of code would work best.

LLMs, it seems, may need a coach to steer them toward the best technique.

Enter CodeSteer, a smart assistant developed by MIT researchers that guides an LLM to switch between code and text generation until it correctly answers a query.

CodeSteer, itself a smaller LLM, automatically generates a series of prompts to iteratively steer a larger LLM. It reviews the model’s current and previous answers after each round and offers guidance for how it can fix or refine that solution, until it deems the answer correct.

The researchers found that augmenting a larger LLM with CodeSteer boosted its accuracy on symbolic tasks, like multiplying numbers, playing Sudoku, and stacking blocks, by more than 30 percent. It also enabled less sophisticated models to outperform more advanced models with enhanced reasoning skills.

This advance could improve the problem-solving capabilities of LLMs for complex tasks that are especially difficult to solve with textual reasoning alone, such as generating paths for robots in uncertain environments or scheduling shipments in a global supply chain.

“There’s a race to develop better and better models that are capable of doing everything, but we’ve taken a complementary approach. Researchers have spent years developing effective technologies and tools to tackle problems in many domains. We want to enable LLMs to select the right tools and methods, and use others’ expertise to enhance their own capabilities,” says Chuchu Fan, an associate professor of aeronautics and astronautics (AeroAstro) and principal investigator in the MIT Laboratory for Information and Decision Systems (LIDS).

Fan, the senior author of the study, is joined on a paper about the work by LIDS graduate student Yongchao Chen; AeroAstro graduate student Yilun Hao; University of Illinois at Urbana-Champaign graduate student Yueying Liu; and MIT-IBM Watson AI Lab Research Scientist Yang Zhang. The research will be presented at the International Conference on Machine Learning.

An LLM “trainer”  

Ask an LLM which number is larger, 9.11 or 9.9, and it will often give the wrong answer by using textual reasoning. But ask it to use code to answer the same question, and it can generate and execute a Python script to compare the two numbers, easily solving the problem.
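
In Python, that comparison is a one-line computation rather than a feat of reasoning, along the lines of this minimal sketch:

```python
# Compare 9.11 and 9.9 numerically instead of reasoning over the
# digits as text; the numeric comparison is unambiguous.
print(9.11 > 9.9)      # False: 9.11 is smaller than 9.9
print(max(9.11, 9.9))  # 9.9 is the larger number
```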

Initially trained to understand and predict human language, LLMs are more likely to answer queries using text, even when code would be more effective. And while they’ve learned to generate code through fine-tuning, these models often generate an incorrect or less efficient version of the code.

Rather than trying to retrain a powerful LLM like GPT-4 or Claude to improve these capabilities, the MIT researchers fine-tune a smaller, lightweight LLM to guide a larger model between text and code. Fine-tuning the smaller model doesn’t change the larger LLM, so there is no risk it would undermine the larger model’s other abilities.

“We were also inspired by humans. In sports, a trainer may not be better than the star athlete on the team, but the trainer can still give helpful suggestions to guide the athlete. This steering method works for LLMs, too,” Chen says.

This trainer, CodeSteer, works in tandem with the larger LLM. It first reviews a query and determines whether text or code is suitable for the problem, and which sort of code would be best.

Then it generates a prompt for the larger LLM, telling it to use a coding method or textual reasoning to answer the query. The larger model follows this prompt to answer the query and sends the result back to CodeSteer, which reviews it.

If the answer is not correct, CodeSteer will continue prompting the LLM to try different approaches that might fix the problem, such as incorporating a search algorithm or a constraint into its Python code, until the answer is correct.
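
In outline, the exchange is a review-and-retry loop. The sketch below is purely illustrative; the callables stand in for the steering model, the larger LLM, and the correctness check, and are not the paper’s actual interfaces:

```python
from typing import Callable

def steer(question: str,
          steering_model: Callable[[str, list], str],
          large_model: Callable[[str, str], str],
          accepts: Callable[[str, str], bool],
          max_rounds: int = 5) -> str:
    """Hypothetical sketch of an iterative steering loop in the spirit
    of CodeSteer; the callables are placeholders, not the authors'
    implementation."""
    history: list = []
    answer = ""
    for _ in range(max_rounds):
        # The small steering model picks text vs. code and phrases its
        # guidance based on the question and the attempts so far.
        guidance = steering_model(question, history)
        # The larger model answers the question under that guidance.
        answer = large_model(question, guidance)
        history.append((guidance, answer))
        # Stop once the steering side deems the answer correct;
        # otherwise loop and refine the guidance in the next round.
        if accepts(question, answer):
            break
    return answer
```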

“We found that oftentimes, the larger LLM will try to be lazy and use shorter, less efficient code that will not carry out the correct symbolic calculation. We’ve designed CodeSteer to avoid this phenomenon,” Chen says.

A symbolic checker evaluates the code’s complexity and sends a signal to CodeSteer if it is too simple or inefficient. The researchers also incorporate a self-answer checker into CodeSteer, which prompts the LLM to generate code that calculates the answer, to verify that it is correct.
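
A toy version of such a complexity signal could inspect the structure of the generated code; the threshold and heuristics below are invented for illustration and are not the checker the researchers describe:

```python
import ast

def looks_too_simple(code: str, min_nodes: int = 20) -> bool:
    """Crude, illustrative proxy for 'lazy' shortcut code: flag a
    snippet whose syntax tree is very small and contains no loops or
    function definitions. Threshold and heuristics are invented."""
    tree = ast.parse(code)
    nodes = list(ast.walk(tree))
    has_structure = any(
        isinstance(n, (ast.For, ast.While, ast.FunctionDef))
        for n in nodes
    )
    return len(nodes) < min_nodes and not has_structure
```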

Tackling complex tasks

When the researchers designed CodeSteer, they couldn’t find suitable symbolic datasets to fine-tune and test the model, since many existing benchmarks don’t indicate whether a certain query could be best solved with text or with code.

So, they gathered a corpus of 37 complex symbolic tasks, including spatial reasoning, mathematics, order reasoning, and optimization, and built their own dataset, called SymBench. They implemented a fine-tuning approach that leverages SymBench to maximize the performance of CodeSteer.

In their experiments, CodeSteer outperformed all nine baseline methods they evaluated and boosted average accuracy from 53.3 percent to 86.4 percent. It maintains similar performance even on unseen tasks, and across a variety of LLMs.

In addition, a general-purpose model augmented with CodeSteer can achieve higher accuracy than state-of-the-art models designed to excel at complex reasoning and planning, while requiring much less computation.

“Our method uses an LLM’s own capabilities. By augmenting an LLM with the ability to smartly use coding, we can take a model that is already very strong and improve its performance even more,” Chen says.

In the future, the researchers want to streamline CodeSteer to speed up its iterative prompting process. In addition, they are studying how to effectively fine-tune a unified model with the ability to switch between textual reasoning and code generation, rather than relying on a separate assistant.

“The authors present an elegant solution to the critical challenge of tool utilization in LLMs. This simple yet impactful method enables state-of-the-art LLMs to achieve significant performance improvements without requiring direct fine-tuning,” says Jinsung Yoon, a staff research scientist at Google Cloud AI, who was not involved with this work. “This research represents a substantial contribution that promises to significantly enhance the application of LLMs to a diverse range of tasks with which they currently struggle.”

“Their success in training a smaller, specialized model to strategically guide larger, advanced models is especially impactful,” adds Chi Wang, a senior staff scientist at Google DeepMind who was not involved with this work. “This intelligent collaboration among diverse AI ‘agents’ paves the way for more robust and versatile applications in complex real-world scenarios.”

This research is supported, in part, by the U.S. Office of Naval Research and the MIT-IBM Watson AI Lab.
