Reasoning skills of large language models are sometimes overestimated

When it comes to artificial intelligence, appearances can be deceiving. The mystery surrounding the inner workings of large language models (LLMs) stems from their vast size, complex training methods, hard-to-predict behaviors, and elusive interpretability.

Researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) recently peered into the proverbial magnifying glass to examine how LLMs fare on variations of different tasks, revealing intriguing insights into the interplay between memorization and reasoning skills. It turns out that their reasoning abilities are sometimes overestimated.

The study compared “default tasks,” the common tasks a model is trained and tested on, with “counterfactual scenarios,” hypothetical situations that deviate from the default conditions, which models like GPT-4 and Claude can usually be expected to handle. The researchers developed tests outside the models’ comfort zones by tweaking existing tasks instead of creating entirely new ones, using a variety of datasets and benchmarks specifically tailored to different aspects of the models’ capabilities, such as arithmetic, chess, evaluating code, and answering logical questions.

When users interact with language models, any arithmetic is usually in base-10, the number base familiar to the models. But observing that they do well on base-10 addition could give us a false impression that they have strong competency in addition generally. Logically, if they truly possess good addition skills, you’d expect reliably high performance across all number bases, just as with calculators or computers. Indeed, the research showed that these models are not as robust as many initially assume: their high performance is limited to common task variants, and they suffer a consistent and severe performance drop in the unfamiliar counterfactual scenarios, indicating a lack of generalizable addition ability.
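To make the base-10 versus counterfactual-base comparison concrete, here is a minimal Python sketch, not the paper’s actual evaluation harness, of how matched addition prompts could be generated and scored; the ask_model call is a hypothetical placeholder for whichever model API is being probed.

def to_base(n: int, base: int) -> str:
    """Render a non-negative integer in the given base (e.g., digits 0-8 for base 9)."""
    if n == 0:
        return "0"
    digits = []
    while n > 0:
        digits.append(str(n % base))
        n //= base
    return "".join(reversed(digits))

def addition_prompt(a: int, b: int, base: int) -> tuple[str, str]:
    """Return (prompt, expected_answer), with operands and answer written in `base`."""
    prompt = (f"You are doing addition in base-{base}. "
              f"What is {to_base(a, base)} + {to_base(b, base)}? Answer with digits only.")
    return prompt, to_base(a + b, base)

# The same underlying sum, posed in the familiar base and a counterfactual one.
# A model with a generalizable addition skill should answer both correctly; a model
# that has memorized base-10 patterns may only get the first one right.
for base in (10, 9):
    prompt, expected = addition_prompt(27, 58, base)
    print(f"base-{base}: {prompt}  | expected: {expected}")
    # answer = ask_model(prompt)  # hypothetical model call
    # print("correct:", answer.strip() == expected)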

The pattern held true for many other tasks, such as musical chord fingering, spatial reasoning, and even chess problems in which the starting positions of pieces were slightly altered. While human players would still be expected to determine the legality of moves in the altered scenarios (given enough time), the models struggled and could not perform better than random guessing, meaning they have limited ability to generalize to unfamiliar situations. Much of their performance on the standard tasks is likely due not to general task abilities but to overfitting on, or directly memorizing, what they have seen in their training data. A sketch of the kind of ground-truth legality check involved follows below.
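The ground-truth move-legality check that model answers would be compared against is straightforward to express in code. This sketch assumes the python-chess library and uses an illustrative swapped-knights-and-bishops starting position, not the paper’s actual test items.

import chess

# Counterfactual start: knights and bishops trade their usual home squares.
SWAPPED_START = "rbnqknbr/pppppppp/8/8/8/8/PPPPPPPP/RBNQKNBR w KQkq - 0 1"

def is_legal(fen: str, uci_move: str) -> bool:
    """Ground-truth legality of a move from the given position."""
    board = chess.Board(fen)
    return chess.Move.from_uci(uci_move) in board.legal_moves

# The knight move g1-f3 is legal from the normal starting position...
print(is_legal(chess.STARTING_FEN, "g1f3"))  # True
# ...but illegal here, because g1 now holds a bishop, which cannot jump to f3.
print(is_legal(SWAPPED_START, "g1f3"))       # False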

“We’ve uncovered an interesting aspect of large language models: they excel in familiar scenarios, almost like a well-worn path, but struggle when the terrain gets unfamiliar. This insight is crucial as we strive to enhance these models’ adaptability and broaden their application horizons,” says Zhaofeng Wu, an MIT PhD student in electrical engineering and computer science, CSAIL affiliate, and the lead author of a new paper about the research. “As AI is becoming increasingly ubiquitous in our society, it must reliably handle diverse scenarios, whether familiar or not. We hope these insights will one day inform the design of future LLMs with improved robustness.”

Despite the insights gained, there are, of course, limitations. The study’s focus on specific tasks and settings didn’t capture the full range of challenges the models could encounter in real-world applications, signaling the need for more diverse testing environments. Future work could expand the range of tasks and counterfactual conditions to uncover more potential weaknesses, which might mean more complex and less common scenarios. The team also wants to improve interpretability by developing methods to better understand the rationale behind the models’ decision-making.

“As language models scale up, understanding their training data becomes increasingly difficult even for open models, let alone proprietary ones,” says Hao Peng, assistant professor at the University of Illinois Urbana-Champaign. “The community remains puzzled about whether these models genuinely generalize to unseen tasks or only seemingly succeed by memorizing the training data. This paper makes important strides in addressing this question. It constructs a suite of carefully designed counterfactual evaluations, providing fresh insights into the capabilities of state-of-the-art LLMs. It reveals that their ability to solve unseen tasks is perhaps far more limited than many anticipate, and it has the potential to inspire future research toward identifying the failure modes of today’s models and developing better ones.”

Additional authors include Najoung Kim, who is a Boston University assistant professor and Google visiting researcher, and seven CSAIL affiliates: MIT electrical engineering and computer science (EECS) PhD students Linlu Qiu, Alexis Ross, Ekin Akyürek SM ’21, and Boyuan Chen; former postdoc and Apple AI/ML researcher Bailin Wang; and EECS assistant professors Jacob Andreas and Yoon Kim.

The team’s study was supported, in part, by the MIT–IBM Watson AI Lab, the MIT Quest for Intelligence, and the National Science Foundation. The team presented the work at the North American Chapter of the Association for Computational Linguistics (NAACL) last month.
