Method prevents an AI model from being overconfident about incorrect answers

People use large language models for an enormous array of tasks, from translating an article to identifying financial fraud. But despite the incredible capabilities and flexibility of these models, they sometimes generate inaccurate responses.

On top of that problem, a model can be overconfident about incorrect answers or underconfident about correct ones, making it hard for a user to know when the model can be trusted.

Researchers typically calibrate a machine-learning model to ensure its level of confidence lines up with its accuracy. A well-calibrated model should have lower confidence about an incorrect prediction, and vice versa. But because large language models (LLMs) can be applied to a seemingly infinite collection of diverse tasks, traditional calibration methods are ineffective.
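Miscalibration is commonly quantified with a metric such as expected calibration error (ECE), which compares a model's average confidence to its actual accuracy within confidence bins. The snippet below is an illustration of that general idea, not a method from this work:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    # Bin predictions by confidence; a well-calibrated model's average
    # confidence in each bin should match its accuracy in that bin.
    conf = np.asarray(confidences, dtype=float)
    corr = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            # weight each bin's confidence/accuracy gap by its size
            ece += mask.mean() * abs(conf[mask].mean() - corr[mask].mean())
    return ece

# 90% confident and 90% accurate: well calibrated, ECE = 0
print(expected_calibration_error([0.9] * 10, [1] * 9 + [0]))   # 0.0
# 90% confident but only 50% accurate: overconfident, ECE = 0.4
print(expected_calibration_error([0.9] * 10, [1] * 5 + [0] * 5))
```

A model can be perfectly accurate and still badly calibrated (or vice versa), which is why confidence needs its own measurement.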

Now, researchers from MIT and the MIT-IBM Watson AI Lab have introduced a calibration method tailored to large language models. Their method, called Thermometer, involves building a smaller, auxiliary model that runs on top of a large language model to calibrate it.

Thermometer is more efficient than other approaches, requiring less power-hungry computation, while preserving the model's accuracy and enabling it to produce better-calibrated responses on tasks it has not seen before.

By enabling efficient calibration of an LLM for a variety of tasks, Thermometer could help users pinpoint situations where a model is overconfident about false predictions, ultimately preventing them from deploying that model in a situation where it may fail.

“With Thermometer, we want to provide the user with a clear signal to tell them whether a model’s response is accurate or inaccurate, in a way that reflects the model’s uncertainty, so they know if that model is reliable,” says Maohao Shen, an electrical engineering and computer science (EECS) graduate student and lead author of a paper on Thermometer.

Shen is joined on the paper by Gregory Wornell, the Sumitomo Professor of Engineering who leads the Signals, Information, and Algorithms Laboratory in the Research Laboratory of Electronics and is a member of the MIT-IBM Watson AI Lab; senior author Soumya Ghosh, a research staff member in the MIT-IBM Watson AI Lab; as well as others at MIT and the MIT-IBM Watson AI Lab. The research was recently presented at the International Conference on Machine Learning.

Universal calibration

Since traditional machine-learning models are typically designed to perform a single task, calibrating them usually involves a task-specific method. On the other hand, because LLMs have the flexibility to perform many tasks, using a traditional method to calibrate the model for one task might hurt its performance on another.

Calibrating an LLM often involves sampling from the model multiple times to obtain different predictions and then aggregating these predictions to obtain better-calibrated confidence. However, because these models have billions of parameters, the computational costs of such approaches quickly add up.

“In a way, large language models are universal because they can handle various tasks. So, we need a universal calibration method that can also handle many different tasks,” says Shen.

With Thermometer, the researchers developed a versatile technique that leverages a classical calibration method called temperature scaling to efficiently calibrate an LLM for a new task.

In this context, a “temperature” is a scaling parameter used to adjust a model’s confidence so it is aligned with its prediction accuracy. Traditionally, one determines the right temperature using a labeled validation dataset of task-specific examples.
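Temperature scaling itself is simple to sketch. In this illustrative snippet (not code from the paper), dividing the logits by a temperature greater than 1 softens an overconfident probability distribution without changing which answer ranks highest:

```python
import numpy as np

def softmax(logits, temperature=1.0):
    # Scale logits by 1/T before normalizing: T > 1 softens confidence,
    # T < 1 sharpens it, and T = 1 leaves the distribution unchanged.
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = [4.0, 1.0, 0.5]
print(softmax(logits).max())                   # overconfident: ~0.93
print(softmax(logits, temperature=2.0).max())  # softened:      ~0.72
```

Because scaling by a positive temperature preserves the ordering of the logits, the model's predicted answer, and hence its accuracy, is unchanged; only the confidence attached to it moves.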

Since LLMs are often applied to new tasks, labeled datasets can be nearly impossible to acquire. For instance, a user who wants to deploy an LLM to answer customer questions about a new product likely does not have a dataset containing such questions and answers.

Instead of using a labeled dataset, the researchers train an auxiliary model that runs on top of the LLM to automatically predict the temperature needed to calibrate it for this new task.

They use labeled datasets from a few representative tasks to train the Thermometer model, but once it has been trained, it can generalize to new tasks in the same category without the need for additional labeled data.

A Thermometer model trained on a collection of multiple-choice question datasets, perhaps including one with algebra questions and one with medical questions, could be used to calibrate an LLM that will answer questions about geometry or biology, for instance.

“The aspirational goal is for it to work on any task, but we are not quite there yet,” Ghosh says.

The Thermometer model only needs to access a small part of the LLM’s inner workings to predict the right temperature that will calibrate its predictions for data points of a specific task.
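Very loosely, the idea can be pictured as a small network that maps features drawn from the LLM's internals to a single positive temperature. The sketch below is a toy with random stand-in weights and a hypothetical `predict_temperature` helper, not the authors' architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in weights for a tiny auxiliary "temperature predictor".
# The real Thermometer model is trained on labeled data from a few
# representative tasks; these random values are for illustration only.
W = 0.1 * rng.normal(size=16)

def predict_temperature(features):
    raw = float(features @ W)
    # softplus (plus a floor) keeps the temperature strictly positive
    return np.log1p(np.exp(raw)) + 0.5

features = rng.normal(size=16)       # stand-in for pooled LLM features
T = predict_temperature(features)    # always > 0.5 by construction
calibrated_logits = np.array([4.0, 1.0, 0.5]) / T  # rescale, then softmax
```

The key property this sketch illustrates is that the temperature is predicted from the model's internal state rather than fitted on a labeled validation set, which is what lets the approach work on tasks with no labeled data.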

An efficient approach

Importantly, the technique doesn’t require multiple training runs and only slightly slows the LLM. Plus, since temperature scaling doesn’t alter a model’s predictions, Thermometer preserves its accuracy.

When they compared Thermometer to several baselines on multiple tasks, it consistently produced better-calibrated uncertainty measures while requiring much less computation.

“As long as we train a Thermometer model on a sufficiently large number of tasks, it should be able to generalize well to any new task; just like a large language model, it is also a universal model,” Shen adds.

The researchers also found that if they train a Thermometer model for a smaller LLM, it can be directly applied to calibrate a larger LLM within the same family.

In the future, they want to adapt Thermometer to more complex text-generation tasks and apply the technique to even larger LLMs. The researchers also hope to quantify the diversity and number of labeled datasets one would need to train a Thermometer model so it can generalize to a new task.

This research was funded, in part, by the MIT-IBM Watson AI Lab.
