Large language models (LLMs) can generate credible but inaccurate responses, so researchers have developed uncertainty quantification methods to examine the reliability of predictions. One popular method involves submitting the same prompt multiple times to see whether the model generates the same answer.
But this method measures self-confidence, and even the most impressive LLM can be confidently wrong. Overconfidence can mislead users about the accuracy of a prediction, which could have devastating consequences in high-stakes settings like health care or finance.
To address this shortcoming, MIT researchers introduced a new method for measuring a different form of uncertainty that more reliably identifies confident but incorrect LLM responses.
Their method involves comparing a target model’s response to responses from a group of comparable LLMs. They found that measuring cross-model disagreement more accurately captures this kind of uncertainty than traditional approaches.
They combined their approach with a measure of LLM self-consistency to create a total uncertainty metric, and evaluated it on 10 realistic tasks, such as question-answering and math reasoning. This total uncertainty metric consistently outperformed other measures and was better at identifying unreliable predictions.
“Self-consistency is being used in a variety of different approaches for uncertainty quantification, but if your estimate of uncertainty relies only on a single model’s output, it is not necessarily trustworthy. We went back to the beginning to understand the limitations of current approaches and used those as a starting point to design a complementary method that can empirically improve the results,” says Kimia Hamidieh, an electrical engineering and computer science (EECS) graduate student at MIT and lead author of a paper on this technique.
She is joined on the paper by Veronika Thost, a research scientist at the MIT-IBM Watson AI Lab; Walter Gerych, a former MIT postdoc who is now an assistant professor at Worcester Polytechnic Institute; Mikhail Yurochkin, a staff research scientist at the MIT-IBM Watson AI Lab; and senior author Marzyeh Ghassemi, an associate professor in EECS and a member of the Institute for Medical Engineering and Science and the Laboratory for Information and Decision Systems.
Understanding overconfidence
Many popular methods for uncertainty quantification involve asking a model for a confidence score or testing the consistency of its responses to the same prompt. These methods estimate aleatoric uncertainty, or how internally confident a model is in its own prediction.
However, LLMs can be confident when they are completely wrong. Research has shown that epistemic uncertainty, or uncertainty about whether one is using the right model, can be a better way to assess true uncertainty when a model is overconfident.
The MIT researchers estimate epistemic uncertainty by measuring disagreement across a similar group of LLMs.
“If I ask ChatGPT the same question multiple times and it gives me the same answer over and over again, that doesn’t mean the answer is necessarily correct. If I switch to Claude or Gemini and ask them the same question, and I get a different answer, that’s going to give me a sense of the epistemic uncertainty,” Hamidieh explains.
Epistemic uncertainty attempts to capture how far a target model diverges from the ideal model for that task. But because it is impossible to construct a perfect model, researchers use surrogates or approximations that often rely on faulty assumptions.
To improve uncertainty quantification, the MIT researchers needed a more accurate way to estimate epistemic uncertainty.
An ensemble approach
The method they developed involves measuring the divergence between the target model and a small ensemble of models with similar size and architecture. They found that comparing semantic similarity, or how closely the meanings of the responses match, could provide a better estimate of epistemic uncertainty.
To achieve the most accurate estimate, the researchers needed a set of LLMs that covered diverse responses, weren’t too similar to the target model, and were weighted based on credibility.
“We found that the simplest way to satisfy all these properties is to take models that are trained by different companies. We tried many different approaches that were more complex, but this very simple approach ended up working best,” Hamidieh says.
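The idea of scoring cross-model disagreement can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the function names are placeholders, and word-overlap (Jaccard) stands in for the semantic-similarity measure, which in practice would compare meanings rather than surface words.

```python
def semantic_similarity(a: str, b: str) -> float:
    # Toy stand-in for semantic similarity: word-overlap (Jaccard).
    # A real system would compare meanings, e.g. with an embedding model.
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 1.0

def epistemic_uncertainty(target_answer: str, ensemble_answers: list[str]) -> float:
    # Average dissimilarity between the target model's answer and the
    # answers of an ensemble of comparable LLMs; higher means more
    # cross-model disagreement, hence more epistemic uncertainty.
    dissimilarities = [1.0 - semantic_similarity(target_answer, a)
                       for a in ensemble_answers]
    return sum(dissimilarities) / len(dissimilarities)
```

For example, if every ensemble model gives the same answer as the target model, the score is 0; if all of them disagree completely, it approaches 1.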
Once they had developed this method for estimating epistemic uncertainty, they combined it with a standard approach that measures aleatoric uncertainty. This total uncertainty metric (TU) offered the most accurate reflection of whether a model’s confidence level is trustworthy.
“Uncertainty depends on the uncertainty of the given prompt as well as how close our model is to the optimal model. This means that summing up these two uncertainty metrics is going to give us the best estimate,” Hamidieh says.
TU could more effectively identify situations where an LLM is hallucinating, since epistemic uncertainty can flag confidently wrong outputs that aleatoric uncertainty might miss. It could also enable researchers to reinforce an LLM’s confidently correct answers during training, which may improve performance.
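The additive combination described in the quote above can be sketched as follows. This is a hedged sketch under assumed names: the self-consistency term here is a simple majority-vote disagreement rate over repeated samples from the target model, which is one common way to approximate aleatoric uncertainty, not necessarily the one the authors used.

```python
from collections import Counter

def aleatoric_uncertainty(repeated_answers: list[str]) -> float:
    # Self-consistency term: fraction of the target model's repeated
    # answers to the same prompt that disagree with its majority answer.
    majority_count = Counter(repeated_answers).most_common(1)[0][1]
    return 1.0 - majority_count / len(repeated_answers)

def total_uncertainty(aleatoric: float, epistemic: float) -> float:
    # Total uncertainty (TU) simply sums the within-model term and the
    # cross-model term, mirroring the additive combination above.
    return aleatoric + epistemic
```

A model that answers identically every time contributes zero aleatoric uncertainty, so any remaining TU comes entirely from disagreement with the ensemble.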
They tested TU using multiple LLMs on 10 common tasks, such as question-answering, summarization, translation, and math reasoning. Their method more effectively identified unreliable predictions than either measure on its own.
Measuring total uncertainty often required fewer queries than calculating aleatoric uncertainty, which could reduce computational costs and save energy.
Their experiments also revealed that epistemic uncertainty is most effective on tasks with a single correct answer, like factual question-answering, but may underperform on more open-ended tasks.
In the future, the researchers could adapt their technique to improve its performance on open-ended queries. They may also build on this work by exploring other types of aleatoric uncertainty.
This work is funded, in part, by the MIT-IBM Watson AI Lab.
