How to assess a general-purpose AI model’s reliability before it’s deployed


Foundation models are massive deep-learning models that have been pretrained on an enormous amount of general-purpose, unlabeled data. They can be applied to a wide range of tasks, like generating images or answering customer questions.

But these models, which serve as the backbone for powerful artificial intelligence tools like ChatGPT and DALL-E, can offer up incorrect or misleading information. In a safety-critical situation, such as a pedestrian approaching a self-driving car, these mistakes could have serious consequences.

To help prevent such mistakes, researchers from MIT and the MIT-IBM Watson AI Lab developed a technique to estimate the reliability of foundation models before they are deployed to a specific task.

They do this by training a set of foundation models that are slightly different from one another. Then they use their algorithm to assess the consistency of the representations each model learns about the same test data point. If the representations are consistent, it means the model is reliable.

When they compared their technique to state-of-the-art baseline methods, it was better at capturing the reliability of foundation models on a variety of classification tasks.

Someone could use this technique to decide whether a model should be applied in a certain setting, without the need to test it on a real-world dataset. This could be especially useful when datasets may not be accessible due to privacy concerns, as in health care settings. In addition, the technique could be used to rank models based on reliability scores, enabling a user to select the best one for their task.

“All models can be wrong, but models that know when they are wrong are more useful. The problem of quantifying uncertainty or reliability is harder for these foundation models because their abstract representations are difficult to compare. Our method allows you to quantify how reliable a representation model is for any given input data,” says senior author Navid Azizan, the Esther and Harold E. Edgerton Assistant Professor in the MIT Department of Mechanical Engineering and the Institute for Data, Systems, and Society (IDSS), and a member of the Laboratory for Information and Decision Systems (LIDS).

He is joined on a paper about the work by lead author Young-Jin Park, a LIDS graduate student; Hao Wang, a research scientist at the MIT-IBM Watson AI Lab; and Shervin Ardeshir, a senior research scientist at Netflix. The paper will be presented at the Conference on Uncertainty in Artificial Intelligence.

Counting the consensus

Traditional machine-learning models are trained to perform a specific task. These models typically make a concrete prediction based on an input. For instance, the model might tell you whether a certain image contains a cat or a dog. In this case, assessing reliability could simply be a matter of looking at the final prediction to see if the model is right.

But foundation models are different. The model is pretrained using general data, in a setting where its creators don’t know all the downstream tasks it will be applied to. Users adapt it to their specific tasks after it has already been trained.

Unlike traditional machine-learning models, foundation models don’t give concrete outputs like “cat” or “dog” labels. Instead, they generate an abstract representation based on an input data point.

To assess the reliability of a foundation model, the researchers used an ensemble approach, training several models that share many properties but are slightly different from one another.

“Our idea is like counting the consensus. If all those foundation models are giving consistent representations for any data in our dataset, then we can say this model is reliable,” Park says.
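The article does not spell out how those slightly different models are produced; a common way is to train or fine-tune the same architecture with different random seeds. Below is a minimal sketch of that setup, using a made-up `load_encoder` helper and a random projection as a stand-in for a real foundation-model encoder:

```python
import numpy as np

# Hypothetical helper (not from the paper): returns an encoder built with a
# different random seed, standing in for a separately trained foundation model.
def load_encoder(seed: int):
    rng = np.random.default_rng(seed)
    projection = rng.normal(size=(768, 128))  # stand-in for a learned encoder

    def encode(x: np.ndarray) -> np.ndarray:
        z = x @ projection
        # Unit-normalize so each representation lives on a sphere.
        return z / np.linalg.norm(z, axis=-1, keepdims=True)

    return encode

# Build an ensemble of slightly different encoders.
encoders = [load_encoder(seed) for seed in range(5)]

# Each encoder maps the same input to its own abstract representation (a vector).
x = np.random.default_rng(0).normal(size=(1, 768))   # one test data point
representations = [enc(x) for enc in encoders]
print([r.shape for r in representations])            # five 128-dimensional vectors
```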

But they ran into a problem: How could they compare abstract representations?

“These models just output a vector, comprised of some numbers, so we can’t compare them easily,” he adds.

They solved this problem using an idea called neighborhood consistency.

For their approach, the researchers prepare a set of reliable reference points to test on the ensemble of models. Then, for each model, they investigate the reference points located near that model’s representation of the test point.

By looking at the consistency of neighboring points, they can estimate the reliability of the models.
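The article does not give the exact consistency measure, so the following is only a plausible sketch of the idea: score a test point by how much the sets of k nearest reference points agree across the ensemble. It assumes encoders with the same call signature as the hypothetical `load_encoder` sketch above.

```python
import numpy as np
from itertools import combinations

def neighbor_ids(test_vec: np.ndarray, reference_vecs: np.ndarray, k: int = 10) -> set:
    """Indices of the k reference points closest to the test point's representation."""
    dists = np.linalg.norm(reference_vecs - test_vec, axis=1)
    return set(np.argsort(dists)[:k])

def neighborhood_consistency(test_point, reference_points, encoders, k: int = 10) -> float:
    """Average overlap of nearest-reference-point sets across all pairs of models.

    A score near 1 means the ensemble agrees on which reference points sit near
    the test point's representation, which is read here as a sign of reliability.
    """
    neighbor_sets = []
    for enc in encoders:
        test_vec = enc(test_point)[0]      # representation of the test point
        ref_vecs = enc(reference_points)   # representations of the reference points
        neighbor_sets.append(neighbor_ids(test_vec, ref_vecs, k))

    overlaps = [len(a & b) / k for a, b in combinations(neighbor_sets, 2)]
    return float(np.mean(overlaps))

# Usage with the hypothetical encoders above and 200 reference inputs.
rng = np.random.default_rng(1)
reference_points = rng.normal(size=(200, 768))
score = neighborhood_consistency(x, reference_points, encoders, k=10)
print(f"neighborhood consistency: {score:.2f}")
```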

Aligning the representations

Foundation models map data points into what is known as a representation space. One way to think about this space is as a sphere. Each model maps similar data points to the same part of its sphere, so images of cats go in one place and images of dogs go in another.

But each model would map animals differently in its own sphere, so while cats may be grouped near the south pole of one sphere, another model could map cats somewhere in the northern hemisphere.

The researchers use the neighboring points like anchors to align those spheres so they can make the representations comparable. If a data point’s neighbors are consistent across multiple representations, then one should be confident about the reliability of the model’s output for that point.
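The article does not say how that anchoring is done mathematically. One standard way to align two representation spaces through shared anchor points (an illustration, not necessarily the authors’ procedure) is orthogonal Procrustes: find the rotation that best maps one model’s reference-point representations onto another’s, after which points from the two spaces can be compared directly.

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

def align(source_refs: np.ndarray, target_refs: np.ndarray) -> np.ndarray:
    """Rotation that best maps one model's anchor representations onto another's."""
    R, _ = orthogonal_procrustes(source_refs, target_refs)
    return R

# Example with two hypothetical models whose spaces differ by an unknown rotation.
rng = np.random.default_rng(0)
refs_model_a = rng.normal(size=(200, 128))
true_rotation = np.linalg.qr(rng.normal(size=(128, 128)))[0]   # random orthogonal matrix
refs_model_b = refs_model_a @ true_rotation

R = align(refs_model_a, refs_model_b)
# After alignment the two sets of anchors nearly coincide, so a test point's
# representation in model A can be compared with its representation in model B.
print(np.allclose(refs_model_a @ R, refs_model_b, atol=1e-6))  # True
```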

When they tested this approach on a wide range of classification tasks, they found that it was much more consistent than baselines. Plus, it wasn’t tripped up by challenging test points that caused other methods to fail.

Furthermore, their approach can be used to assess reliability for any input data, so one could evaluate how well a model works for a particular type of individual, such as a patient with certain characteristics.

“Even if the models all have average performance overall, from an individual point of view, you’d prefer the one that works best for that individual,” Wang says.

However, one limitation comes from the fact that they must train an ensemble of large foundation models, which is computationally expensive. In the future, they plan to find more efficient ways to build multiple models, perhaps by using small perturbations of a single model.

This work is funded, in part, by the MIT-IBM Watson AI Lab, MathWorks, and Amazon.
