MIT researchers have identified significant examples of machine-learning models failing when applied to data other than what they were trained on, raising questions about the need to test every time a model is deployed in a new setting.
“We show that even when you train models on large amounts of data, and pick the best average model, in a new setting this ‘best model’ may actually be the worst model for 6 to 75 percent of the new data,” says Marzyeh Ghassemi, an associate professor in MIT’s Department of Electrical Engineering and Computer Science (EECS), a member of the Institute for Medical Engineering and Science, and a principal investigator at the Laboratory for Information and Decision Systems.
In a paper presented at the Neural Information Processing Systems (NeurIPS 2025) conference in December, the researchers show that models trained to effectively diagnose illness in chest X-rays at one hospital, for instance, may also appear effective at a different hospital, on average. The researchers’ performance assessment, however, revealed that some of the best-performing models at the first hospital were the worst-performing on up to 75 percent of patients at the second hospital, even though high average performance across all of the second hospital’s patients hides this failure.
Their findings demonstrate that although spurious correlations — a simple example of which is when a machine-learning system, not having “seen” many cows pictured on beaches, classifies a photo of a beach-going cow as an orca simply because of its background — are thought to be mitigated by just improving model performance on observed data, they in fact still occur and remain a risk to a model’s trustworthiness in new settings. In many instances — including areas examined by the researchers such as chest X-rays, cancer histopathology images, and hate speech detection — such spurious correlations are much harder to detect.
In the case of a medical diagnosis model trained on chest X-rays, for example, the model may have learned to correlate a specific but irrelevant marking on one hospital’s X-rays with a certain pathology. At another hospital, where the marking is not used, that pathology could be missed.
Previous research by Ghassemi’s group has shown that models can spuriously correlate such factors as age, gender, and race with medical findings. If, for instance, a model has been trained on more chest X-rays of older people who have pneumonia and hasn’t “seen” as many X-rays belonging to younger people, it may predict that only older patients have pneumonia.
“We want models to learn how to look at the anatomical features of the patient and then make a decision based on that,” says Olawale Salaudeen, an MIT postdoc and the lead author of the paper, “but really anything in the data that’s correlated with a decision can be used by the model. And those correlations might not be robust to changes in the environment, making the model’s predictions unreliable sources for decision-making.”
Spurious correlations contribute to the risks of biased decision-making. In the NeurIPS conference paper, the researchers showed, for example, that chest X-ray models that improved overall diagnosis performance actually performed worse on patients with pleural conditions or enlarged cardiomediastinum, meaning enlargement of the heart or central chest cavity.
Other authors of the paper included PhD students Haoran Zhang and Kumail Alhamoud, EECS Assistant Professor Sara Beery, and Ghassemi.
While previous work has generally accepted that models ordered best-to-worst by performance will preserve that order when applied in new settings — a phenomenon called accuracy-on-the-line — the researchers were able to demonstrate examples in which the best-performing models in one setting were the worst-performing in another.
Salaudeen devised an algorithm called OODSelect to find examples where accuracy-on-the-line breaks down. Essentially, he trained thousands of models using in-distribution data, meaning data from the first setting, and calculated their accuracy. Then he applied the models to the data from the second setting. When the models with the highest accuracy on the first-setting data were wrong on a large percentage of examples in the second setting, those examples identified the problem subsets, or subpopulations. Salaudeen also emphasizes the risks of relying on aggregate statistics for evaluation, which can obscure more granular and consequential details about model performance.
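As a rough illustration of that selection idea, here is a minimal Python sketch based only on the description above, not on the paper’s released OODSelect implementation; the function and variable names (oodselect_sketch, id_correct, ood_correct, top_k_models) are hypothetical.

```python
import numpy as np

def oodselect_sketch(id_correct, ood_correct, top_k_models=50):
    """Toy sketch of the selection idea described in the article
    (not the paper's released OODSelect code).

    id_correct:  (n_models, n_id_examples) boolean array -- whether each model
                 classified each first-setting (in-distribution) example correctly.
    ood_correct: (n_models, n_ood_examples) boolean array -- the same for
                 second-setting (out-of-distribution) examples.
    """
    # Rank models by average in-distribution accuracy and keep the top ones.
    id_acc = id_correct.mean(axis=1)
    best = np.argsort(id_acc)[::-1][:top_k_models]

    # How often the "best" in-distribution models miss each OOD example,
    # compared with how often all models miss it (plain difficulty).
    miss_best = 1.0 - ood_correct[best].mean(axis=0)
    miss_all = 1.0 - ood_correct.mean(axis=0)

    # Sort OOD examples by how disproportionately the top models fail on them;
    # the leading indices are candidate "problem subsets."
    return np.argsort(miss_best - miss_all)[::-1]
```

The gap between the two miss rates is included only so the sketch does not simply flag examples that are hard for every model, in the spirit of the separation described next.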
In the course of their work, the researchers separated out the most misclassified examples, so as not to conflate spurious correlations within a dataset with situations that are simply difficult to classify.
Alongside the NeurIPS paper, the researchers have released their code and some of the identified subsets for future work.
Once a hospital, or any organization employing machine learning, identifies subsets on which a model is performing poorly, that information can be used to improve the model for its particular task and setting. The researchers recommend that future work adopt OODSelect to highlight targets for evaluation and to design approaches that improve performance more consistently.
“We hope the released code and OODSelect subsets become a steppingstone,” the researchers write, “toward benchmarks and models that confront the adverse effects of spurious correlations.”
