An empirical evaluation of whether ML models make more mistakes when making predictions on outliers
Outliers are individuals that are very different from the vast majority of the population. Traditionally, practitioners have had a certain mistrust of outliers, which is why ad-hoc measures such as removing them from the dataset are sometimes adopted.
However, when working with real data, outliers are the order of the day. Sometimes, they are even more important than other observations! Take, for example, people who are outliers because they are very high-paying customers: you don't want to discard them; on the contrary, you probably want to treat them with extra care.
An interesting, and rather unexplored, aspect of outliers is how they interact with ML models. My feeling is that data scientists assume that outliers harm the performance of their models. But this belief might be based on a preconception more than on real evidence.
Thus, the question I will try to answer in this article is the following:
Is an ML model more likely to make mistakes when making predictions on outliers?
Suppose we have a model that has been trained on these data points:
Now we receive new data points for which the model must make predictions.
Let’s consider two cases:
- the new data point is an outlier, i.e. different from most of the training observations.
- the new data point is "standard", i.e. it lies in an area that is quite "dense" with training points.
We would like to know whether, in general, the outlier is harder to predict than the standard observation.
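To make the setup concrete, here is a minimal sketch of the kind of experiment this question calls for. The toy dataset, the choice of `RandomForestRegressor`, and the use of the average distance to the 5 nearest training neighbors as an "outlier score" are assumptions made purely for illustration, not the actual setup used later in the article.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# Toy regression problem: the training set is concentrated around the origin,
# while the test set has a wider spread, so some test points are outliers.
X_train = rng.normal(loc=0.0, scale=1.0, size=(1000, 2))
y_train = X_train[:, 0] ** 2 + X_train[:, 1] + rng.normal(scale=0.1, size=1000)

X_test = rng.normal(loc=0.0, scale=2.0, size=(500, 2))
y_test = X_test[:, 0] ** 2 + X_test[:, 1] + rng.normal(scale=0.1, size=500)

# Train a model and compute its absolute error on each test point.
model = RandomForestRegressor(random_state=0).fit(X_train, y_train)
errors = np.abs(model.predict(X_test) - y_test)

# Score how "outlying" each test point is: the mean distance to its
# 5 nearest training neighbors (larger = farther from dense training areas).
nn = NearestNeighbors(n_neighbors=5).fit(X_train)
outlier_score = nn.kneighbors(X_test)[0].mean(axis=1)

# Compare the 10% most outlying test points against the rest.
is_outlier = outlier_score > np.quantile(outlier_score, 0.9)
print("Mean abs. error on outliers:       ", errors[is_outlier].mean())
print("Mean abs. error on standard points:", errors[~is_outlier].mean())
```

Any other model or outlier score could be plugged in here; the point is simply to compare prediction errors between the most outlying test points and the "standard" ones.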