
A Comprehensive Overview of Regression Evaluation Metrics


Image created by the author using icons from Icons8

An in-depth reference on commonly used regression evaluation metrics and their practical applications across various scenarios

As a data scientist, evaluating the performance of machine learning models is a crucial aspect of your work. To do this effectively, you have a wide selection of statistical metrics at your disposal, each with its own unique strengths and weaknesses. By developing a solid understanding of these metrics, you are not only better equipped to choose the best one for optimizing your model, but also to explain your choice and its implications to business stakeholders.

In this article, I focus on metrics used to evaluate regression problems, that is, problems of predicting numeric values — such as the price of a house or a forecast of a company's sales for the next month. Since regression analysis is considered to be the foundation of data science, it is important to understand its nuances.

Residuals are the building blocks of the majority of these metrics. In simple terms, a residual is the difference between the actual value and the predicted one.

residual = actual - prediction

The following figure presents the relationship between a target variable (y) and a single feature (x). The blue dots represent observations. The red line is the fit of a machine learning model, in this case, a linear regression. The orange lines represent the differences between the observed values and the predictions for those observations. As such, residuals can be calculated for every observation in the dataset, be it the training or the test set.

Figure 1. Example of residuals in a linear model with one feature
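As a quick illustration, here is a minimal sketch (using NumPy and made-up numbers) of how residuals are computed, one per observation:

import numpy as np

# Hypothetical actual values and model predictions
actuals = np.array([100.0, 120.0, 80.0, 95.0])
predictions = np.array([110.0, 115.0, 78.0, 99.0])

# A residual is the actual value minus the prediction, per observation
residuals = actuals - predictions
print(residuals)  # [-10.   5.   2.  -4.]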

This section discusses some of the most popular regression evaluation metrics that can help you assess the effectiveness of your model.

Bias

The simplest error measure would be the sum of residuals, sometimes known as bias. Because residuals can be both positive (the prediction is smaller than the actual value) and negative (the prediction is larger than the actual value), bias generally tells us whether our predictions were higher or lower than the actuals.

However, because residuals of opposing signs offset one another, we can obtain a model that generates predictions with a very low bias while not being accurate at all.

Alternatively, we can calculate the average residual, or mean bias error (MBE).
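The sketch below (again with made-up numbers) shows how opposite-signed residuals cancel out, so that both the bias and the MBE can be close to zero even when every single prediction is off:

import numpy as np

actuals = np.array([100.0, 100.0, 100.0, 100.0])
predictions = np.array([80.0, 120.0, 90.0, 110.0])  # misses of +20, -20, +10, -10

residuals = actuals - predictions

bias = residuals.sum()   # sum of residuals
mbe = residuals.mean()   # mean bias error (average residual)

print(bias, mbe)  # 0.0 0.0 -> the positive and negative errors offset each other,
                  # even though every prediction is off by 10 or 20 units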

R-squared

The next metric is probably the first one you encounter while learning about regression models, especially if that happens during statistics or econometrics classes. R-squared (R²), also known as the coefficient of determination, represents the proportion of variance explained by a model. To be more precise, R² corresponds to the degree to which the variance in the dependent variable (the target) can be explained by the independent variables (the features).

The following formula is used to calculate R²:

R² = 1 - RSS / TSS

Where:

  • RSS is the residual sum of squares, that is, the sum of squared residuals. This value captures the prediction error of a model.

  • TSS is the total sum of squares, that is, the sum of squared differences between the actual values and their mean. This value captures the variance of the target around the simple mean model.

Effectively, we are comparing the fit of our model (represented by the red line in Figure 2) to that of a simple mean model (represented by the green line).

Figure 2. Comparing the fit of a linear model to a simple mean benchmark

Knowing what the components of R² stand for, we can see that RSS/TSS represents the fraction of the total variance in the target that our model was not able to explain.

There are quite a few additional points to keep in mind when working with R².

First of all, R² is a relative metric, that is, it can be used for comparison with other models trained on the same dataset. A higher value indicates a better fit.

R² can also be used to get a rough estimate of how the model performs in general. However, we should be careful when using R² for such evaluations:

  • First, different fields (social sciences, biology, finance, and more) consider different values of R² as good or bad.

A potential drawback of R² is that it assumes every feature helps in explaining the variation in the target, which might not always be the case. As a result, if we keep adding features to a linear model estimated using ordinary least squares (OLS), the value of R² might increase or remain unchanged, but it never decreases.

Why? By design, OLS estimation minimizes the RSS. Suppose a model with an additional feature does not improve upon the R² of the original model. In that case, the OLS estimation sets that feature's coefficient to zero (or some statistically insignificant value). In turn, this effectively brings us back to the initial model. In the worst-case scenario, we get the score of our starting point.

A solution to the problem mentioned in the previous point is the adjusted R², which additionally penalizes adding features that are not useful for predicting the target. The value of the adjusted R² decreases if the increase in R² caused by adding new features is not significant enough.

As the last point, we left the often misunderstood issue of R²'s range of values. If a linear model is fitted using OLS, the range of R² is 0 to 1. That is because when using OLS estimation (which minimizes the RSS), the general property RSS ≤ TSS holds. In the worst-case scenario, OLS estimation would result in the mean model. In that case, RSS would be equal to TSS and the minimum value of R² would be 0. On the other hand, the best case would be RSS = 0 and R² = 1.

In the case of non-linear models, it is possible for R² to be negative. Because the fitting procedure of such models is not based on iteratively minimizing the RSS, the fitted model can have an RSS greater than the TSS. In other words, the model's predictions fit the data worse than the simple mean model. For more information, see When is R squared negative?

Bonus: Using R², we can evaluate how much better our model fits the data compared to the simple mean model. We can think of a positive R² value in terms of improving upon the performance of a baseline model — something along the lines of a skill score. For example, an R² of 40% indicates that our model has reduced the mean squared error by 40% compared to the baseline mean model.
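To make that skill-score interpretation concrete, here is a minimal sketch (hypothetical numbers) that computes R² from RSS and TSS, compares it to scikit-learn's r2_score, and shows that it equals the relative reduction in MSE versus the mean model:

import numpy as np
from sklearn.metrics import r2_score

# Hypothetical actuals and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0, 11.5])

rss = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
tss = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares

r2_manual = 1 - rss / tss
print(r2_manual, r2_score(y_true, y_pred))   # both 0.95625

# Skill-score view: R² is the relative reduction in MSE vs. the simple mean model
mse_model = np.mean((y_true - y_pred) ** 2)
mse_baseline = np.mean((y_true - y_true.mean()) ** 2)
print(1 - mse_model / mse_baseline)          # same value as above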

Mean squared error

Mean squared error (MSE) is one of the most popular evaluation metrics. As shown in the following formula, MSE is closely related to the residual sum of squares. The difference is that we are now interested in the average error instead of the total error.

MSE = (1/N) * Σ (actualᵢ - predictionᵢ)²

Here are some points to keep in mind when working with MSE:

  • MSE uses the mean (instead of the sum) to keep the metric independent of the dataset size.
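Below is a minimal sketch (hypothetical values) showing that the manual calculation matches scikit-learn's mean_squared_error, and that the squaring makes a single large miss dominate the score:

import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([100.0, 100.0, 100.0])
y_pred = np.array([90.0, 110.0, 130.0])

mse_manual = np.mean((y_true - y_pred) ** 2)
mse_sklearn = mean_squared_error(y_true, y_pred)
print(mse_manual, mse_sklearn)  # 366.67 366.67 -> the single 30-unit miss contributes
                                # 900 of the 1100 total squared error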

Root mean squared error

Root mean squared error (RMSE) is closely related to MSE, as it is simply the square root of the latter. By taking the square root, we bring the metric back to the scale of the target variable, so it is easier to interpret and understand. However, one fact that is often overlooked is that although RMSE is on the same scale as the target, an RMSE of 10 does not actually mean we are off by 10 units on average.

Aside from the scale, RMSE has the same properties as MSE. As a matter of fact, optimizing for RMSE while training a model will result in the same model as optimizing for MSE.
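The following sketch (same hypothetical values as above) illustrates the earlier point: the RMSE of about 19 is larger than the average absolute miss of about 16.7:

import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([100.0, 100.0, 100.0])
y_pred = np.array([90.0, 110.0, 130.0])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print(rmse)  # ~19.15 -> back on the scale of the target, but not the "average miss":
             # the mean absolute error here is (10 + 10 + 30) / 3 = 16.67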

Mean absolute error

The formula to calculate mean absolute error (MAE) is similar to the MSE formula. We simply need to replace the square with the absolute value.

MAE = (1/N) * Σ |actualᵢ - predictionᵢ|

Characteristics of MAE include the following:

  • Due to the lack of squaring, the metric is expressed on the same scale as the target variable, making it easier to interpret.
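A minimal sketch (hypothetical values), using scikit-learn's mean_absolute_error alongside the manual calculation:

import numpy as np
from sklearn.metrics import mean_absolute_error

y_true = np.array([100.0, 100.0, 100.0])
y_pred = np.array([90.0, 110.0, 130.0])

mae_manual = np.mean(np.abs(y_true - y_pred))
mae_sklearn = mean_absolute_error(y_true, y_pred)
print(mae_manual, mae_sklearn)  # 16.67 16.67 -> every unit of error carries the same weight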

Mean absolute percentage error

Mean absolute percentage error (MAPE) is one of the most popular metrics on the business side. That is because it is expressed as a percentage, which makes it much easier to understand and interpret.

MAPE = (1/N) * Σ |(actualᵢ - predictionᵢ) / actualᵢ|

To make the metric even easier to read, we can multiply it by 100% to express the number as a percentage.

Points to consider:

  • MAPE is expressed as a percentage, which makes it a scale-independent metric. It can be used to compare predictions on different scales.
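A minimal sketch (hypothetical values); note that scikit-learn's mean_absolute_percentage_error returns a fraction, so we multiply by 100 to express it as a percentage:

import numpy as np
from sklearn.metrics import mean_absolute_percentage_error

y_true = np.array([100.0, 100.0, 50.0])
y_pred = np.array([80.0, 120.0, 70.0])

mape_manual = np.mean(np.abs((y_true - y_pred) / y_true))
mape_sklearn = mean_absolute_percentage_error(y_true, y_pred)  # returned as a fraction
print(mape_manual * 100, mape_sklearn * 100)  # 26.67 26.67 -> the 20-unit miss on the
                                              # smallest actual (50) weighs twice as much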

Symmetric mean absolute percentage error

While discussing MAPE, I mentioned that one of its potential drawbacks is its asymmetry (not limiting the predictions that are higher than the actuals). Symmetric mean absolute percentage error (sMAPE) is a related metric that attempts to fix that issue.

Points to consider when using sMAPE:

  • It is expressed as a bounded percentage, that is, it has lower (0%) and upper (200%) bounds.
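scikit-learn does not (at the time of writing) provide an sMAPE function, so the sketch below implements one common definition, consistent with the 0%–200% bounds mentioned above (hypothetical values):

import numpy as np

def smape(y_true, y_pred):
    # One common definition of sMAPE, bounded between 0% and 200%
    return np.mean(2 * np.abs(y_true - y_pred) / (np.abs(y_true) + np.abs(y_pred))) * 100

y_true = np.array([100.0, 100.0, 50.0])
y_pred = np.array([80.0, 120.0, 70.0])

print(smape(y_true, y_pred))  # ~24.6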

I have not described all the possible regression evaluation metrics, as there are dozens (if not hundreds). Here are a few more metrics to consider while evaluating models:

  • Mean squared log error (MSLE) is a cousin of MSE, with the difference being that we take the log of the actuals and the predictions before calculating the squared error. Taking the logs of the two elements of the subtraction results in measuring the ratio, or relative difference, between the actual value and the prediction, while ignoring the scale of the data. That is why MSLE reduces the impact of outliers on the final score. MSLE also puts a heavier penalty on underforecasting (see the sketch below).
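The sketch below (hypothetical values) uses scikit-learn's mean_squared_log_error, which applies log(1 + x) and therefore requires non-negative values:

import numpy as np
from sklearn.metrics import mean_squared_log_error

y_true = np.array([100.0, 1000.0])
y_pred = np.array([150.0, 1500.0])

msle = mean_squared_log_error(y_true, y_pred)
print(msle)  # ~0.163 -> both observations are off by the same ratio (1.5x),
             # so they contribute almost equally despite very different absolute errors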

As with the majority of data science problems, there is no single best metric for evaluating the performance of a regression model. The metric chosen for a use case will depend on the data used to train the model, the business case we are trying to support, and so on. As a result, we might use a single metric for the training of a model (the metric optimized for), but when reporting to stakeholders, data scientists often present a collection of metrics.

While selecting the metrics, consider a few of the following questions:

  • Do you expect frequent outliers in the dataset? If so, how do you want to account for them?

I believe it is useful to explore the metrics on some toy cases to fully understand their nuances. While most of the metrics are available in the metrics module of scikit-learn, for this particular task, good old spreadsheets might be a more suitable tool.

The following example contains five observations. Table 1 shows the actual values, the predictions, and some of the values used to calculate most of the considered metrics.

Table 1. Example of calculating the performance metrics on five observations

The first three rows contain scenarios in which the absolute difference between the actual value and the prediction is 20. The first two rows show an overforecast and an underforecast of 20, with the same actual value. The third row shows an overforecast of 20, but with a smaller actual value. In those rows, it is easy to observe the particularities of MAPE and sMAPE.

Table 2. Performance metrics calculated using the values in Table 1

The fifth row in Table 1 contained a prediction 8x smaller than the actual value. For the sake of experimentation, replace that prediction with one that is 8x larger than the actual value. Table 3 contains the revised observations.

Table 3. Observations after modifying a single prediction to be more extreme
Table 4. Performance metrics calculated using the modified values in Table 3

Basically, all the metrics exploded in size, which is intuitively consistent. That was not the case for sMAPE, which stayed the same between the two cases.

I highly encourage you to play around with such toy examples to more fully understand how different kinds of scenarios impact the evaluation metrics. This experimentation should make you more comfortable deciding which metric to optimize for and understanding the consequences of such a choice. These exercises may also help you explain your decisions to stakeholders.
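If you prefer code over spreadsheets, here is a minimal pandas sketch of such an experiment. The values are hypothetical, laid out in the same spirit as Table 1 rather than using the exact numbers above:

import numpy as np
import pandas as pd

# Hypothetical observations: rows 1-3 are off by 20, row 5 is off by a factor of 8
df = pd.DataFrame({
    "actual":     [100.0, 100.0, 40.0, 500.0, 800.0],
    "prediction": [120.0,  80.0, 60.0, 510.0, 100.0],
})

df["residual"] = df["actual"] - df["prediction"]
df["abs_error"] = df["residual"].abs()
df["squared_error"] = df["residual"] ** 2
df["ape"] = df["abs_error"] / df["actual"]
df["sape"] = 2 * df["abs_error"] / (df["actual"].abs() + df["prediction"].abs())

summary = pd.Series({
    "MBE": df["residual"].mean(),
    "MAE": df["abs_error"].mean(),
    "RMSE": np.sqrt(df["squared_error"].mean()),
    "MAPE [%]": df["ape"].mean() * 100,
    "sMAPE [%]": df["sape"].mean() * 100,
})
print(df.round(2))
print(summary.round(2))

Swapping the last prediction for one that overshoots by the same factor (for example, 6400.0 instead of 100.0) shows how strongly each metric reacts to the change.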

In this post, I covered some of the most popular regression evaluation metrics. As explained, each one comes with its own set of advantages and disadvantages, and it is up to the data scientist to understand them and make a choice about which one (or more) is suitable for a particular use case. The metrics mentioned can be applied not only to pure regression tasks — such as predicting salary based on a set of features connected to experience — but also to the domain of time series forecasting.

As always, any constructive feedback is more than welcome. You can reach out to me on Twitter or in the comments.

Liked the article? Become a Medium member to continue learning by reading without limits. If you use this link to become a member, you will support me at no extra cost to you. Thanks in advance and see you around!

