The full code for this example is at the bottom of this post.
Multiple regression is used when your response variable Y is continuous and you have at least k covariates, or independent variables, that are linearly correlated with it. The data are of the form:
(Y₁, X₁), … ,(Yᵢ, Xᵢ), … ,(Yₙ, Xₙ)
where Xᵢ = (Xᵢ₁, …, Xᵢₖ) is the vector of k covariate values for the i-th observation, and n is the number of observations.
Understanding the Data
To make this concrete, imagine the following scenario:
You enjoy running and track your performance by recording the distance you run every day. Over 100 consecutive days, you collect four pieces of information:
- The distance you run,
- The number of hours you spent running,
- The number of hours you slept last night,
- And the number of hours you worked.
On the 101st day, you recorded everything except the distance you ran. You would like to estimate that missing value using the data you do have: the number of hours you spent running, the number of hours you slept the night before, and the number of hours you worked that day.
To do this, you can rely on the data from the previous 100 days, which take the form:
(Y₁, X₁), … , (Yᵢ, Xᵢ), … , (Y₁₀₀, X₁₀₀)
Here, each Yᵢ is the distance you ran on day i, and each covariate vector Xᵢ = (Xᵢ₁, Xᵢ₂, Xᵢ₃) corresponds to:
- Xᵢ₁: number of hours spent running,
- Xᵢ₂: number of hours slept the previous night,
- Xᵢ₃: number of hours worked that day.
The index i refers to the 100 days with complete data. With this dataset, you can now fit a multiple linear regression model to estimate the missing response for day 101.
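As a quick illustration of this workflow, here is a minimal sketch in Python. The data, column names (hours_run, hours_slept, hours_worked, distance), and coefficients are hypothetical and simulated only to make the idea concrete:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical log of the first 100 days (hours run, hours slept, hours worked, distance in km),
# simulated here purely for illustration
rng = np.random.default_rng(0)
days = pd.DataFrame({
    "hours_run": rng.uniform(0.5, 2.0, 100),
    "hours_slept": rng.uniform(5.0, 9.0, 100),
    "hours_worked": rng.uniform(4.0, 10.0, 100),
})
days["distance"] = (10 * days["hours_run"] + 0.3 * days["hours_slept"]
                    - 0.1 * days["hours_worked"] + rng.normal(0, 0.5, 100))

# Fit the model on the 100 complete days
model = LinearRegression().fit(days[["hours_run", "hours_slept", "hours_worked"]], days["distance"])

# Day 101: the covariates are known but the distance is missing, so we predict it
day_101 = pd.DataFrame({"hours_run": [1.2], "hours_slept": [7.5], "hours_worked": [8.0]})
print(model.predict(day_101))
```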
Specification of the model
If we assume a linear relationship between the response variable and the covariates, which you can assess using the Pearson correlation, we can specify the model as:

Yᵢ = ∑ⱼ₌₁ᵏ βⱼ Xᵢⱼ + ϵᵢ,  for i = 1, …, n, where E(ϵᵢ | Xᵢ₁, … , Xᵢₖ) = 0.

To account for the intercept, the first covariate is set to Xᵢ₁ = 1 for i = 1, …, n. To estimate the coefficients, the model is expressed in matrix notation. The response vector and the coefficient vector are denoted by:

Y = (Y₁, …, Yₙ)ᵀ and β = (β₁, …, βₖ)ᵀ

And the covariates will be denoted by the n × k design matrix X, whose i-th row is Xᵢ = (Xᵢ₁, …, Xᵢₖ) and whose first column is a column of ones.

Then, with ε = (ϵ₁, …, ϵₙ)ᵀ, we can rewrite the model as:
Y = Xβ + ε
Estimation of coefficients
Assuming that the k × k matrix XᵀX is invertible, the least squares estimate is given by:

β̂ = (XᵀX)⁻¹ XᵀY

We can then derive the estimate of the regression function, an unbiased estimate of σ², and an approximate 1 − α confidence interval for βⱼ:
- Estimate of the regression function: r̂(x) = ∑ⱼ₌₁ᵏ β̂ⱼ xⱼ
- σ̂² = (1 / (n − k)) × ∑ᵢ₌₁ⁿ ϵ̂ᵢ², where ϵ̂ = Y − Xβ̂ is the vector of residuals.
- And β̂ⱼ ± tₙ₋ₖ,₁₋α⁄₂ × SE(β̂ⱼ) is an approximate (1 − α) confidence interval for βⱼ, where SE(β̂ⱼ)² is the jth diagonal element of the matrix σ̂² (XᵀX)⁻¹.
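To make these formulas concrete, here is a minimal numpy sketch on simulated data (the sample size, number of covariates, and true coefficients below are purely illustrative) that computes β̂ = (XᵀX)⁻¹XᵀY, the unbiased estimate σ̂², and approximate 95% confidence intervals:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n, k = 100, 4                                   # n observations, k columns (first column = intercept)
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta_true = np.array([2.0, 1.5, -0.7, 0.3])     # illustrative true coefficients
Y = X @ beta_true + rng.normal(scale=1.0, size=n)

# Least squares estimate: beta_hat = (X'X)^(-1) X'Y
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y

# Unbiased estimate of sigma^2 from the residuals
residuals = Y - X @ beta_hat
sigma2_hat = residuals @ residuals / (n - k)

# Approximate 95% CIs: beta_hat_j +/- t_{n-k, 0.975} * SE(beta_hat_j)
se = np.sqrt(sigma2_hat * np.diag(XtX_inv))
t_crit = stats.t.ppf(0.975, df=n - k)
ci = np.column_stack([beta_hat - t_crit * se, beta_hat + t_crit * se])
print(np.round(beta_hat, 3))
print(np.round(ci, 3))
```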
Example of application
Because we didn’t record the data of our running performance, we will use a crime dataset from 47 states in 1960, which can be obtained from here. Before we fit a linear regression, there are several steps we must follow.
Understanding the different variables of the data
The first 9 observations of the data are given by:
| R | Age | S | Ed | Ex0 | Ex1 | LF | M | N | NW | U1 | U2 | W | X |
|---|-----|---|----|-----|-----|----|---|---|----|----|----|---|---|
| 79.1 | 151 | 1 | 91 | 58 | 56 | 510 | 950 | 33 | 301 | 108 | 41 | 394 | 261 |
| 163.5 | 143 | 0 | 113 | 103 | 95 | 583 | 1012 | 13 | 102 | 96 | 36 | 557 | 194 |
| 57.8 | 142 | 1 | 89 | 45 | 44 | 533 | 969 | 18 | 219 | 94 | 33 | 318 | 250 |
| 196.9 | 136 | 0 | 121 | 149 | 141 | 577 | 994 | 157 | 80 | 102 | 39 | 673 | 167 |
| 123.4 | 141 | 0 | 121 | 109 | 101 | 591 | 985 | 18 | 30 | 91 | 20 | 578 | 174 |
| 68.2 | 121 | 0 | 110 | 118 | 115 | 547 | 964 | 25 | 44 | 84 | 29 | 689 | 126 |
| 96.3 | 127 | 1 | 111 | 82 | 79 | 519 | 982 | 4 | 139 | 97 | 38 | 620 | 168 |
| 155.5 | 131 | 1 | 109 | 115 | 109 | 542 | 969 | 50 | 179 | 79 | 35 | 472 | 206 |
| 85.6 | 157 | 1 | 90 | 65 | 62 | 553 | 955 | 39 | 286 | 81 | 28 | 421 | 239 |
The data has 14 variables: the continuous response variable R, 12 continuous predictors, and one categorical variable S:
- R: Crime rate: number of offenses reported to police per million population
- Age: Number of males aged 14–24 per 1000 population
- S: Indicator variable for Southern states (0 = No, 1 = Yes)
- Ed: Mean number of years of education x 10 for persons aged 25 or older
- Ex0: 1960 per capita expenditure on police by state and local government
- Ex1: 1959 per capita expenditure on police by state and local government
- LF: Labor force participation rate per 1000 civilian urban males aged 14–24
- M: Number of males per 1000 females
- N: State population size in hundred thousands
- NW: Number of non-whites per 1000 population
- U1: Unemployment rate of urban males per 1000, aged 14–24
- U2: Unemployment rate of urban males per 1000, aged 35–39
- W: Median value of transferable goods and assets or family income in tens of dollars
- X: Number of families per 1000 earning below half the median income
The data has no missing values.
Graphical analysis of the relationship between the covariates X and the response variable Y
Graphical analysis of the relationship between the explanatory variables and the response variable is a key step when performing linear regression.
It helps visualize linear trends, detect anomalies, and assess the relevance of variables before building any model.

Some variables are positively correlated with the crime rate, while others are negatively correlated.
For instance, we observe a strong positive relationship between R (the crime rate) and Ex1.
In contrast, Age appears to be negatively correlated with crime.
Finally, the boxplot of the binary variable S (indicating region: North or South) suggests that the crime rate is relatively similar between the two regions. We can then analyze the correlation matrix.
Heatmap of Pearson correlation matrix
The correlation matrix allows us to study the strength of the relationship between variables. While the Pearson correlation is typically used to measure linear relationships, the Spearman correlation is more appropriate when we want to capture monotonic, potentially non-linear relationships between variables.
In this analysis, we will use the Spearman correlation to better account for such non-linear associations.

The first row of the correlation matrix shows the strength of the relationship between each covariate and the response variable R.
For example, Ex0 and Ex1 each show a correlation greater than 60% with R, indicating a strong association. These variables appear to be good predictors of the crime rate.
However, since the correlation between Ex0 and Ex1 is almost perfect, they likely convey similar information. To avoid redundancy, we can select just one of them, preferably the one with the strongest correlation with R.
More generally, when several variables are strongly correlated with one another (a correlation above 60%, for instance), they tend to carry redundant information. In such cases, we keep only one of them, the one most strongly correlated with the response variable R. This allows us to reduce multicollinearity.
This exercise leads us to select the following variables: ['Ex1', 'LF', 'M', 'N', 'NW', 'U2'].
Study of multicollinearity using the VIF (Variance Inflation Factor)
Before fitting the linear regression, it is important to study multicollinearity.
When correlation exists among the predictors, the standard errors of the coefficient estimates increase, leading to an inflation of their variances. The Variance Inflation Factor (VIF) is a diagnostic tool used to measure how much the variance of a predictor’s coefficient is inflated due to multicollinearity, and it is often provided in the regression output under a “VIF” column.

The VIF is calculated for each predictor in the model. The approach is to regress the i-th predictor variable against all the other predictors. From that regression we obtain Rᵢ², which is used to compute the VIF with the formula:

VIFᵢ = 1 / (1 − Rᵢ²)
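A minimal sketch of this manual computation is shown below. It assumes the covariates are in a pandas DataFrame X, as in the code at the end of the post; in practice, statsmodels’ variance_inflation_factor (used later) does the same job:

```python
import statsmodels.api as sm

def manual_vif(X):
    """Compute VIF_i = 1 / (1 - R_i^2) by regressing each column of X on the others."""
    vifs = {}
    for col in X.columns:
        others = sm.add_constant(X.drop(columns=col))   # all other predictors plus an intercept
        r_squared = sm.OLS(X[col], others).fit().rsquared
        vifs[col] = 1.0 / (1.0 - r_squared)
    return vifs

# Usage, assuming X = df[final_selected_variables] as defined in the code at the end of the post:
# print(manual_vif(X))
```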
The table below presents the VIF values for the six remaining variables, all of which are below 5. This indicates that multicollinearity is not a concern, and we can proceed with fitting the linear regression model.

Fitting a linear regression on six variables
If we fit a linear regression of the crime rate on the six selected variables, we get the following:

Diagnosis of residuals
Before interpreting the regression results, we must first assess the quality of the residuals, in particular by checking for autocorrelation, homoscedasticity (constant variance), and normality. The residual diagnostics are given by the table below:

- The Durbin-Watson statistic ≈ 2 indicates no autocorrelation in the residuals.
- From the Omnibus test to the kurtosis, all values show that the residuals are symmetric and approximately normally distributed.
- The low condition number (3.06) confirms that there is no multicollinearity among the predictors.
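These checks can also be reproduced in code. Here is a minimal sketch with statsmodels and scipy, assuming model is the fitted OLS result from the code at the end of the post:

```python
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.diagnostic import het_breuschpagan
from scipy import stats

# 'model' is assumed to be the fitted statsmodels OLS result defined in the code section below
resid = model.resid

# Durbin-Watson: values near 2 suggest no autocorrelation in the residuals
print("Durbin-Watson:", durbin_watson(resid))

# Breusch-Pagan test for heteroscedasticity (null hypothesis: constant variance)
bp_stat, bp_pvalue, _, _ = het_breuschpagan(resid, model.model.exog)
print("Breusch-Pagan p-value:", bp_pvalue)

# Shapiro-Wilk test for normality of the residuals (null hypothesis: normality)
print("Shapiro-Wilk p-value:", stats.shapiro(resid).pvalue)
```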
Major Points to Remember
We can also assess the overall quality of the model through indicators such as the R-squared and the F-statistic, which show satisfactory results in this case.
We can now interpret the regression coefficients from a statistical perspective.
We intentionally exclude any business-specific interpretation of the results.
The objective of this analysis is to illustrate a few simple and essential steps for modeling a problem using multiple linear regression.
At the 5% significance level, two coefficients are statistically significant: Ex1 and NW.
This is not surprising, as these were the two variables that showed a correlation greater than 40% with the response variable R. Variables that are not statistically significant may be removed, re-evaluated, or retained, depending on the study’s context and objectives.
This post gives you some guidelines for performing linear regression:
- It is important to check linearity through graphical analysis and to study the correlation between the response variable and the predictors.
- Examining correlations among the predictors helps reduce multicollinearity and supports variable selection.
- When two predictors are highly correlated, they may convey redundant information. In such cases, you can retain the one that is more strongly correlated with the response or, based on domain expertise, the one with greater business relevance or practical interpretability.
- The Variance Inflation Factor (VIF) is a useful tool to quantify and assess multicollinearity.
- Before interpreting the model coefficients statistically, it is important to check the residuals for autocorrelation and to verify their normality and homoscedasticity, so that the model assumptions are met.
While this analysis provides valuable insights, it also has certain limitations.
The absence of missing values in the dataset simplifies the study, but this is rarely the case in real-world scenarios.
If you are building a predictive model, it is essential to split the data into training, testing, and potentially an out-of-time validation set to ensure a robust evaluation.
For variable selection, techniques such as stepwise selection and other feature selection methods can be applied.
When comparing multiple models, it is essential to define appropriate performance metrics.
In the case of linear regression, commonly used metrics include the Mean Absolute Error (MAE) and the Mean Squared Error (MSE), as in the sketch below.
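As an illustration of that evaluation step, here is a minimal sketch with scikit-learn. The 80/20 split and the random seed are illustrative choices, and df and final_selected_variables are the objects defined in the code at the end of the post:

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Split the data (80% train / 20% test); df and final_selected_variables come from the code section below
X_train, X_test, y_train, y_test = train_test_split(
    df[final_selected_variables], df['R'], test_size=0.2, random_state=0
)

reg = LinearRegression().fit(X_train, y_train)
y_pred = reg.predict(X_test)

print("MAE:", mean_absolute_error(y_test, y_pred))
print("MSE:", mean_squared_error(y_test, y_pred))
```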
Image Credits
All images and visualizations in this article were created by the author using Python (pandas, matplotlib, seaborn, and plotly) and Excel, unless otherwise stated.
References
Wasserman, L. (2013). All of Statistics: A Concise Course in Statistical Inference. Springer Science & Business Media.
Data & Licensing
The dataset used in this article contains crime-related and demographic statistics for 47 U.S. states in 1960.
It originates from the FBI’s Uniform Crime Reporting (UCR) Program and additional U.S. government sources.
As a U.S. government work, the data is in the public domain under 17 U.S. Code § 105 and is free to use, share, and reproduce without restriction.
Sources:
Codes
Import data
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Load the dataset
df = pd.read_csv('data/Multiple_Regression_Dataset.csv')
df.head()
Create a new figure
# Extract response variable and covariates
response = 'R'
covariates = [col for col in df.columns if col != response]
fig, axes = plt.subplots(nrows=4, ncols=4, figsize=(20, 18))
axes = axes.flatten()
# Plot boxplot for binary variable 'S'
sns.boxplot(data=df, x='S', y='R', ax=axes[0])
axes[0].set_title('Boxplot of R by S')
axes[0].set_xlabel('S')
axes[0].set_ylabel('R')
# Plot regression lines for all other covariates
plot_index = 1
for cov in covariates:
    if cov != 'S':
        sns.regplot(data=df, x=cov, y='R', ax=axes[plot_index], scatter=True, line_kws={"color": "red"})
        axes[plot_index].set_title(f'{cov} vs R')
        axes[plot_index].set_xlabel(cov)
        axes[plot_index].set_ylabel('R')
        plot_index += 1
# Hide unused subplots
for i in range(plot_index, len(axes)):
    fig.delaxes(axes[i])
fig.tight_layout()
plt.show()
Evaluation of the correlation between variables
spearman_corr = df.corr(method='spearman')
plt.figure(figsize=(12, 10))
sns.heatmap(spearman_corr, annot=True, cmap="coolwarm", fmt=".2f", linewidths=0.5)
plt.title("Correlation Matrix Heatmap")
plt.show()
Filtering Predictors with High Intercorrelation (ρ > 0.6)
# Step 2: Correlation of each variable with the response R
spearman_corr_with_R = spearman_corr['R'].drop('R')  # exclude R-R
# Step 3: Identify pairs of covariates with strong inter-correlation (> 0.6)
strong_pairs = []
threshold = 0.6
covariates = spearman_corr_with_R.index
for i, var1 in enumerate(covariates):
    for var2 in covariates[i+1:]:
        if abs(spearman_corr.loc[var1, var2]) > threshold:
            strong_pairs.append((var1, var2))
# Step 4: From each correlated pair, keep only the variable most correlated with R
to_keep = set()
to_discard = set()
for var1, var2 in strong_pairs:
    if abs(spearman_corr_with_R[var1]) >= abs(spearman_corr_with_R[var2]):
        to_keep.add(var1)
        to_discard.add(var2)
    else:
        to_keep.add(var2)
        to_discard.add(var1)
# Final selection: all covariates excluding those discarded due to redundancy
final_selected_variables = [var for var in covariates if var not in to_discard]
final_selected_variables
Evaluation of multicollinearity using VIF
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant
from sklearn.preprocessing import StandardScaler
X = df[final_selected_variables]
X_with_const = add_constant(X)
vif_data = pd.DataFrame()
vif_data["variable"] = X_with_const.columns
vif_data["VIF"] = [variance_inflation_factor(X_with_const.values, i)
                   for i in range(X_with_const.shape[1])]
vif_data = vif_data[vif_data["variable"] != "const"]
print(vif_data)
Fit a linear regression model on the six variables after standardization, without splitting the data into train and test sets
from sklearn.preprocessing import StandardScaler
from statsmodels.api import OLS, add_constant
import pandas as pd
# Variables
X = df[final_selected_variables]
y = df['R']
scaler = StandardScaler()
X_scaled_vars = scaler.fit_transform(X)
X_scaled_df = pd.DataFrame(X_scaled_vars, columns=final_selected_variables)
X_scaled_df = add_constant(X_scaled_df)
model = OLS(y, X_scaled_df).fit()
print(model.summary())
