The full code for this example is at the bottom of this post.
Multiple regression is used when your response variable Y is continuous and you have at least k covariates, or independent variables, that are linearly correlated with it. The data are of the form:
(Y₁, X₁), … ,(Yᵢ, Xᵢ), … ,(Yₙ, Xₙ)
where Xᵢ = (Xᵢ₁, …, Xᵢₖ) is the vector of k covariate values for the i-th observation, and n is the number of observations.
Understanding the Data
To make this concrete, imagine the following scenario:
You enjoy running and track your performance by recording the distance you run every day. Over 100 consecutive days, you collect four pieces of information:
- The distance you run,
- The number of hours you spent running,
- The number of hours you slept last night,
- And the number of hours you worked.
On the 101st day, you recorded everything except the distance you ran. You would like to estimate that missing value using the data you do have: the number of hours you spent running, the number of hours you slept the night before, and the number of hours you worked that day.
To do this, you can rely on the data from the previous 100 days, which take the form:
(Y₁, X₁), … , (Yᵢ, Xᵢ), … , (Y₁₀₀, X₁₀₀)
Here, each Yᵢ is the distance you ran on day i, and each covariate vector Xᵢ = (Xᵢ₁, Xᵢ₂, Xᵢ₃) corresponds to:
- Xᵢ₁: number of hours spent running,
- Xᵢ₂: number of hours slept the previous night,
- Xᵢ₃: number of hours worked that day.
The index i refers to the 100 days with complete data. With this dataset, you can now fit a multiple linear regression model to estimate the missing response for day 101.
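As a quick illustration of this workflow, here is a minimal sketch in Python. The data, column names (hours_run, hours_slept, hours_worked, distance), and coefficients are hypothetical and simulated only to make the idea concrete:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical log of the first 100 days (hours run, hours slept, hours worked, distance in km),
# simulated here purely for illustration
rng = np.random.default_rng(0)
days = pd.DataFrame({
    "hours_run": rng.uniform(0.5, 2.0, 100),
    "hours_slept": rng.uniform(5.0, 9.0, 100),
    "hours_worked": rng.uniform(4.0, 10.0, 100),
})
days["distance"] = (10 * days["hours_run"] + 0.3 * days["hours_slept"]
                    - 0.1 * days["hours_worked"] + rng.normal(0, 0.5, 100))

# Fit the model on the 100 complete days
model = LinearRegression().fit(days[["hours_run", "hours_slept", "hours_worked"]], days["distance"])

# Day 101: the covariates are known but the distance is missing, so we predict it
day_101 = pd.DataFrame({"hours_run": [1.2], "hours_slept": [7.5], "hours_worked": [8.0]})
print(model.predict(day_101))
```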
Specification of the model
If we assume a linear relationship between the response variable and the covariates, which you can assess using the Pearson correlation, we can specify the model as:

Yᵢ = ∑ⱼ₌₁ᵏ βⱼ Xᵢⱼ + ϵᵢ,  for i = 1, …, n, where E(ϵᵢ | Xᵢ₁, … , Xᵢₖ) = 0.

To account for the intercept, the first covariate is set to Xᵢ₁ = 1 for i = 1, …, n. To estimate the coefficients, the model is expressed in matrix notation. The response vector and the coefficient vector are denoted by:

Y = (Y₁, …, Yₙ)ᵀ and β = (β₁, …, βₖ)ᵀ

And the covariates will be denoted by the n × k design matrix X, whose i-th row is Xᵢ = (Xᵢ₁, …, Xᵢₖ) and whose first column is a column of ones.

Then, with ε = (ϵ₁, …, ϵₙ)ᵀ, we can rewrite the model as:
Y = Xβ + ε
Estimation of coefficients
Assuming that the k × k matrix XᵀX is invertible, the least squares estimate is given by:

β̂ = (XᵀX)⁻¹ XᵀY

We can then derive the estimate of the regression function, an unbiased estimate of σ², and an approximate 1 − α confidence interval for βⱼ:
- Estimate of the regression function: r̂(x) = ∑ⱼ₌₁ᵏ β̂ⱼ xⱼ
- σ̂² = (1 / (n − k)) × ∑ᵢ₌₁ⁿ ϵ̂ᵢ², where ϵ̂ = Y − Xβ̂ is the vector of residuals.
- And β̂ⱼ ± tₙ₋ₖ,₁₋α⁄₂ × SE(β̂ⱼ) is an approximate (1 − α) confidence interval for βⱼ, where SE(β̂ⱼ)² is the jth diagonal element of the matrix σ̂² (XᵀX)⁻¹.
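To make these formulas concrete, here is a minimal numpy sketch on simulated data (the sample size, number of covariates, and true coefficients below are purely illustrative) that computes β̂ = (XᵀX)⁻¹XᵀY, the unbiased estimate σ̂², and approximate 95% confidence intervals:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n, k = 100, 4                                   # n observations, k columns (first column = intercept)
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta_true = np.array([2.0, 1.5, -0.7, 0.3])     # illustrative true coefficients
Y = X @ beta_true + rng.normal(scale=1.0, size=n)

# Least squares estimate: beta_hat = (X'X)^(-1) X'Y
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y

# Unbiased estimate of sigma^2 from the residuals
residuals = Y - X @ beta_hat
sigma2_hat = residuals @ residuals / (n - k)

# Approximate 95% CIs: beta_hat_j +/- t_{n-k, 0.975} * SE(beta_hat_j)
se = np.sqrt(sigma2_hat * np.diag(XtX_inv))
t_crit = stats.t.ppf(0.975, df=n - k)
ci = np.column_stack([beta_hat - t_crit * se, beta_hat + t_crit * se])
print(np.round(beta_hat, 3))
print(np.round(ci, 3))
```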
Example of application
Because we didn’t record the data of our running performance, we will use a crime dataset from 47 states in 1960, which can be obtained from here. Before we fit a linear regression, there are several steps we must follow.
Understanding the different variables of the data
The first 9 observations of the data are given by:
| R | Age | S | Ed | Ex0 | Ex1 | LF | M | N | NW | U1 | U2 | W | X |
|---|-----|---|----|-----|-----|----|---|---|----|----|----|---|---|
| 79.1 | 151 | 1 | 91 | 58 | 56 | 510 | 950 | 33 | 301 | 108 | 41 | 394 | 261 |
| 163.5 | 143 | 0 | 113 | 103 | 95 | 583 | 1012 | 13 | 102 | 96 | 36 | 557 | 194 |
| 57.8 | 142 | 1 | 89 | 45 | 44 | 533 | 969 | 18 | 219 | 94 | 33 | 318 | 250 |
| 196.9 | 136 | 0 | 121 | 149 | 141 | 577 | 994 | 157 | 80 | 102 | 39 | 673 | 167 |
| 123.4 | 141 | 0 | 121 | 109 | 101 | 591 | 985 | 18 | 30 | 91 | 20 | 578 | 174 |
| 68.2 | 121 | 0 | 110 | 118 | 115 | 547 | 964 | 25 | 44 | 84 | 29 | 689 | 126 |
| 96.3 | 127 | 1 | 111 | 82 | 79 | 519 | 982 | 4 | 139 | 97 | 38 | 620 | 168 |
| 155.5 | 131 | 1 | 109 | 115 | 109 | 542 | 969 | 50 | 179 | 79 | 35 | 472 | 206 |
| 85.6 | 157 | 1 | 90 | 65 | 62 | 553 | 955 | 39 | 286 | 81 | 28 | 421 | 239 |
The data has 14 variables: the continuous response variable R, 12 continuous predictors, and one categorical variable S:
- R: Crime rate: number of offenses reported to police per million population
- Age: Number of males aged 14–24 per 1000 population
- S: Indicator variable for Southern states (0 = No, 1 = Yes)
- Ed: Mean number of years of education x 10 for persons aged 25 or older
- Ex0: 1960 per capita expenditure on police by state and local government
- Ex1: 1959 per capita expenditure on police by state and local government
- LF: Labor force participation rate per 1000 civilian urban males aged 14–24
- M: Number of males per 1000 females
- N: State population size in hundred thousands
- NW: Number of non-whites per 1000 population
- U1: Unemployment rate of urban males per 1000, aged 14–24
- U2: Unemployment rate of urban males per 1000, aged 35–39
- W: Median value of transferable goods and assets or family income in tens of dollars
- X: Number of families per 1000 earning below half the median income
The data has no missing values.
Graphical analysis of the relationship between the covariates X and the response variable Y
Graphical analysis of the relationship between the explanatory variables and the response variable is a key step when performing linear regression.
It helps visualize linear trends, detect anomalies, and assess the relevance of variables before building any model.

Some variables are positively correlated with the crime rate, while others are negatively correlated.
For instance, we observe a strong positive relationship between R (the crime rate) and Ex1.
In contrast, Age appears to be negatively correlated with crime.
Finally, the boxplot of the binary variable S (indicating region: North or South) suggests that the crime rate is relatively similar between the two regions. We can then analyze the correlation matrix.
Heatmap of Pearson correlation matrix
The correlation matrix allows us to study the strength of the relationship between variables. While the Pearson correlation is typically used to measure linear relationships, the Spearman correlation is more appropriate when we want to capture monotonic, potentially non-linear relationships between variables.
In this analysis, we will use the Spearman correlation to better account for such non-linear associations.

The first row of the correlation matrix shows the strength of the relationship between each covariate and the response variable R.
For example, Ex0 and Ex1 each show a correlation greater than 60% with R, indicating a strong association. These variables appear to be good predictors of the crime rate.
However, since the correlation between Ex0 and Ex1 is almost perfect, they likely convey similar information. To avoid redundancy, we can select just one of them, preferably the one with the strongest correlation with R.
More generally, when several variables are strongly correlated with one another (a correlation above 60%, for instance), they tend to carry redundant information. In such cases, we keep only one of them, the one most strongly correlated with the response variable R. This allows us to reduce multicollinearity.
This exercise leads us to select the following variables: ['Ex1', 'LF', 'M', 'N', 'NW', 'U2'].
Study of multicollinearity using the VIF (Variance Inflation Factor)
Before fitting the linear regression, it is important to study multicollinearity.
When correlation exists among the predictors, the standard errors of the coefficient estimates increase, leading to an inflation of their variances. The Variance Inflation Factor (VIF) is a diagnostic tool used to measure how much the variance of a predictor’s coefficient is inflated due to multicollinearity, and it is often provided in the regression output under a “VIF” column.

The VIF is calculated for each predictor in the model. The approach is to regress the i-th predictor variable against all the other predictors. From that regression we obtain Rᵢ², which is used to compute the VIF with the formula:

VIFᵢ = 1 / (1 − Rᵢ²)
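A minimal sketch of this manual computation is shown below. It assumes the covariates are in a pandas DataFrame X, as in the code at the end of the post; in practice, statsmodels’ variance_inflation_factor (used later) does the same job:

```python
import statsmodels.api as sm

def manual_vif(X):
    """Compute VIF_i = 1 / (1 - R_i^2) by regressing each column of X on the others."""
    vifs = {}
    for col in X.columns:
        others = sm.add_constant(X.drop(columns=col))   # all other predictors plus an intercept
        r_squared = sm.OLS(X[col], others).fit().rsquared
        vifs[col] = 1.0 / (1.0 - r_squared)
    return vifs

# Usage, assuming X = df[final_selected_variables] as defined in the code at the end of the post:
# print(manual_vif(X))
```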
The table below presents the VIF values for the six remaining variables, all of which are below 5. This indicates that multicollinearity is not a concern, and we can proceed with fitting the linear regression model.

Fitting a linear regression on six variables
If we fit a linear regression of the crime rate on the six selected variables, we get the following:

Diagnosis of residuals
Before interpreting the regression results, we must first assess the quality of the residuals, in particular by checking for autocorrelation, homoscedasticity (constant variance), and normality. The residual diagnostics are given by the table below:

- The Durbin-Watson statistic ≈ 2 indicates no autocorrelation in the residuals.
- From the Omnibus test to the kurtosis, all values show that the residuals are symmetric and approximately normally distributed.
- The low condition number (3.06) confirms that there is no multicollinearity among the predictors.
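These checks can also be reproduced in code. Here is a minimal sketch with statsmodels and scipy, assuming model is the fitted OLS result from the code at the end of the post:

```python
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.diagnostic import het_breuschpagan
from scipy import stats

# 'model' is assumed to be the fitted statsmodels OLS result defined in the code section below
resid = model.resid

# Durbin-Watson: values near 2 suggest no autocorrelation in the residuals
print("Durbin-Watson:", durbin_watson(resid))

# Breusch-Pagan test for heteroscedasticity (null hypothesis: constant variance)
bp_stat, bp_pvalue, _, _ = het_breuschpagan(resid, model.model.exog)
print("Breusch-Pagan p-value:", bp_pvalue)

# Shapiro-Wilk test for normality of the residuals (null hypothesis: normality)
print("Shapiro-Wilk p-value:", stats.shapiro(resid).pvalue)
```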
Major Points to Remember
We can also assess the overall quality of the model through indicators such as the R-squared and the F-statistic, which show satisfactory results in this case.
We can now interpret the regression coefficients from a statistical perspective.
We intentionally exclude any business-specific interpretation of the results.
The objective of this analysis is to illustrate a few simple and essential steps for modeling a problem using multiple linear regression.
At the 5% significance level, two coefficients are statistically significant: Ex1 and NW.
This is not surprising, as these were the two variables that showed a correlation greater than 40% with the response variable R. Variables that are not statistically significant may be removed, re-evaluated, or retained, depending on the study’s context and objectives.
This post gives you some guidelines for performing linear regression:
- It is important to check linearity through graphical analysis and to study the correlation between the response variable and the predictors.
- Examining correlations among the predictors helps reduce multicollinearity and supports variable selection.
- When two predictors are highly correlated, they may convey redundant information. In such cases, you can retain the one that is more strongly correlated with the response or, based on domain expertise, the one with greater business relevance or practical interpretability.
- The Variance Inflation Factor (VIF) is a useful tool to quantify and assess multicollinearity.
- Before interpreting the model coefficients statistically, it is important to check the residuals for autocorrelation and to verify their normality and homoscedasticity, so that the model assumptions are met.
While this analysis provides valuable insights, it also has certain limitations.
The absence of missing values in the dataset simplifies the study, but this is rarely the case in real-world scenarios.
If you are building a predictive model, it is essential to split the data into training, testing, and potentially an out-of-time validation set to ensure a robust evaluation.
For variable selection, techniques such as stepwise selection and other feature selection methods can be applied.
When comparing multiple models, it is essential to define appropriate performance metrics.
In the case of linear regression, commonly used metrics include the Mean Absolute Error (MAE) and the Mean Squared Error (MSE), as in the sketch below.
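As an illustration of that evaluation step, here is a minimal sketch with scikit-learn. The 80/20 split and the random seed are illustrative choices, and df and final_selected_variables are the objects defined in the code at the end of the post:

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Split the data (80% train / 20% test); df and final_selected_variables come from the code section below
X_train, X_test, y_train, y_test = train_test_split(
    df[final_selected_variables], df['R'], test_size=0.2, random_state=0
)

reg = LinearRegression().fit(X_train, y_train)
y_pred = reg.predict(X_test)

print("MAE:", mean_absolute_error(y_test, y_pred))
print("MSE:", mean_squared_error(y_test, y_pred))
```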
Image Credits
All images and visualizations in this article were created by the author using Python (pandas, matplotlib, seaborn, and plotly) and Excel, unless otherwise stated.
References
Wasserman, L. (2013). All of Statistics: A Concise Course in Statistical Inference. Springer Science & Business Media.
Data & Licensing
The dataset used in this article contains crime-related and demographic statistics for 47 U.S. states in 1960.
It originates from the FBI’s Uniform Crime Reporting (UCR) Program and additional U.S. government sources.
As a U.S. government work, the data is in the public domain under 17 U.S. Code § 105 and is free to use, share, and reproduce without restriction.
Sources:
Codes
Import data
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Load the dataset
df = pd.read_csv('data/Multiple_Regression_Dataset.csv')
df.head()
Create a new figure
# Extract response variable and covariates
response = 'R'
covariates = [col for col in df.columns if col != response]
fig, axes = plt.subplots(nrows=4, ncols=4, figsize=(20, 18))
axes = axes.flatten()
# Plot boxplot for binary variable 'S'
sns.boxplot(data=df, x='S', y='R', ax=axes[0])
axes[0].set_title('Boxplot of R by S')
axes[0].set_xlabel('S')
axes[0].set_ylabel('R')
# Plot regression lines for all other covariates
plot_index = 1
for cov in covariates:
    if cov != 'S':
        sns.regplot(data=df, x=cov, y='R', ax=axes[plot_index], scatter=True, line_kws={"color": "red"})
        axes[plot_index].set_title(f'{cov} vs R')
        axes[plot_index].set_xlabel(cov)
        axes[plot_index].set_ylabel('R')
        plot_index += 1
# Hide unused subplots
for i in range(plot_index, len(axes)):
    fig.delaxes(axes[i])
fig.tight_layout()
plt.show()
Evaluation of the correlation between variables
spearman_corr = df.corr(method='spearman')
plt.figure(figsize=(12, 10))
sns.heatmap(spearman_corr, annot=True, cmap="coolwarm", fmt=".2f", linewidths=0.5)
plt.title("Correlation Matrix Heatmap")
plt.show()
Filtering Predictors with High Intercorrelation (ρ > 0.6)
# Step 2: Correlation of each variable with the response R
spearman_corr_with_R = spearman_corr['R'].drop('R')  # exclude R-R
# Step 3: Identify pairs of covariates with strong inter-correlation (> 0.6)
strong_pairs = []
threshold = 0.6
covariates = spearman_corr_with_R.index
for i, var1 in enumerate(covariates):
    for var2 in covariates[i+1:]:
        if abs(spearman_corr.loc[var1, var2]) > threshold:
            strong_pairs.append((var1, var2))
# Step 4: From each correlated pair, keep only the variable most correlated with R
to_keep = set()
to_discard = set()
for var1, var2 in strong_pairs:
    if abs(spearman_corr_with_R[var1]) >= abs(spearman_corr_with_R[var2]):
        to_keep.add(var1)
        to_discard.add(var2)
    else:
        to_keep.add(var2)
        to_discard.add(var1)
# Final selection: all covariates excluding those discarded due to redundancy
final_selected_variables = [var for var in covariates if var not in to_discard]
final_selected_variables
Evaluation of multicollinearity using VIF
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant
from sklearn.preprocessing import StandardScaler
X = df[final_selected_variables]
X_with_const = add_constant(X)
vif_data = pd.DataFrame()
vif_data["variable"] = X_with_const.columns
vif_data["VIF"] = [variance_inflation_factor(X_with_const.values, i)
                   for i in range(X_with_const.shape[1])]
vif_data = vif_data[vif_data["variable"] != "const"]
print(vif_data)
Fit a linear regression model on the six variables after standardization, without splitting the data into train and test sets
from sklearn.preprocessing import StandardScaler
from statsmodels.api import OLS, add_constant
import pandas as pd
# Variables
X = df[final_selected_variables]
y = df['R']
scaler = StandardScaler()
X_scaled_vars = scaler.fit_transform(X)
X_scaled_df = pd.DataFrame(X_scaled_vars, columns=final_selected_variables)
X_scaled_df = add_constant(X_scaled_df)
model = OLS(y, X_scaled_df).fit()
print(model.summary())
