

Mastering Linear Regression: The Definitive Guide For Aspiring Data Scientists
What do we mean by “regression analysis”?
Understanding correlation
The difference between correlation and regression
The Linear Regression model
Assumptions for the Linear Regression model
Finding the line that best fits the data
Graphical methods to validate your ML model
An example in Python
Conclusions

Image by Dariusz Sankowski on Pixabay

If you’re approaching Machine Learning, one of the first models you may encounter is Linear Regression. It’s probably the simplest model to understand, but don’t underestimate it: there are many things to understand and master.

If you’re a beginner in Data Science or an aspiring Data Scientist, you’re probably facing some difficulties because the resources out there are plentiful but fragmented. I know how you feel, and this is why I created this complete guide: I want to give you all the knowledge you need without you having to search for anything else.

So, if you want complete knowledge of Linear Regression, this article is for you. You can study it deeply and re-read it whenever you need it most. Also, consider that, to cover this topic, we’ll need some knowledge generally related to regression analysis: we’ll cover it in depth.

And… you’ll excuse me if I link a resource you’ll need: in the past, I’ve created an article on some topics related to Linear Regression so, to have a complete overview, I advise you to read it (I’ll link it later when we need it).

What do we mean by “regression analysis”?

Here we’re studying Linear Regression, but what do we mean by “regression analysis”? Paraphrasing Wikipedia:

Regression analysis is a mathematical technique used to find a functional relationship between a dependent variable and one or more independent variable(s).

In other words, we know that in mathematics we can define a function like so: y = f(x). Generally, y is called the dependent variable and x the independent one. So, we express y in relation to x, using a certain function f. The aim of regression analysis is, then, to find the function f.

Now, this seems easy, but it isn’t. And I know you know it. The reason why it isn’t easy is:

  • We know x and y. For example, if we’re working with tabular data (with Pandas, for instance), x are the features and y is the label.
  • Unfortunately, the data rarely follow a very clear path. So our job is to find the best function f that describes the relationship between x and y.

So, let me summarize it: regression analysis aims to find an estimated relationship (a good one!) between the dependent and the independent variable(s).

Now, let’s visualize why this process can be difficult. Consider the following code and its output:

import numpy as np
import matplotlib.pyplot as plt

# Create random, roughly linear data
a = 130  # number of data points

x = 6*np.random.rand(a,1)-3
y = 0.5*x+5+np.random.rand(a,1)

# Labels
plt.xlabel('x')
plt.ylabel('y')

# Plot a scatterplot
plt.scatter(x,y)
plt.show()

The output of the above code. Image by Creator.

Now, tell me: can the relationship between x and y be a line? So… can this data be approximated by a line? Like the following one, for example:

A line approximating the given data. Image by Creator.

Stop reading for a moment and think about that.

Well, it could. And what about the following one?

A curve approximating the given data. Image by Creator.

Well, even this could! So, which is the best one? And why not another one?

That is the aim of regression: to find the best estimated function that can approximate the given data. And it does so using some methodologies: we’ll cover them later in this article. We’ll apply them to the Linear Regression model, but some of them can be used with any other regression technique. Don’t worry: I’ll be very specific so that you don’t get confused.

Understanding correlation

Quoting from Wikipedia:

In statistics, correlation is any statistical relationship, whether causal or not, between two random variables. Although in the broadest sense, “correlation” may indicate any type of association, in statistics it usually refers to the degree to which a pair of variables are linearly related.

In other words, correlation is a statistical measure that expresses the degree to which two variables are linearly related.

We can say that two variables are correlated if each value of the first variable corresponds to a value of the second variable, following a path. If two variables are highly correlated, the path will be linear, because the correlation describes the linear relation between the variables.

The mathematics behind the correlation

This is a comprehensive guide, as promised. So, I want to cover the mathematics behind the correlation, but don’t worry: we’ll make it easy so that you can understand it even if you’re not specialized in math.

We generally refer to the correlation coefficient, also known as the Pearson correlation coefficient. This gives an estimate of the correlation between two variables. Suppose we have two variables, a and b, each taking n values. We can calculate the correlation coefficient as follows:

$$r_{ab} = \frac{\frac{1}{n}\sum_{i=1}^{n}(a_i - \bar{a})(b_i - \bar{b})}{\sigma_a \, \sigma_b}$$

Where we have:

  • the mean value of a (the same applies to both variables, a and b):

$$\bar{a} = \frac{1}{n}\sum_{i=1}^{n} a_i$$

  • the standard deviation and the variance of a (again, the same applies to b):

$$\sigma_a = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(a_i - \bar{a})^2}, \qquad \sigma_a^2 = \frac{1}{n}\sum_{i=1}^{n}(a_i - \bar{a})^2$$

So, putting it all together:

$$r_{ab} = \frac{\sum_{i=1}^{n}(a_i - \bar{a})(b_i - \bar{b})}{\sqrt{\sum_{i=1}^{n}(a_i - \bar{a})^2}\,\sqrt{\sum_{i=1}^{n}(b_i - \bar{b})^2}}$$

As you may know:

  • the mean is the sum of all the values of a variable divided by the number of values. So, for example, if our variable a has the values 1, 3, 7, 13, 25, the mean value of a will be:

$$\bar{a} = \frac{1 + 3 + 7 + 13 + 25}{5} = 9.8$$

  • the standard deviation is an index of statistical dispersion and is an estimate of the variability of a variable (or of a population, as we would say in statistics). It’s one of the ways to express the dispersion of the data around an index; in the case of the correlation coefficient, the index around which we calculate the dispersion is the mean (see the above formula). The higher the standard deviation, the higher the dispersion around the mean: the majority of the data points are distant from the mean value.

Numerically speaking, we have to keep in mind that the value of the correlation coefficient is constrained between -1 and 1; this means:

  • if r = 1: the variables are highly positively correlated; it means that if one variable increases its value, the other one does the same, following a linear path.
  • if r = -1: the variables are highly negatively correlated; it means that if one variable increases its value, the other one decreases its value, following a linear path.
  • if r = 0: there is no (linear) correlation between the variables.

Finally, two variables are generally considered highly correlated if |r| > 0.75.
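To make this concrete, here’s a minimal sketch (with made-up numbers, not taken from this article’s data) that computes the Pearson coefficient both directly from the formula above and with NumPy’s built-in np.corrcoef:

import numpy as np

# Two made-up variables with n = 5 values each
a = np.array([1, 3, 7, 13, 25])
b = np.array([2, 5, 11, 20, 38])

# Pearson coefficient computed from the formula above
r_manual = np.sum((a - a.mean()) * (b - b.mean())) / (len(a) * a.std() * b.std())

# Same result with NumPy's built-in correlation matrix
r_numpy = np.corrcoef(a, b)[0, 1]

print(f"r (manual): {r_manual:.4f}")
print(f"r (numpy):  {r_numpy:.4f}")

The two values coincide, since np.corrcoef simply normalizes the covariance by the two standard deviations.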

Correlation is not causation

We need to be very clear about the fact that “correlation is not causation”; let’s make an example that can be useful to remember it.

It’s a hot summer; we don’t like the high temperatures in our city, so we go to the mountains. Luckily, we get to the mountain top, measure the temperature, and find it’s lower than in our city. We get a little suspicious, and we decide to go to a higher mountain, finding that the temperature is even lower than on the previous mountain.

We try mountains of different heights, measure the temperature, and plot a graph; we find that as the height of the mountain increases, the temperature decreases, and we can see a linear trend.

What does it mean? It means that the temperature is related to the height of the mountains, following a linear path: so there’s a correlation between the decrease in temperature and the height (of the mountains). It doesn’t mean that the height of the mountain caused the decrease in temperature; in fact, if we reached the same height, at the same latitude, with a hot air balloon, we’d measure the same temperature.

The correlation matrix

So, how do we calculate the correlation coefficient in Python? Well, we generally calculate the correlation matrix. Suppose we have two variables, x and y; we store them in a data frame called df and we can plot the correlation matrix using seaborn like so:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Create data
x = np.array([1, 1, 2, 3, 4, 4, 5, 6, 7, 7, 8, 9])
y = np.array([13, 14, 17, 12, 23, 24, 25, 25, 24, 28, 32, 33])

# Create the dataframe
df = pd.DataFrame({'x':x, 'y':y})

# Plot heat map for correlation coefficient
sns.heatmap(df.corr(), annot=True, fmt="0.2")

The correlation matrix for the above code. Image by Creator.

If we have a correlation coefficient of 0, it means that the data points don’t tend to increase or decrease following a linear path, because we have no correlation.

Let’s have a look at some plots of correlation coefficients with different values (image from Wikipedia here):

Data distribution with different correlation values. Image rights for distribution here.

As we can see, when the correlation coefficient is equal to 1 or -1, the tendency of the data points is clearly to lie along a line. But, as the correlation coefficient deviates from these two extreme values, the distribution of the data points deviates from a linear path. Finally, for a correlation coefficient of 0, the distribution of the data can be anything.

So, when we get a correlation coefficient of 0 we can’t say anything about the distribution of the data, but we can investigate it (if needed) with a regression analysis.

The difference between correlation and regression

So, correlation and regression are linked, but they are different:

  • Correlation analyzes the tendency of two variables to be linearly related.
  • Regression is the study of the relationship between variables.

The Linear Regression model

We have two types of Linear Regression models: the Simple and the Multiple one. Let’s see them both.

The Simple Linear Regression model

The goal of Simple Linear Regression is to model the relationship between a single feature and a continuous label. This is the mathematical equation that describes this ML model:

y = wx + b

The parameter b (also called “bias”) represents the y-axis intercept (the value of y when x = 0), and w is the weight coefficient. Our goal is to learn the weight w that describes the relationship between x and y. This weight will later be used to predict the response for new values of x.

Let’s consider a practical example:

import numpy as np
import matplotlib.pyplot as plt

# Create data
x = np.array([1, 1, 2, 3, 4, 4, 5, 6, 7, 7, 8, 9])
y = np.array([13, 14, 17, 12, 23, 24, 25, 25, 24, 28, 32, 33])

# Show scatterplot
plt.scatter(x, y)

The output of the above code. Image by Creator.

The question is: can this data distribution be approximated with a line? Well, we could create something like this:

import numpy as np
import matplotlib.pyplot as plt

# Create data
x = np.array([1, 1, 2, 3, 4, 4, 5, 6, 7, 7, 8, 9])
y = np.array([13, 14, 17, 12, 23, 24, 25, 25, 24, 28, 32, 33])

# Create basic scatterplot
plt.plot(x, y, 'o')

# Obtain m (slope) and b (intercept) of a line
m, b = np.polyfit(x, y, 1)

# Add linear regression line to scatterplot
plt.plot(x, m*x+b)

# Labels
plt.xlabel('x variable')
plt.ylabel('y variable')

The output of the above code. Image by Creator.

Well, as in the example we’ve seen above, it could be a line, but it could also be a general curve.

In a moment we’ll see how we can tell whether the data distribution is better described by a line or by a general curve.

The Multiple Linear Regression model

Since reality is complex, the typical cases we’ll face are related to the Multiple Linear Regression case. We mean that the feature x is not a single one: we’ll have multiple features. For example, if we work with tabular data, a data frame with 9 columns has 8 features and 1 label: this means our problem is eight-dimensional.

As we can understand, this case is very complicated to visualize, and the equation of the line has to be expressed with vectors and matrices, becoming:

$$y = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b = \mathbf{w}^{T}\mathbf{x} + b$$

So, the equation of the line becomes the sum of all the weights (w) multiplied by the corresponding independent variables (x), plus the bias, and it can also be written as the product of two matrices.
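As a minimal sketch (with made-up numbers, not from the article), here’s what that vector/matrix form looks like in NumPy: each row of the feature matrix is multiplied by the weight vector and summed, and then the bias is added.

import numpy as np

# A made-up feature matrix X: 4 samples, 3 features each
X = np.array([[1.0, 2.0, 3.0],
              [2.0, 0.5, 1.0],
              [3.0, 1.5, 2.5],
              [4.0, 2.5, 0.5]])

# Made-up weights (one per feature) and bias
w = np.array([0.4, 1.2, -0.3])
b = 5.0

# Multiple Linear Regression as a matrix product: y = Xw + b
y_pred = X @ w + b

print(y_pred)  # one predicted value per sample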

Assumptions for the Linear Regression model

Now, to apply the Linear Regression model, our data should respect some assumptions. These are:

  1. Linearity: the relationship between the dependent variable and the independent variables should be linear. This means that a change in an independent variable should lead to a proportional change in the dependent variable, following a linear path.
  2. Independence: the observations in the dataset should be independent of each other. This means that the value of one observation should not depend on the value of another observation.
  3. Homoscedasticity: the variance of the residuals should be constant across all levels of the independent variables. In other words, the spread of the residuals should be roughly the same across all levels of the independent variables.
  4. Normality: the residuals should be normally distributed. In other words, the distribution of the residuals should be a normal (bell-shaped) curve.
  5. No multicollinearity: the independent variables should not be highly correlated with each other. If two or more independent variables are highly correlated, it can be difficult to distinguish the individual effects of each variable on the dependent variable. (A short sketch of how a couple of these assumptions can be checked in code follows right after this list.)
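Purely as an illustration (a sketch under my own assumptions, with synthetic data, not part of the article’s workflow), here’s one common way to check a couple of these assumptions in practice: the variance inflation factor (VIF) from statsmodels for multicollinearity, and a simple histogram of the residuals for their normality.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Made-up data: 200 samples, 3 features, roughly linear relationship
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2 * X[:, 0] - 1 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=200)

# Multicollinearity check: VIF per feature (values well above ~5-10 are a warning sign)
for i in range(X.shape[1]):
    print(f"VIF for feature {i}: {variance_inflation_factor(X, i):.2f}")

# Normality of the residuals, checked graphically with a histogram
reg = LinearRegression().fit(X, y)
residuals = y - reg.predict(X)
plt.hist(residuals, bins=20)
plt.title('Distribution of the residuals')
plt.show()

Note that this sketch only touches assumptions n°4 and n°5; the others are harder to check mechanically, which is exactly the point of the next paragraph.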

Unfortunately, testing all these hypotheses is not always possible, especially in the case of the Multiple Linear Regression model. Anyway, there is a way to test all the hypotheses. It’s called the p-value test, and maybe you’ve heard of it before. Anyway, we won’t cover this test here, for two reasons:

  1. It’s a general test, not specifically related to the Linear Regression model. So, it needs a specific treatment in a dedicated article.
  2. I’m one of those (maybe one of the few) who believes that calculating the p-value is not always a must when we need to analyze data. For that reason, I’ll create a dedicated article on this controversial topic in the future. But just for the sake of curiosity: since I’m an engineer, I have a very practical approach, and I like applied mathematics. I wrote an article on this topic here:

Finding the line that best fits the data

So, above we were reasoning about which of the following could be the best fit:

A comparison between models. Image by Creator.

To understand whether the best model is the left one (the line) or the right one (a general curve), we proceed as follows:

  • We split the data we have into a training and a test set.
  • We validate both models on both sets, testing how well they generalize what they’ve learned.

We won’t cover the polynomial model here (useful for general curves), but consider that there are two approaches to validate ML models:

  • The analytical one.
  • The graphical one.

Generally speaking, we’ll use both to get a better understanding of the performance of the model. Anyway, generalization means that our ML model learns from the training set and performs well on data it has never seen before. If it doesn’t, we try another ML model. Here’s the process:

The workflow of training and validating ML models. Image by Creator.

This means that the process may be iterative: if a model doesn’t generalize well, we go back and try another one.

I’ve discussed the analytical way to validate an ML model in the case of linear regression in the following article:

I advise you to read it because we’ll use some of the metrics discussed there in the example at the end of this article.

Of course, the metrics discussed there can be applied to any ML model in the case of a regression problem. But you’re lucky: I used the linear model as an example.

The graphical ways to validate an ML model in the case of a regression problem are discussed in the following paragraphs.

Graphical methods to validate your ML model

Let’s see three graphical ways to validate our ML models.

1. The residual analysis plot

This method is specific to the Linear Regression model and consists in visualizing how the residuals are distributed. Here’s what we expect:

A residual analysis plot. Image by Creator.

To plot this we can use the built-in function sns.residplot() in Seaborn (here’s the documentation).

A plot like this is good because we want to see randomly distributed data points along the horizontal axis. One of the assumptions of the Linear Regression model, in fact, is that the residuals should be normally distributed (assumption n°4 listed above). If the residuals are normally distributed, it means that the errors between the observed and the predicted values are randomly distributed around zero, with no clear pattern or trend; and this is precisely the case in our plot. So, in these cases, our ML model may be a good one.
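As a minimal sketch (with synthetic data made up for illustration), here’s how such a residual plot can be produced with sns.residplot():

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Synthetic, roughly linear data with some random noise
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 100)
y = 2 * x + 1 + rng.normal(0, 1, 100)

# residplot fits a simple linear regression of y on x
# and plots the residuals against x
sns.residplot(x=x, y=y)

plt.xlabel('x')
plt.ylabel('Residuals')
plt.title('Residual analysis plot')
plt.show()

With noise that is random around the fitted line, the points scatter around the horizontal zero line with no visible pattern, which is exactly what we want to see.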

Instead, if there is a particular pattern in our residual plot, our model is not good for our ML problem. For example, consider the following:

A parabolic residual analysis plot. Image by Creator.

In this case, we can see that there is a parabolic trend: this means that our model (the Linear model) is not good for solving our ML problem.

2. The actual vs. predicted values plot

Another plot we can use to validate our ML model is the actual vs. predicted values plot. In this case, we plot a graph having the actual values on the horizontal axis and the predicted values on the vertical axis. The goal is to find the data points distributed as close as possible to a line, in the case of Linear Regression. We can even use this method in the case of a polynomial regression: in this case, we’d expect the data distributed as close as possible to a generic curve.

Suppose we have a result as follows:

An actual vs. predicted values plot in the case of linear regression. Image by Creator.

The above graph shows that the predicted data points are distributed along a line. It is not a perfect linear distribution, so the linear model may not be ideal.

If, for our specific problem, we have y_train (the label on the training set) and we have calculated y_train_pred (the prediction on the training set), we can plot the graph like so:

import matplotlib.pyplot as plt

# Scatterplot of y_train and y_train_pred
plt.scatter(y_train, y_train_pred)
plt.plot(y_train, y_train, color='r') # Plot the identity line as a reference

# Labels
plt.title('ACTUAL VS PREDICTED VALUES')
plt.xlabel('ACTUAL VALUES')
plt.ylabel('PREDICTED VALUES')

3. The Kernel Density Estimation (KDE) plot

The last graph we want to discuss for validating our ML models is the Kernel Density Estimation (KDE) plot. This is a general method and can be used to validate both regression and classification models.

The KDE is the application of a kernel smoother for probability density estimation. A kernel smoother is a statistical method used to estimate a function as the weighted average of the neighboring observed data. The kernel defines the weight, giving a higher weight to closer data points.

To understand the usefulness of a smoothing function, see the graph below:

The concept behind KDE. Image by Creator.

It is useful to approximate our data points with a smoothing function when we want to compare two quantities. In the case of an ML problem, in fact, we typically like to see the comparison between the actual labels and the labels predicted by our model, so we use the KDE to compare two smoothed functions.

Let’s say we’ve predicted our labels using a linear regression model. We want to compare the KDEs of our training set’s actual and predicted labels. We can do so with Seaborn, invoking the method sns.kdeplot() (here’s the documentation).

Suppose we have the following result:

A KDE plot. Image by Creator.

As we can see, the comparison between the actual and the predicted labels is easy to do, since we’re comparing two smoothed functions; in a case like this, our model is good because the curves are very similar.

In fact, what we expect from a “good” ML model is:

  1. The curves should be as similar to bell curves as possible.
  2. The two curves should be as similar to each other as possible.

An example in Python

Now, let’s apply everything we’ve learned so far. We’ll use the famous “Ames Housing” dataset, which is perfect for our purposes.

This dataset has 80 features, but for simplicity we’ll work with only a subset of them, which are:

  • Overall Qual: the rating of the overall material and finish of the house on a scale from 1 (bad) to 10 (excellent).
  • Overall Cond: the rating of the overall condition of the house on a scale from 1 (bad) to 10 (excellent).
  • Gr Liv Area: the above-ground living area, measured in square feet.
  • Total Bsmt SF: the total basement area, measured in square feet.
  • SalePrice: the sale price, in USD.

We’ll consider the SalePrice column as the target (label) variable, and the other columns as the features.

Exploratory Data Analysis (EDA)

Let’s import our data, create a subset with the mentioned features, and display some statistics:

import pandas as pd

# Define the columns
columns = ['Overall Qual', 'Overall Cond', 'Gr Liv Area',
'Total Bsmt SF', 'SalePrice']

# Create dataframe
df = pd.read_csv('http://jse.amstat.org/v19n3/decock/AmesHousing.txt',
                 sep='\t', usecols=columns)

# Show statistics
df.describe()

Statistics of the dataset. Image by Creator.

An important observation here is that the mean values of the columns have very different ranges (the Overall Qual mean value is 6.09 while the Gr Liv Area mean value is 1499.69). This tells us an important fact: we have to scale the features.

Data preparation

What does “scaling the features” mean?

Scaling a feature means that the feature’s range is rescaled, typically between 0 and 1 or between -1 and 1. There are two typical methods to scale the features:

  • Min-max normalization (often simply called normalization): a method of scaling numeric data so that the minimum value becomes 0 and the maximum value becomes 1. Suppose c is a value reached by our feature a; to normalize it (c′ is the new value of c after the normalization process):

$$c' = \frac{c - \min(a)}{\max(a) - \min(a)}$$

Let’s see an example in Python:

import numpy as np

# Create a list of numbers
data = [1, 2, 3, 4, 5]

# Find min and max values
data_min = min(data)
data_max = max(data)

# Normalize the data (min-max)
data_normalized = [(x - data_min) / (data_max - data_min) for x in data]

# Print the normalized data
print(f'normalized data: {data_normalized}')

>>>

normalized data: [0.0, 0.25, 0.5, 0.75, 1.0]

  • Standardization (or z-score normalization): this method transforms a variable so that it has a mean of 0 and a standard deviation of 1. The formula is the following (c′ is the new value of c after the standardization process, and ā and σₐ are the mean and the standard deviation of the feature a):

$$c' = \frac{c - \bar{a}}{\sigma_a}$$

Let’s see an example in Python:

import numpy as np

# Original data
data = [1, 2, 3, 4, 5]

# Calculate mean and standard deviation
mean = np.mean(data)
std = np.std(data)

# Standardize the data
data_standardized = [(x - mean) / std for x in data]

# Print the standardized data
print(f'standardized values: {data_standardized}')
print(f'mean of standardized values: {np.mean(data_standardized)}')
print(f'std. dev. of standardized values: {np.std(data_standardized): .2f}')

>>>

standardized values: [-1.414213562373095, -0.7071067811865475, 0.0, 0.7071067811865475, 1.414213562373095]
mean of standardized values: 0.0
std. dev. of standardized values: 1.00

As we can see, the standardized data have a mean of 0 and a standard deviation of 1, as we wanted. The good news is that we can use the scikit-learn library to standardize the features, and we’ll do it in a moment.

Feature scaling is an important thing to do when working on an ML problem, for a simple reason:

  • If we perform exploratory data analysis with features that are not scaled, when calculating the mean values (for example, during the calculation of the correlation coefficient) we’ll get numbers that are very different from one another. If we take a look at the statistics we got above when we invoked the df.describe() method, we can see that, for each column, we get a very different value of the mean. If we scale or standardize the features, instead, we’ll get values in comparable ranges (around 0, 1, and -1): and this helps us mathematically.

Now, this dataset has some NaN values. We won’t show them for brevity (check it on your own), but we’ll remove them. Also, we’ll calculate the correlation matrix:

import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# Drop NaNs from dataframe
df = df.dropna(axis=0)

# Mask for the upper triangle of the correlation matrix
mask = np.triu(np.ones_like(df.corr()))

# Heat map for correlation coefficient
sns.heatmap(df.corr(), annot=True, fmt="0.1", mask=mask)

The correlation matrix for our data frame. Image by Creator.

So, with np.triu(np.ones_like(df.corr())) we’ve created a mask that is useful to display a triangular correlation matrix, which is more readable (especially when we have many more features than in this case).

So, there’s a moderate correlation (0.6) between Total Bsmt SF and SalePrice, a fairly high correlation (0.7) between Gr Liv Area and SalePrice, and a high correlation (0.8) between Overall Qual and SalePrice. Also, there’s a moderate correlation between Overall Qual and Gr Liv Area (0.6) and between Overall Qual and Total Bsmt SF (0.5).

Here there’s no multicollinearity, so no features are highly correlated with each other (so, our features satisfy hypothesis n°5 listed above). If we had found some highly correlated features, we could delete one of them, since highly correlated features carry essentially the same information.

Finally, we subdivide the data frame df into X (the features) and y (the label) and scale the features:

from sklearn.preprocessing import StandardScaler

# Define the features
X = df.iloc[:,:-1]

# Define the label
y = df.iloc[:,-1]

# Scale the features
scaler = StandardScaler() # Instantiate the scaler
X = scaler.fit_transform(X) # Fit the scaler and transform the features

Fitting the linear regression model

Now we have to split X and y into a training and a test set, fit the Linear Regression model on the training set, and then calculate R² for both sets:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Fit the LR model
reg = LinearRegression().fit(X_train, y_train)

# Calculate R^2
coeff_det_train = reg.score(X_train, y_train)
coeff_det_test = reg.score(X_test, y_test)

# Print metrics
print(f"R^2 for training set: {coeff_det_train:.2f}")
print(f"R^2 for test set: {coeff_det_test:.2f}")

>>>

R^2 for training set: 0.77
R^2 for test set: 0.73


Note that:

1) your results may be slightly different, due to the random nature of the train/test split;

2) here we can see generalization in action: we fitted the Linear Regression model to the training set with reg = LinearRegression().fit(X_train, y_train). Then, we calculated R² on the training and the test set with coeff_det_train = reg.score(X_train, y_train) and coeff_det_test = reg.score(X_test, y_test).

In other words: we do not fit the model to the test set. We fit it to the training set and we calculate the scores and predictions (see the next snippet of code with the KDE) on both sets, to see how our model generalizes to new, unseen data (the data of the test set).

So we get an R² of 0.77 on the training set and 0.73 on the test set, which are quite good values, suggesting the Linear model is a good one to solve this ML problem.
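As a side note, the snippet above imports sklearn’s metrics module without using it; here’s a hedged sketch of how two common regression metrics (MAE and RMSE) could be computed for the same fitted model, assuming the variables from the previous snippet (reg, X_test, y_test) are still in memory:

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Predictions on the test set (reusing the fitted model from above)
y_test_pred = reg.predict(X_test)

# Mean Absolute Error and Root Mean Squared Error, in the label's unit (USD)
mae = mean_absolute_error(y_test, y_test_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))

print(f"MAE:  {mae:.0f}")
print(f"RMSE: {rmse:.0f}")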

Let’s see the KDE plots for both sets:

# Calculate predictions
y_train_pred = reg.predict(X_train) # train set
y_test_pred = reg.predict(X_test) # test set

# KDE train set
ax = sns.kdeplot(y_train, color='r', label='Actual Values') #actual values
sns.kdeplot(y_train_pred, color='b', label='Predicted Values', ax=ax) #predicted values

# Show title
plt.title('Actual vs Predicted values')
# Show legend
plt.legend()

KDE for the training set. Image by Creator.
# KDE test set
ax = sns.kdeplot(y_test, color='r', label='Actual Values') #actual values
sns.kdeplot(y_test_pred, color='b', label='Predicted Values', ax=ax) #predicted values

# Show title
plt.title('Actual vs Predicted values')
# Show legend
plt.legend()

KDE for the test set. Image by Creator.

Regardless of the fact that we’ve obtained an R² of 0.73 on the test set, which is good (but remember: the higher, the better), this plot shows us that the linear model is indeed a good model for solving this ML problem. This is why I love the KDE plot: it’s a very powerful tool, as we can see.

Also, this shows why we shouldn’t rely on only one method to validate our ML model: a combination of one analytical method and one graphical one generally gives us the right insights to decide whether to change our ML model or not. In this case, the Linear Regression model is a good fit for making predictions.

Conclusions

I hope you’ll find this article useful. I know it’s very long, but I wanted to give you all the knowledge you need on this topic, so that you can come back to it whenever you need it most.

Some of the things we’ve discussed here are general topics, while others are specific to the Linear Regression model. Let’s summarize them:

  • The definition of regression analysis is, of course, a general definition.
  • Correlation is generally referred to the Linear model. In fact, as we said before, correlation is the tendency of two variables to be linearly related. However, there are ways to define non-linear correlations, but we leave them for other articles (just know that they exist).
  • We’ve discussed the Simple and the Multiple Linear Regression models with their assumptions (the assumptions apply to both models).
  • When talking about how to find the line that best fits the data, we’ve referred to the article “Mastering the Art of Regression Analysis: 5 Key Metrics Every Data Scientist Should Know”. There, we find all the metrics we need to know to solve a regression analysis. So, this is a general topic that applies to any regression model, including the Linear one, of course.
  • We’ve shown three methods to validate our ML models: 1) the residual analysis plot, which applies to Linear Regression models; 2) the actual vs. predicted values plot, which can be applied to Linear and Polynomial models; 3) the KDE plot, which can be applied to any ML model, even in the case of a classification problem.

Finally, I want to remind you that we’ve spent a few lines stressing the fact that we can avoid using p-values to test the hypotheses of our ML models. I’ll write an article on this topic very soon but, as you can see, the KDE has shown us that our Linear model is good for solving this ML problem, and we haven’t validated our hypotheses with p-values.

Throughout this article, we’ve used some plots. You can clone this repo I’ve created so that you can import the code and use it to easily plot the graphs. If you have some difficulties, you’ll find usage examples in my projects on GitHub. If you have any other difficulties, you can contact me and I’ll help you.

  • Subscribe to my newsletter to get more on Python & Data Science.
  • Found this article useful? Join Medium through my referral link: unlock all the content on Medium for $5/month (with no additional fee).
  • Find/contact me here.
