The Pearson Correlation Coefficient, Explained Simply


Before we construct a regression model, which involves fitting a straight line to the data to predict future values, we first visualize our data to get an idea of what it looks like and to spot patterns and relationships.

The data may appear to show a positive linear relationship, but we confirm this by calculating the Pearson correlation coefficient, which tells us how close the data is to being linear.

Let's use a simple Salary dataset to understand the Pearson correlation coefficient.

The dataset consists of two columns:

YearsExperience: the number of years a person has been working

Salary (target): the corresponding annual salary in US dollars

Now we want to build a model that predicts salary based on years of experience.

We can see that this can be done with a simple linear regression model, because we have just one predictor and a continuous target variable.

But can we directly apply the simple linear regression algorithm just like that?

No.

Linear regression comes with several assumptions, and one of them is linearity.

We need to check for linearity, and for that, we calculate the correlation coefficient.


But what’s linearity?

Let’s understand this with an example.

Image by Author

From the table above, we can see that for every one-year increase in experience, there is a $5,000 increase in salary.

The change is constant, and when we plot these values, we get a straight line.

This kind of relationship is called a linear relationship.
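To make this concrete, here is a tiny sketch with hypothetical numbers (mirroring the pattern in the table above, not taken from the dataset) showing that a constant change per year is exactly what makes the relationship linear:

import pandas as pd

# Hypothetical illustration: salary rises by a constant $5,000 per extra year of experience
toy = pd.DataFrame({
    "YearsExperience": [1, 2, 3, 4, 5],
    "Salary": [30000, 35000, 40000, 45000, 50000],
})

# A constant difference between consecutive salaries is the signature of a linear relationship
print(toy["Salary"].diff())                  # 5000.0 for every step after the first
print(toy["Salary"].diff().nunique() == 1)   # True -> the change is constant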


Now, in simple linear regression, we already know that we fit a regression line to the data to predict future values, and this is effective only when the data has a linear relationship.

So, we need to check for linearity in our data.

For that, let's calculate the correlation coefficient.

Before that, we first visualize the data using a scatter plot to get an idea of the relationship between the two variables.

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Load the dataset
df = pd.read_csv("C:/Salary_dataset.csv")

# Set plot style
sns.set(style="whitegrid")

# Create scatter plot
plt.figure(figsize=(8, 5))
sns.scatterplot(x='YearsExperience', y='Salary', data=df, color='blue', s=60)

plt.title("Scatter Plot: Years of Experience vs Salary")
plt.xlabel("Years of Experience")
plt.ylabel("Salary (USD)")
plt.tight_layout()
plt.show()
Image by Author

We can observe from the scatter plot that as years of experience increases, salary also tends to increase.

Although the points don't form a perfect straight line, the relationship appears to be strong and linear.
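If you also want a quick visual hint of the trend at this stage, seaborn's regplot can overlay a fitted line on the same scatter plot (an optional extra, not part of the original workflow; it reuses the same file path as above):

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

df = pd.read_csv("C:/Salary_dataset.csv")

# Scatter plot with a fitted straight line overlaid for a quick visual check of linearity
plt.figure(figsize=(8, 5))
sns.regplot(x="YearsExperience", y="Salary", data=df, ci=None, line_kws={"color": "red"})
plt.title("Years of Experience vs Salary (with fitted line)")
plt.tight_layout()
plt.show()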

To confirm this, let's now calculate the Pearson correlation coefficient.

import pandas as pd

# Load the dataset
df = pd.read_csv("C:/Salary_dataset.csv")

# Calculate Pearson correlation
pearson_corr = df['YearsExperience'].corr(df['Salary'], method='pearson')

print(f"Pearson correlation coefficient: {pearson_corr:.4f}")

The Pearson correlation coefficient is 0.9782.
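As a cross-check, the same coefficient can be computed with SciPy, which also returns a p-value for the null hypothesis of no correlation (this assumes SciPy is installed; it is not needed anywhere else in this post):

import pandas as pd
from scipy import stats

df = pd.read_csv("C:/Salary_dataset.csv")

# pearsonr returns the correlation coefficient and a two-sided p-value
r, p_value = stats.pearsonr(df["YearsExperience"], df["Salary"])
print(f"r = {r:.4f}, p-value = {p_value:.2e}")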

The correlation coefficient always falls between -1 and +1.

If it is:

  • near +1: strong positive linear relationship
  • near 0: no linear relationship
  • near -1: strong negative linear relationship

Here, we got a correlation coefficient of 0.9782, which means the data mostly follows a straight-line pattern and there is a very strong positive relationship between the variables.

From this, we can conclude that simple linear regression is well suited for modeling this relationship.
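To build intuition for what values of r near +1, 0, and -1 look like, here is a small synthetic sketch (the generated data is made up purely for illustration):

import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 200)

cases = {
    "strong positive": 3 * x + rng.normal(0, 1, 200),    # y rises with x -> r near +1
    "no linear relation": rng.normal(0, 1, 200),          # y unrelated to x -> r near 0
    "strong negative": -3 * x + rng.normal(0, 1, 200),    # y falls as x rises -> r near -1
}

for name, y in cases.items():
    r = np.corrcoef(x, y)[0, 1]
    print(f"{name:>20}: r = {r:+.3f}")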


But how do we calculate the Pearson correlation coefficient?

Let's consider a 10-point sample from our dataset.

Image by Author

Now, let’s calculate the Pearson correlation coefficient.

When both X and Y increase together, the correlation is said to be positive. On the other hand, if one increases while the other decreases, the correlation is negative.

First, let's calculate the variance for each variable.

Variance helps us understand how far the values are spread from the mean.

We'll start by calculating the variance for X (Years of Experience).
To do that, we first need to compute the mean of X.

\[
\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i
\]

\[
= \frac{1.2 + 3.3 + 3.8 + 4.1 + 5.0 + 5.4 + 8.3 + 8.8 + 9.7 + 10.4}{10}
= \frac{70.0}{10}
= 7.0
\]

Next, we subtract the mean from each value and then square the result, so that negative and positive deviations don't cancel each other out.

Image by Author

We have calculated the squared deviation of each value from the mean.
Now we can find the variance of X by taking the average of these squared deviations.

\[
\text{Sample Variance of } X = \frac{1}{n - 1} \sum_{i=1}^{n} (X_i - \bar{X})^2
\]

\[
= \frac{33.64 + 13.69 + 10.24 + 8.41 + 4.00 + 2.56 + 1.69 + 3.24 + 7.29 + 11.56}{10 - 1}
= \frac{96.32}{9} \approx 10.70
\]

Here we divide by n - 1 because we are dealing with sample data, and using n - 1 gives us an unbiased estimate of the variance.

The sample variance of X is 10.70, which tells us that the average squared deviation of Years of Experience from its mean is 10.70 (in squared years).

Since variance is in squared units, we take the square root to interpret it in the same unit as the original data.

This is called the standard deviation.

\[
s_X = \sqrt{\text{Sample Variance of } X} = \sqrt{10.70} \approx 3.27
\]

The standard deviation of X is 3.27, which means the values of Years of Experience typically fall about 3.27 years above or below the mean.
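The same quantities can be computed in a few lines of pandas. The sketch below assumes, purely for illustration, that the 10-point sample is the first 10 rows of the loaded dataset (the exact rows used above are not specified), so the printed numbers may differ slightly from the hand-rounded values:

import pandas as pd

df = pd.read_csv("C:/Salary_dataset.csv")
sample = df.head(10)            # assumed 10-point sample for illustration
x = sample["YearsExperience"]

mean_x = x.mean()
var_x = x.var(ddof=1)           # sample variance: divide by n - 1
std_x = x.std(ddof=1)           # sample standard deviation

print(f"mean of X: {mean_x:.2f}")
print(f"sample variance of X: {var_x:.2f}")
print(f"sample standard deviation of X: {std_x:.2f}")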


In the same way, we calculate the variance and standard deviation of Y.

\[
\bar{Y} = \frac{1}{n} \sum_{i=1}^{n} Y_i
\]

\[
= \frac{39344 + 64446 + 57190 + 56958 + 67939 + 83089 + 113813 + 109432 + 112636 + 122392}{10}
= \frac{827239}{10}
= 82,\!723.90
\]

\[
\text{Sample Variance of } Y = \frac{1}{n - 1} \sum_{i=1}^{n} (Y_i - \bar{Y})^2
= \frac{7,\!898,\!632,\!198.90}{9} \approx 877,\!625,\!799.88
\]

\[
s_Y = \sqrt{877,\!625,\!799.88} \approx 29,\!624.75
\]

We have now calculated the variance and standard deviation of X and Y.

The next step is to calculate the covariance between X and Y.

We already have the means of X and Y, as well as the deviations of each value from its respective mean.

Now, we multiply these deviations to see how the two variables vary together.

Image by Author

By multiplying these deviations, we are trying to capture how X and Y move together.

If both X and Y are above their means, the deviations are both positive, so the product is positive.

If both X and Y are below their means, the deviations are both negative, and since a negative times a negative is positive, the product is again positive.

If one is above the mean and the other is below, the product is negative.

This product tells us whether the two variables tend to move in the same direction (both increasing or both decreasing) or in opposite directions.

Using the sum of these products of deviations, we now calculate the sample covariance.

\[
\text{Sample Covariance} = \frac{1}{n - 1} \sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})
\]

\[
= \frac{808771.5}{10 - 1}
= \frac{808771.5}{9} = 89,\!863.5
\]

We got a sample covariance of 89,863.5. The positive sign indicates that as experience increases, salary also tends to increase.

But the magnitude of the covariance depends on the units of the variables (years × dollars), so it is not directly interpretable.

On its own, this value only tells us the direction of the relationship.
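Here is how the covariance step looks in code. It follows the same formula: sum the products of the deviations and divide by n - 1 (again assuming, for illustration, the first 10 rows as the sample):

import pandas as pd

df = pd.read_csv("C:/Salary_dataset.csv")
sample = df.head(10)            # assumed 10-point sample for illustration
x, y = sample["YearsExperience"], sample["Salary"]

# Manual sample covariance: sum of deviation products divided by n - 1
dev_products = (x - x.mean()) * (y - y.mean())
cov_manual = dev_products.sum() / (len(sample) - 1)

# pandas computes the same quantity directly
cov_pandas = x.cov(y)

print(f"manual covariance: {cov_manual:.1f}")
print(f"pandas covariance: {cov_pandas:.1f}")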

Now we divide the covariance by the product of the standard deviations of X and Y.

This gives us the Pearson correlation coefficient, which can be thought of as a normalized version of the covariance.

Since the standard deviation of X has units of years and that of Y has units of dollars, their product has units of years × dollars.

These units cancel out when we divide, so the Pearson correlation coefficient is unitless.

But the main reason we divide the covariance by the standard deviations is to normalize it, so the result is easier to interpret and can be compared across different datasets.

\[
r = \frac{\text{Cov}(X, Y)}{s_X \cdot s_Y}
= \frac{89,\!863.5}{3.27 \times 29,\!624.75}
= \frac{89,\!863.5}{96,\!872.93} \approx 0.9276
\]

So, the Pearson correlation coefficient (r) we calculated is approximately 0.9276.

This tells us there’s a very strong positive linear relationship between years of experience and salary.

This is how we find the Pearson correlation coefficient.

The formula for the Pearson correlation coefficient is:

\[
r = \frac{\text{Cov}(X, Y)}{s_X \cdot s_Y}
= \frac{\frac{1}{n - 1} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}
{\sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (X_i - \bar{X})^2} \cdot \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (Y_i - \bar{Y})^2}}
\]

\[
= \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}
{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2} \cdot \sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}}
\]


We need to make sure certain conditions are met before using the Pearson correlation coefficient (a quick way to sanity-check them is sketched after this list):

  • The relationship between the variables should be linear.
  • Both variables should be continuous and numeric.
  • There should be no strong outliers.
  • The data should be approximately normally distributed.
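These conditions can be checked quickly. The sketch below uses the IQR rule for outliers and SciPy's Shapiro-Wilk test for normality; this is one reasonable way to check, not the only one, and it assumes SciPy is available:

import pandas as pd
from scipy import stats

df = pd.read_csv("C:/Salary_dataset.csv")

for col in ["YearsExperience", "Salary"]:
    s = df[col]

    # IQR rule: flag points more than 1.5 * IQR outside the quartiles
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]

    # Shapiro-Wilk test: a small p-value suggests the data is not normally distributed
    stat, p = stats.shapiro(s)

    print(f"{col}: {len(outliers)} potential outliers, Shapiro-Wilk p = {p:.3f}")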

Dataset

The dataset used in this blog is the Salary dataset.

It is publicly available on Kaggle and is licensed under the Creative Commons Zero (CC0 Public Domain) license. This means it can be freely used, modified, and shared for both non-commercial and commercial purposes without restriction.


I hope this gave you a clear understanding of how the Pearson correlation coefficient is calculated and when it is used.

Thanks for reading!
