You're analyzing a small dataset:

X = [15, 8, 13, 7, 7, 12, 15, 6, 8, 9]

You want to calculate some summary statistics to get an idea of the distribution of this data, so you use numpy to compute the mean and variance.
import numpy as np
X = [15, 8, 13, 7, 7, 12, 15, 6, 8, 9]
mean = np.mean(X)
var = np.var(X)
print(f"Mean={mean:.2f}, Variance={var:.2f}")
Your output looks like this:
Mean=10.00, Variance=10.60
Great! Now you have an idea of the distribution of your data. But then a colleague comes along and tells you that they also calculated summary statistics on the same dataset, using the following code:
import pandas as pd
X = pd.Series([15, 8, 13, 7, 7, 12, 15, 6, 8, 9])
mean = X.mean()
var = X.var()
print(f"Mean={mean:.2f}, Variance={var:.2f}")
Their output looks like this:
Mean=10.00, Variance=11.78
The means are identical, but the variances are different! What gives?
This discrepancy arises because numpy and pandas use different default formulas for calculating the variance of an array. This article will mathematically define the two variances, explain why they differ, and show how to use either formula in several numerical libraries.
Two Definitions
There are two standard ways to calculate the variance, each meant for a different purpose. It comes down to whether you're calculating the variance of an entire population (the whole group you're studying) or just a sample (a smaller subset of that population you actually have data for).
The population variance, \(\sigma^2\), is defined as:
\[\sigma^2 = \frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N}\]
While the sample variance, \(s^2\), is defined as:
\[s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}\]
Notice the two key differences between these equations:
- In the numerator's sum, \(\sigma^2\) is calculated using the population mean, \(\mu\), while \(s^2\) is calculated using the sample mean, \(\bar{x}\).
- In the denominator, \(\sigma^2\) divides by the total population size \(N\), while \(s^2\) divides by the sample size minus one, \(n - 1\).
It should be noted that the distinction between these two definitions matters most for small sample sizes. As \(n\) grows, the difference between dividing by \(n\) and dividing by \(n - 1\) becomes less and less important.
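You can see this shrinking gap directly: the ratio of the two formulas is exactly \((n-1)/n\) regardless of the data, so it approaches 1 as \(n\) grows. A quick sketch (the normal draws here are just arbitrary illustration data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Compare the two formulas at increasing sample sizes.
for n in [5, 50, 5000]:
    x = rng.normal(size=n)
    pop = np.var(x)            # divides by n
    samp = np.var(x, ddof=1)   # divides by n - 1
    print(f"n={n:5d}  ratio={pop / samp:.4f}")  # ratio is exactly (n-1)/n
```

This prints ratios of 0.8000, 0.9800, and 0.9998: a 20% gap at \(n = 5\), but a negligible one at \(n = 5000\).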
Why Are They Different?
When calculating the population variance, it is assumed that you have all the data. You know the exact center (the population mean \(\mu\)) and exactly how far every point is from that center. Dividing by the total number of data points \(N\) gives the true, exact average of those squared differences.
However, when calculating the sample variance, it is not assumed that you have all the data, so you do not have the true population mean \(\mu\). Instead, you only have an estimate of \(\mu\): the sample mean \(\bar{x}\). It turns out that using the sample mean in place of the true population mean tends to underestimate the true population variance on average.
This happens because the sample mean is calculated directly from the sample data, meaning it sits at the exact mathematical center of that specific sample. As a result, the sum of squared differences from the sample mean is always at least as small as the sum of squared differences from the true population mean, producing an artificially smaller numerator.
To correct for this underestimation, we apply what is known as Bessel's correction (named for the German mathematician Friedrich Wilhelm Bessel): we divide not by \(n\), but by the slightly smaller \(n - 1\). Dividing by a smaller number makes the final variance slightly larger, counteracting the bias.
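This bias can be demonstrated with a small simulation: draw many small samples from a distribution whose true variance is known, and average each estimator across the samples (the specific distribution and sample size here are just illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(42)
true_var = 4.0  # population variance of N(0, 2^2)

# Draw 100,000 samples of size 5 and average each estimator.
samples = rng.normal(0, 2, size=(100_000, 5))
biased = np.var(samples, axis=1).mean()             # divides by n
unbiased = np.var(samples, axis=1, ddof=1).mean()   # divides by n - 1

print(f"True variance:      {true_var}")
print(f"Biased estimator:   {biased:.2f}")    # ~3.2, i.e. 4 * (n-1)/n
print(f"Unbiased estimator: {unbiased:.2f}")  # ~4.0
```

On average, dividing by \(n\) lands near \(\sigma^2 (n-1)/n = 3.2\), while Bessel's correction recovers the true value of 4.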
Degrees of Freedom
So why divide by \(n - 1\), and not by some other denominator smaller than \(n\) that would also increase the final variance? That comes down to a concept called degrees of freedom.
The degrees of freedom refers to the number of independent values in a calculation that are free to vary. For example, imagine you have a set of three numbers, \(x_1, x_2, x_3\). You do not know their values, but you do know that their sample mean is \(\bar{x} = 10\).
- The first number \(x_1\) can be anything (let's say 8)
- The second number \(x_2\) can also be anything (let's say 15)
- Because the mean must be 10, \(x_3\) is not free to vary: it must be the one number such that \((8 + 15 + x_3)/3 = 10\), which in this case is 7.
So in this example, even though there are 3 numbers, there are only two degrees of freedom, as enforcing the sample mean removes one number's ability to vary freely.
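The three-number example above is easy to verify in code: once the mean is fixed, the last value is fully determined by the others.

```python
# With a fixed sample mean, the last value is determined by the rest.
target_mean = 10
known = [8, 15]   # the two values that were free to vary
n = 3

# The values must sum to n * mean, which pins down the third number.
x3 = n * target_mean - sum(known)
print(x3)  # 7
```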
In the context of variance, before making any calculations, we start with \(n\) degrees of freedom (corresponding to our \(n\) data points). Calculating the sample mean uses up one degree of freedom, so by the time the sample variance is calculated, there are \(n - 1\) degrees of freedom left to work with, which is why \(n - 1\) appears in the denominator.
Library Defaults and How to Align Them
Now that we understand the mathematics, we can finally solve the mystery from the beginning of the article! numpy and pandas gave different results because they default to different variance formulas.
Many numerical libraries control this using a parameter called ddof, which stands for Delta Degrees of Freedom. This is the value subtracted from the total number of observations in the denominator.
- Setting ddof=0 divides the sum by \(n\), calculating the population variance.
- Setting ddof=1 divides the sum by \(n - 1\), calculating the sample variance.
The same parameter applies when calculating the standard deviation, which is just the square root of the variance.
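As a quick sanity check, matching the ddof values reconciles the two results from the start of the article:

```python
import numpy as np
import pandas as pd

X = [15, 8, 13, 7, 7, 12, 15, 6, 8, 9]
s = pd.Series(X)

# With matching ddof, numpy and pandas agree.
print(f"{np.var(X, ddof=1):.2f}  {s.var():.2f}")     # 11.78  11.78
print(f"{np.var(X):.2f}  {s.var(ddof=0):.2f}")       # 10.60  10.60
```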
Here's a breakdown of how different popular libraries handle these defaults and how you can override them:
numpy
By default, numpy assumes you're calculating the population variance (ddof=0). If you're working with a sample and want to apply Bessel's correction, you must explicitly pass ddof=1.
import numpy as np
X = [15, 8, 13, 7, 7, 12, 15, 6, 8, 9]
# Sample variance and standard deviation
np.var(X, ddof=1)
np.std(X, ddof=1)
# Population variance and standard deviation (Default)
np.var(X)
np.std(X)
pandas
By default, pandas takes the opposite approach. It assumes your data is a sample and calculates the sample variance (ddof=1). To calculate the population variance instead, you must pass ddof=0.
import pandas as pd
X = pd.Series([15, 8, 13, 7, 7, 12, 15, 6, 8, 9])
# Sample variance and standard deviation (Default)
X.var()
X.std()
# Population variance and standard deviation
X.var(ddof=0)
X.std(ddof=0)
Python’s Built-in statistics Module
Python's standard library doesn't use a ddof parameter. Instead, it provides explicitly named functions, so there is no ambiguity about which formula is being used.
import statistics
X = [15, 8, 13, 7, 7, 12, 15, 6, 8, 9]
# Sample variance and standard deviation
statistics.variance(X)
statistics.stdev(X)
# Population variance and standard deviation
statistics.pvariance(X)
statistics.pstdev(X)
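These named functions line up with the ddof settings in numpy, which a quick cross-check confirms:

```python
import statistics

import numpy as np

X = [15, 8, 13, 7, 7, 12, 15, 6, 8, 9]

# statistics.variance matches ddof=1; pvariance matches ddof=0.
assert abs(statistics.variance(X) - np.var(X, ddof=1)) < 1e-12
assert abs(statistics.pvariance(X) - np.var(X)) < 1e-12
print("statistics and numpy agree")
```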
R
In R, the standard var() and sd() functions calculate the sample variance and sample standard deviation by default. Unlike the Python libraries, R doesn't have a built-in argument to easily switch to the population formula. To calculate the population variance, you must manually multiply the sample variance by \((n-1)/n\).
X <- c(15, 8, 13, 7, 7, 12, 15, 6, 8, 9)
n <- length(X)
# Sample variance and standard deviation (Default)
var(X)
sd(X)
# Population variance and standard deviation
var(X) * ((n - 1) / n)
sd(X) * sqrt((n - 1) / n)
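The same rescaling works in any library, since it follows directly from the two formulas: multiplying the sample variance by \((n-1)/n\) replaces the \(n-1\) in the denominator with \(n\). Verified in Python on the article's dataset:

```python
import numpy as np

X = [15, 8, 13, 7, 7, 12, 15, 6, 8, 9]
n = len(X)

# Rescaling the sample variance recovers the population variance,
# mirroring the manual correction used in R above.
rescaled = np.var(X, ddof=1) * (n - 1) / n
print(f"{rescaled:.2f}")  # 10.60
```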
Conclusion
This article explored a frustrating yet often unnoticed trait of different statistical programming languages and libraries: they choose different default definitions of variance and standard deviation. An example was given where, for the same input array, numpy and pandas return different values for the variance by default.
This came down to the difference between how variance should be calculated for the entire statistical population being studied versus how it should be calculated from only a sample of that population, with different libraries making different choices about the default. Finally, it was shown that although each library has its default, all of them can compute both kinds of variance via a ddof argument, a differently named function, or a simple mathematical transformation.
Thanks for reading!
