A Tale of Two Variances: Why NumPy and Pandas Give Different Answers


Imagine you’re analyzing a small dataset:

X = [15, 8, 13, 7, 7, 12, 15, 6, 8, 9]

You want some summary statistics to get a sense of the distribution of this data, so you use numpy to calculate the mean and variance.

import numpy as np

X = [15, 8, 13, 7, 7, 12, 15, 6, 8, 9]
mean = np.mean(X)
var = np.var(X)

print(f"Mean={mean:.2f}, Variance={var:.2f}")

Your output looks like this:

Mean=10.00, Variance=10.60

Great! Now you have an idea of the distribution of your data. But then a colleague comes along and tells you that they also calculated summary statistics on this same dataset, using the following code:

import pandas as pd

X = pd.Series([15, 8, 13, 7, 7, 12, 15, 6, 8, 9])
mean = X.mean()
var = X.var()

print(f"Mean={mean:.2f}, Variance={var:.2f}")

Their output looks like this:

Mean=10.00, Variance=11.78

The means are identical, but the variances are different! What gives?

This discrepancy arises because numpy and pandas use different default formulas for calculating the variance of an array. This article will mathematically define the two variances, explain why they differ, and show how to use either formula in several numerical libraries.


Two Definitions

There are two standard ways to calculate the variance, each meant for a different purpose. It comes down to whether you’re calculating the variance of an entire population (the whole group you’re studying) or just a sample (a smaller subset of that population you actually have data for).

The population variance, σ², is defined as:

\[ \sigma^2 = \frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N} \]

While the sample variance, s², is defined as:

\[ s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1} \]

Here, xᵢ denotes an individual data point, N is the population size, n is the sample size, μ is the population mean, and x̄ is the sample mean.
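To make these formulas concrete, here is a short sketch (plain Python, no libraries) that applies both definitions to the dataset from the introduction:

```python
# Apply both variance formulas by hand to the example dataset.
X = [15, 8, 13, 7, 7, 12, 15, 6, 8, 9]
n = len(X)

mean = sum(X) / n                               # 10.0
squared_devs = sum((x - mean) ** 2 for x in X)  # sum of squared deviations: 106.0

pop_var = squared_devs / n         # population formula: divide by n
samp_var = squared_devs / (n - 1)  # sample formula: divide by n - 1

print(f"Population variance: {pop_var:.2f}")  # 10.60
print(f"Sample variance:     {samp_var:.2f}")  # 11.78
```

These two values are exactly the numbers numpy and pandas printed above.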

Notice the two key differences between these equations:

  1. In the numerator’s sum, σ² is calculated using the population mean, μ, while s² is calculated using the sample mean, x̄.
  2. In the denominator, σ² divides by the whole population size N, while s² divides by the sample size minus one, n−1.

It should be noted that the distinction between these two definitions matters most for small sample sizes. As n grows, the difference between n and n−1 becomes less and less important.
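A quick sketch of this effect: for the same data, the sample variance is exactly n/(n−1) times the population variance, a ratio that approaches 1 as n grows.

```python
import numpy as np

rng = np.random.default_rng(0)
for n in (5, 50, 5000):
    x = rng.normal(size=n)
    # ddof=1 divides by n-1, ddof=0 divides by n, so the ratio is n/(n-1)
    ratio = np.var(x, ddof=1) / np.var(x, ddof=0)
    print(f"n={n:5d}  sample/population ratio = {ratio:.4f}")
# Ratios: 1.2500, 1.0204, 1.0002 — the gap vanishes as n grows
```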


Why Are They Different?

When calculating the population variance, it’s assumed that you have all the data. You know the exact center (the population mean μ) and exactly how far every point is from that center. Dividing by the total number of data points N gives the true, exact average of those squared differences.

However, when calculating the sample variance, it isn’t assumed that you have all the data, so you do not know the true population mean μ. Instead, you only have an estimate of μ: the sample mean x̄. It turns out that using the sample mean in place of the true population mean tends to underestimate the true population variance on average.

This happens because the sample mean is calculated directly from the sample data, meaning it sits at the exact mathematical center of that specific sample. In fact, the sample mean is the value that minimizes the sum of squared differences, so the data points in your sample will always be at least as close to their own sample mean as they are to the true population mean, resulting in an artificially smaller sum of squared differences.

To correct for this underestimation, we apply what is known as Bessel’s correction (named for the German mathematician Friedrich Wilhelm Bessel): we divide not by n but by the slightly smaller n−1, since dividing by a smaller number makes the final variance slightly larger.
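A small simulation illustrates the bias. The population parameters here (normal with standard deviation 2, samples of size 5) are arbitrary choices for the demo: averaged over many samples, dividing by n underestimates the true variance, while dividing by n − 1 lands close to it.

```python
import numpy as np

rng = np.random.default_rng(42)
true_var = 4.0  # population: normal with sigma = 2, so variance = 4

# Draw 100,000 samples of size 5 and compute each sample's variance both ways.
samples = rng.normal(loc=0.0, scale=2.0, size=(100_000, 5))
biased = np.var(samples, axis=1, ddof=0).mean()    # divide by n
unbiased = np.var(samples, axis=1, ddof=1).mean()  # divide by n - 1

print(f"True variance:           {true_var:.3f}")
print(f"Mean of n-divisor:       {biased:.3f}")    # ~3.2, clearly too low
print(f"Mean of (n-1)-divisor:   {unbiased:.3f}")  # ~4.0
```

Note that the n-divisor average comes out near 4 × (n−1)/n = 3.2, exactly the shrinkage Bessel’s correction undoes.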

Degrees of Freedom

So why divide by n−1 and not n−2 or n−3 or some other correction that also increases the final variance? That comes down to a concept called degrees of freedom.

The degrees of freedom refers to the number of independent values in a calculation that are free to vary. For example, imagine you have a set of three numbers, (x₁, x₂, x₃). You do not know their values, but you do know that their sample mean is x̄ = 10.

  • The first number x₁ could be anything (let’s say 8)
  • The second number x₂ could also be anything (let’s say 15)
  • Because the mean must be 10, x₃ is not free to vary: it must be the one number that makes x̄ = 10, which in this case is 7.

So in this example, even though there are 3 numbers, there are only two degrees of freedom, because enforcing the sample mean removes the ability of one of them to vary freely.
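The constraint can be checked directly: once the mean is fixed, the third value is fully determined by the other two.

```python
# Fixing the sample mean at 10 pins down the third value.
x1, x2, target_mean = 8, 15, 10
x3 = 3 * target_mean - (x1 + x2)  # solve (x1 + x2 + x3) / 3 == target_mean
print(x3)                  # 7
print((x1 + x2 + x3) / 3)  # 10.0
```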

In the context of variance, before making any calculations, we start with n degrees of freedom (corresponding to our n data points). Calculating the sample mean x̄ uses up one degree of freedom, so by the time the sample variance is calculated, there are n−1 degrees of freedom left to work with, which is why n−1 appears in the denominator.


Library Defaults and How to Align Them

Now that we understand the math, we can finally solve the mystery from the start of the article: numpy and pandas gave different results because they default to different variance formulas.

Many numerical libraries control this using a parameter called ddof, which stands for Delta Degrees of Freedom. This is the value subtracted from the number of observations in the denominator.

  • Setting ddof=0 divides by n, calculating the population variance.
  • Setting ddof=1 divides by n−1, calculating the sample variance.

The same parameter applies when calculating the standard deviation, which is simply the square root of the variance.
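Putting this together, the opening mystery disappears as soon as both libraries are given the same ddof (a sketch using both libraries on the same data):

```python
import numpy as np
import pandas as pd

X = [15, 8, 13, 7, 7, 12, 15, 6, 8, 9]
s = pd.Series(X)

print(np.var(X))  # 10.6   (numpy default: ddof=0, population)
print(s.var())    # ~11.78 (pandas default: ddof=1, sample)

# Aligning ddof makes the libraries agree:
print(np.isclose(np.var(X, ddof=1), s.var()))  # True
print(np.isclose(np.var(X), s.var(ddof=0)))    # True
```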

Here’s a breakdown of how several popular libraries handle these defaults and how you can override them:

numpy

By default, numpy assumes you’re calculating the population variance (ddof=0). If you are working with a sample and want to apply Bessel’s correction, you must explicitly pass ddof=1.

import numpy as np
X = [15, 8, 13, 7, 7, 12, 15, 6, 8, 9]

# Sample variance and standard deviation
np.var(X, ddof=1)
np.std(X, ddof=1)

# Population variance and standard deviation (Default)
np.var(X)
np.std(X)

pandas

By default, pandas takes the opposite approach: it assumes your data is a sample and calculates the sample variance (ddof=1). To calculate the population variance instead, you must pass ddof=0.

import pandas as pd
X = pd.Series([15, 8, 13, 7, 7, 12, 15, 6, 8, 9])

# Sample variance and standard deviation (Default)
X.var()
X.std()          

# Population variance and standard deviation 
X.var(ddof=0)
X.std(ddof=0)

Python’s Built-in statistics Module

Python’s standard library doesn’t use a ddof parameter. Instead, it provides explicitly named functions, so there is no ambiguity about which formula is being used.

import statistics
X = [15, 8, 13, 7, 7, 12, 15, 6, 8, 9]

# Sample variance and standard deviation
statistics.variance(X)
statistics.stdev(X)  

# Population variance and standard deviation
statistics.pvariance(X)
statistics.pstdev(X)

R

In R, the standard var() and sd() functions calculate the sample variance and sample standard deviation by default. Unlike the Python libraries, R doesn’t have a built-in argument to switch to the population formula. To calculate the population variance, you must manually multiply the sample variance by (n−1)/n.

X <- c(15, 8, 13, 7, 7, 12, 15, 6, 8, 9)
n <- length(X)

# Sample variance and standard deviation (Default)
var(X)
sd(X)

# Population variance and standard deviation
var(X) * ((n - 1) / n)
sd(X) * sqrt((n - 1) / n)

Conclusion

This article explored a frustrating yet often unnoticed quirk of statistical programming languages and libraries: they choose different default definitions of variance and standard deviation. We saw an example where, for the same input array, numpy and pandas return different values for the variance by default.

This came down to the difference between how variance is calculated for an entire statistical population versus for only a sample from that population, with different libraries making different choices about the default. Finally, we saw that although each library has its own default, all of them can compute both kinds of variance, whether through a ddof argument, a differently named function, or a simple mathematical transformation.

Thanks for reading!
