Understanding the Chi-Square Test Beyond the Formula


Imagine an author who has written a children's book and released it in two versions, into the market at the same time and at the same price. One version has a basic cover design, while the other has a high-quality cover design, which of course cost him more.

He then observes the sales for a certain period and gathers the information shown below.

Cover Type         Sold   Not Sold   Total
Low-cost cover      320        180     500
High-cost cover     350        150     500
Total               670        330    1000

Now he comes to us and wants to know whether the cover design of his books has affected their sales.


From the sales data, we can observe that there are two categorical variables. The first is cover type, which is either high-cost or low-cost, and the second is sales outcome, which is either sold or not sold.

Now we want to know whether these two categorical variables are related.

We know that when we want to find a relationship between two categorical variables, we use the Chi-square test for independence.


In this scenario, we will use Python to apply the Chi-square test and calculate the chi-square statistic and p-value.

Code:

import numpy as np
from scipy.stats import chi2_contingency

# Observed data: rows are cover types (low-cost, high-cost),
# columns are sales outcomes (sold, not sold)
observed = np.array([
    [320, 180],
    [350, 150]
])

# correction=False disables Yates' continuity correction
chi2, p, dof, expected = chi2_contingency(observed, correction=False)

print("Chi-square statistic:", chi2)
print("p-value:", p)
print("Degrees of freedom:", dof)
print("Expected frequencies:\n", expected)

Result:

Chi-square statistic: 4.0706 (approx.)
p-value: 0.0436 (approx.)
Degrees of freedom: 1
Expected frequencies:
[[335. 165.]
 [335. 165.]]

The chi-square statistic is about 4.07, with a p-value of about 0.044, which is below the 0.05 threshold. This suggests that cover type and sales are statistically associated.


We have now obtained the p-value, but before treating it as a decision, we need to understand how we got this value and what the assumptions of this test are.

Understanding this will help us decide whether the result we obtained is reliable.

Now let's try to understand what the Chi-Square test actually is.


We have this data.

Cover Type         Sold   Not Sold   Total
Low-cost cover      320        180     500
High-cost cover     350        150     500
Total               670        330    1000

By observing the data, we can say that sales for books with the high-cost cover are higher, so we might think that the cover worked.

However, in real life, the numbers fluctuate by chance. Even when the cover has no effect and customers pick books randomly, we can still get unequal values.

Randomness always creates imbalances.

Now the question is, “Is this difference larger than what randomness normally creates?”

Let's see how the Chi-Square test answers that question.


We already have a formula to calculate the Chi-Square statistic.

\[
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
\]

where:

χ² is the Chi-Square test statistic
i represents the row index
j represents the column index
Oᵢⱼ is the observed count in row i and column j
Eᵢⱼ is the expected count in row i and column j


First let’s give attention to Expected Counts.

Before understanding what expected counts are, let's state the hypotheses for our test.

Null Hypothesis (H₀)

The cover type and sales outcome are independent. (The cover type has no effect.)

Alternative Hypothesis (H₁)

The cover type and sales outcome are not independent. (The cover type is related to whether a book is sold.)


Now what do we mean by expected counts?

Let's say the null hypothesis is true, which means the cover type has no effect on the sales of books.

Let’s return to probabilities.

As we already know, the formula for simple probability is:

\[P(A) = \frac{\text{Number of favorable outcomes}}{\text{Total number of outcomes}}\]

In our data, the overall probability of a book being sold is:

\[P(\text{Sold}) = \frac{\text{Number of books sold}}{\text{Total number of books}} = \frac{670}{1000} = 0.67\]

In probability, when we write P(A | B), we mean the probability of event A given that event B has already occurred.

Under independence, cover type and sales are not related, which means the probability of being sold does not depend on the cover type:

\[
P(\text{Sold} \mid \text{Low-cost cover}) = P(\text{Sold})
\]

\[
P(\text{Sold} \mid \text{High-cost cover}) = P(\text{Sold})
\]

Since

\[
P(\text{Sold}) = \frac{670}{1000} = 0.67
\]

it follows that

\[
P(\text{Sold} \mid \text{Low-cost cover}) = 0.67
\]

Under independence, we have P(Sold | Low-cost cover) = 0.67, which means 67% of low-cost cover books are expected to be sold.

Since we have 500 books with low-cost covers, we convert this probability into an expected number of sold books.

\[0.67 \times 500 = 335\]

This means we expect 335 low-cost cover books to be sold under independence.

Based on our data table, we can represent this as E₁₁.

Similarly, the expected value for the high-cost cover and sold will be 335, which is represented by E₂₁.

Now let's calculate E₁₂ (Low-cost cover, Not Sold) and E₂₂ (High-cost cover, Not Sold).

The overall probability of a book not being sold is:

\[P(\text{Not Sold}) = \frac{330}{1000} = 0.33\]

Under independence, this probability applies to each subgroup, as before.

\[P(\text{Not Sold} \mid \text{Low-cost cover}) = 0.33\]

\[P(\text{Not Sold} \mid \text{High-cost cover}) = 0.33\]

Now we convert this probability into the expected count of unsold books.

\[E_{12} = 0.33 \times 500 = 165\]

\[E_{22} = 0.33 \times 500 = 165\]
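To make the probability route concrete, here is a minimal Python sketch of the same arithmetic, using only the numbers from our table:

N = 1000

p_sold = 670 / N       # overall probability of a book being sold
p_not_sold = 330 / N   # overall probability of a book not being sold

# Under independence, the same probabilities apply within each cover group,
# so each expected count is probability * group size (500 books per group)
E11 = p_sold * 500      # low-cost cover, sold      -> 335.0
E21 = p_sold * 500      # high-cost cover, sold     -> 335.0
E12 = p_not_sold * 500  # low-cost cover, not sold  -> 165.0
E22 = p_not_sold * 500  # high-cost cover, not sold -> 165.0

print(E11, E12, E21, E22)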


We used probabilities here to understand the idea of expected counts, but there is also a direct formula to calculate them. Let's take a look at it.

Formula to calculate Expected Counts:

\[E_{ij} = \frac{R_i \times C_j}{N}\]

Where:

  • Rᵢ = row total
  • Cⱼ = column total
  • N = grand total

Low-cost cover, Sold:

\[E_{11} = \frac{500 \times 670}{1000} = 335\]

Low-cost cover, Not Sold:

\[E_{12} = \frac{500 \times 330}{1000} = 165\]

High-cost cover, Sold:

\[E_{21} = \frac{500 \times 670}{1000} = 335\]

High-cost cover, Not Sold:

\[E_{22} = \frac{500 \times 330}{1000} = 165\]

Both ways, we get the same values.
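The same formula in NumPy, as a quick sanity check (a sketch using the observed table from earlier):

import numpy as np

observed = np.array([[320, 180],
                     [350, 150]])

# E_ij = (row total * column total) / grand total, computed for every cell at once
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
print(expected)
# [[335. 165.]
#  [335. 165.]]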


By calculating expected counts, what we are finding is this: the counts each cell would have, on average, if the null hypothesis were true and the two categorical variables were independent.

Here, we’ve got 1,000 books and we all know that 670 are sold. Now we imagine randomly picking books and labeling them as sold.

After choosing 670 books, we check how lots of them belong to the low-cost cover group and what number of belong to the high-cost cover group.

If we repeat this process over and over, we might obtain values around 335. Sometimes they might be 330 or 340.

We then consider the typical, and 335 becomes the central point of the distribution if the whole lot happens purely because of randomness.

This doesn’t mean the count must equal 335, but that 335 represents the natural center of variation under independence.

The Chi-Square test then measures how far the observed count deviates from this central value relative to the variation expected under randomness.
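We can simulate this thought experiment directly. Here is a minimal sketch, assuming the random labeling is modeled as a hypergeometric draw (500 low-cost covers among 1,000 books, 670 books labeled as sold):

import numpy as np

rng = np.random.default_rng(42)

# Randomly label 670 of 1,000 books (500 per cover group) as "sold"
# and count how many of the sold books have a low-cost cover;
# repeat the experiment many times
counts = rng.hypergeometric(ngood=500, nbad=500, nsample=670, size=100_000)

print(counts.mean())  # ~335, the natural center under independence
print(counts.std())   # ~7.4, the natural spread under independence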


We calculated the expected counts:

E₁₁ = 335; E₂₁ = 335; E₁₂ = 165; E₂₂ = 165

Cover Type         Sold (Expected)   Not Sold (Expected)
Low-cost cover           335                 165
High-cost cover          335                 165

The next step is to calculate the deviation between the observed and expected counts. To do this, we subtract the expected count from the observed count.

\[
\begin{aligned}
\text{Low-Cost Cover \& Sold:} \quad & O - E = 320 - 335 = -15 \\
\text{Low-Cost Cover \& Not Sold:} \quad & O - E = 180 - 165 = 15 \\
\text{High-Cost Cover \& Sold:} \quad & O - E = 350 - 335 = 15 \\
\text{High-Cost Cover \& Not Sold:} \quad & O - E = 150 - 165 = -15
\end{aligned}
\]

In the next step, we square the differences, because if we add the raw deviations, the positive and negative values cancel out, resulting in zero.

This would incorrectly suggest that there is no imbalance. Squaring solves the cancellation problem by allowing us to measure the magnitude of the imbalance, regardless of direction.

\[
\begin{aligned}
\text{Low-Cost Cover \& Sold:} \quad & (O - E)^2 = (-15)^2 = 225 \\
\text{Low-Cost Cover \& Not Sold:} \quad & (15)^2 = 225 \\
\text{High-Cost Cover \& Sold:} \quad & (15)^2 = 225 \\
\text{High-Cost Cover \& Not Sold:} \quad & (-15)^2 = 225
\end{aligned}
\]

Now that we’ve got calculated the squared deviations for every cell, the following step is to divide them by their respective expected counts.

This standardizes the deviations by scaling them relative to what was expected under the null hypothesis.

\[
\begin{aligned}
\text{Low-Cost Cover \& Sold:} \quad & \frac{(O - E)^2}{E} = \frac{225}{335} = 0.6716 \\
\text{Low-Cost Cover \& Not Sold:} \quad & \frac{225}{165} = 1.3636 \\
\text{High-Cost Cover \& Sold:} \quad & \frac{225}{335} = 0.6716 \\
\text{High-Cost Cover \& Not Sold:} \quad & \frac{225}{165} = 1.3636
\end{aligned}
\]


Now, for each cell, we have calculated:

\[
\frac{(O - E)^2}{E}
\]

Each of these values represents the standardized squared contribution of a cell to the total imbalance. Summing them gives the overall standardized squared deviation for the table, known as the Chi-Square statistic.

\[
\begin{aligned}
\chi^2 &= 0.6716 + 1.3636 + 0.6716 + 1.3636 \\
&= 4.0704 \\
&\approx 4.07
\end{aligned}
\]
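The same computation in a few NumPy lines; this sketch reproduces, up to rounding, what chi2_contingency gave us earlier:

import numpy as np

observed = np.array([[320, 180], [350, 150]])
expected = np.array([[335, 165], [335, 165]])

deviations = observed - expected            # [[-15  15] [ 15 -15]]
contributions = deviations**2 / expected    # each cell's (O - E)^2 / E
chi2_stat = contributions.sum()

print(round(chi2_stat, 4))  # 4.0706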


We obtained a Chi-Square statistic of 4.07.

How can we interpret this value?

After calculating the chi-square statistic, we compare it with the critical value from the chi-square distribution table for 1 degree of freedom at a significance level of 0.05.

For df = 1 and α = 0.05, the critical value is 3.84. Since our calculated value (4.07) is larger than 3.84, we reject the null hypothesis.
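If a printed table is not handy, the critical value can also be looked up with scipy (a one-line sketch):

from scipy.stats import chi2

# Critical value: the point with 5% of the distribution to its right, at df = 1
print(chi2.ppf(0.95, df=1))  # ~3.8415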


The chi-square test is complete at this point, but we still need to understand what df = 1 means and how the critical value of 3.84 is obtained.

This is where things begin to get both interesting and slightly confusing.

First, let’s understand what df = 1 means.

‘df’ means Degrees of Freedom.

From our data,

Cover Type         Sold   Not Sold   Total
Low-cost cover      320        180     500
High-cost cover     350        150     500
Total               670        330    1000

We can call this a contingency table, and to be specific, a 2×2 contingency table, since it is defined by the number of categories in variable 1 as rows and the number of categories in variable 2 as columns. Here we have 2 rows and 2 columns.

We can observe that the row totals and column totals are fixed. This means that if one cell value changes, the other three must adjust accordingly to preserve those totals.

In other words, there is only one independent way the table can vary while keeping the row and column totals fixed. Therefore, the table has 1 degree of freedom.

We can also compute the degrees of freedom using the standard formula for a contingency table:

\[
df = (r - 1)(c - 1)
\]

where r is the number of rows and c is the number of columns.

In our example, we have a 2×2 table, so:

\[
df = (2 - 1)(2 - 1) = 1
\]


We now have an idea of what degrees of freedom mean from the data table. But why do we need to calculate them?

Now, let’s imagine a four-dimensional space through which each axis corresponds to at least one cell of the contingency table:

Axis 1: Low-cost & Sold

Axis 2: Low-cost & Not Sold

Axis 3: High-cost & Sold

Axis 4: High-cost & Not Sold

From the data table, we have the observed counts (320, 180, 350, 150). We also calculated the expected counts under independence as (335, 165, 335, 165).

Both the observed and expected counts can be represented as points in this four-dimensional space.

Now we have two points in a four-dimensional space.

We already calculated the difference between observed and expected counts (-15, 15, 15, -15).

We can write it as −15 · (1, −1, −1, 1).

Now consider the observed data table again.

Let’s say we increase the Low-cost & Sold count from 320 to 321 (a +1 change).

To keep the row and column totals fixed, Low-cost & Not Sold must decrease by 1, High-cost & Sold must decrease by 1, and High-cost & Not Sold must increase by 1.

This produces the pattern (1, −1, −1, 1).

Any valid change in a 2×2 table with fixed margins follows this same pattern multiplied by some scalar.

Under fixed row and column totals, many different 2×2 tables are possible. When we represent each table as a point in four-dimensional space, these tables lie on a one-dimensional straight line.

We can refer to the expected counts, (335, 165, 335, 165), as the center of that straight line, and let's denote that point as E.

The point E lies at the center of the line because, under pure randomness (independence), these are the values we expect to observe.

We then measure how much the observed counts deviate from these expected counts.

We can observe that every point on the line is:

E + x · (1, −1, −1, 1)

where x is any scalar.

From our observed data table, we can write:

O = E + (−15) · (1, −1, −1, 1)

Similarly, every valid table can be written like this.


The vector (1, −1, −1, 1) defines the direction of the one-dimensional deviation space; we call it a direction vector. The scalar value just tells us how far to move in that direction.

Every valid table is obtained by starting at the expected table and moving some distance along this direction.

For example, any point on the line is (335 + x, 165 − x, 335 − x, 165 + x).

Substituting x = −15, the values become
(335 − 15, 165 + 15, 335 + 15, 165 − 15),
which simplifies to (320, 180, 350, 150).
This matches our observed table.
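Here is a minimal sketch of this one-parameter family of tables (the names E, d, and table are just illustrative):

import numpy as np

E = np.array([335, 165, 335, 165])  # expected counts as a point in 4D space
d = np.array([1, -1, -1, 1])        # the single direction of valid movement

def table(x):
    # Every 2x2 table with the same margins is E + x*d for some scalar x
    return (E + x * d).reshape(2, 2)

print(table(-15))
# [[320 180]
#  [350 150]]  <- exactly the observed table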


We can imagine that as x changes, the table moves in a single direction along a straight line.

This means the entire deviation from independence is controlled by a single scalar value, which moves the table along that line.

Since all tables lie along a one-dimensional line, the system has just one independent direction of movement. This is why the degrees of freedom equal 1.


At this point, we know how to compute the chi-square statistic. As derived earlier, standardizing the deviation from the expected count and squaring it leads to a chi-square value of 4.07.


Now that we understand what degrees of freedom mean, let’s explore what the chi-square distribution actually is.

Coming back to our observed data, we have 1,000 books in total. Out of these, 670 were sold and 330 were not sold.

Under the assumption of independence (i.e., cover type does not influence whether a book is sold), we can imagine randomly choosing 670 books out of 1,000 and labeling them as “sold.”

We then count how many of these chosen books have a low-cost cover. Let this count be denoted by X.

If we repeat this experiment over and over, as discussed earlier, each repetition would produce a different value of X, such as 321, 322, 326, and so on.

Now if we plot these values across many repetitions, we can observe that the values cluster around 335, forming a bell-shaped curve.

Plot:

Image by Author

We can observe an approximately normal distribution.

From our observed data table, the number of Low-cost & Sold books is 320. The distribution shown above represents how values behave under independence.

We see that values like 334 and 336 are common, while 330 and 340 are somewhat less common. A value like 320 appears to be relatively rare.

But how do we determine this accurately? To answer that, we must compare 320 to the center of the distribution, which is 335, and consider how wide the curve is.

The width of the curve reflects how much natural variation we expect under independence. Based on this spread, we can assess how frequently a value like 320 would occur.

For that, we need to perform standardization.

Expected value: \( \mu = 335 \)

Observed value: \( X = 320 \)

Difference: \( 320 - 335 = -15 \)

Standard deviation: \( \sigma \approx 7.4347 \) (the spread of a cell count under the null with the margins fixed; its origin is shown in the sketch just below)

\[
Z = \frac{320 - 335}{7.4347} \approx -2.0176
\]

So, 320 is about two standard deviations below the average.
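As a quick check of these numbers: under the null hypothesis with all margins fixed, the standard deviation of a cell count can be written in terms of the row and column totals, and this pooled form is exactly the one that makes Z² equal the chi-square statistic (a sketch):

import math

N = 1000
R1, R2 = 500, 500   # row totals (low-cost, high-cost)
C1, C2 = 670, 330   # column totals (sold, not sold)

sigma = math.sqrt(R1 * R2 * C1 * C2 / N**3)
z = (320 - 335) / sigma

print(round(sigma, 4))  # 7.4347
print(round(z, 4))      # -2.0176
print(round(z**2, 4))   # 4.0706 -- our chi-square statistic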

What we have calculated here is simply the Z-score.

The Z-score of 320 is roughly −2.0176.


In the same way, if we standardize every possible value of X, the sampling distribution of X above is transformed into the standard normal distribution with mean = 0 and standard deviation = 1.

Image by Author

Now, we already know that 320 is about two standard deviations below the average.

Z-score ≈ −2.0176

We already computed a chi-square statistic equal to 4.07.

Now let's square the Z-score:

Z² = (−2.0176)² ≈ 4.0706, which is the same as our chi-square statistic.


If a standardized deviation follows a standard normal distribution, then squaring that random variable transforms the distribution into a chi-square distribution with one degree of freedom.

Image by Author

This is the curve obtained when we square a standard normal random variable Z. Since squaring removes the sign, both positive and negative values of Z map to positive values.

As a result, the symmetric bell-shaped distribution is transformed into a right-skewed distribution that follows a chi-square distribution with one degree of freedom.
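We can verify this numerically: squaring a large sample of standard normal draws should reproduce the chi-square distribution with one degree of freedom, including the 3.84 critical value (a simulation sketch):

import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
z = rng.standard_normal(1_000_000)

# The 95th percentile of Z^2 should land at the chi-square critical value
print(np.quantile(z**2, 0.95))  # ~3.84 (empirical)
print(chi2.ppf(0.95, df=1))     # 3.8415 (theoretical)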


When the degrees of freedom equal 1, we don't actually need to think in terms of squaring to make a decision.

There is only one independent deviation from independence, so we can standardize it and perform a two-sided Z-test.

Squaring simply turns that Z value into a chi-square value when df = 1. However, when the degrees of freedom are greater than 1, there are multiple independent deviations.

If we just add those deviations together, positive and negative values cancel out.

Squaring ensures that all deviations contribute positively to the total deviation.

That is why the chi-square statistic always sums squared standardized deviations, and this matters especially when df is greater than 1.


We now have a clearer understanding of how the normal distribution is linked to the chi-square distribution.

Now let’s use this distribution to perform hypothesis testing.

Null Hypothesis (H₀)

The cover type and sales outcome are independent. (The cover type has no effect.)

Alternative Hypothesis (H₁)

The cover type and sales outcome are not independent. (The cover type is related to whether a book is sold.)

A commonly used significance level is α = 0.05. This means we reject the null hypothesis only if our result falls within the most extreme 5% of outcomes under the null hypothesis.

From the Chi-Square distribution at df = 1 and α = 0.05, the critical value is 3.84.

The value 3.84 is the critical (cut-off) value. The area to the right of 3.84 equals 0.05, representing the rejection region.

Since our calculated chi-square statistic exceeds 3.84, it falls within this rejection region.

Image by Author

The p-value here is about 0.044 (0.0436 unrounded), which is the area to the right of 4.07.

This means that if cover type and sales were truly independent, there would be only about a 4.4% chance of observing a difference this large.
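The same tail area can be read off scipy's survival function (a one-line check):

from scipy.stats import chi2

# p-value: the area to the right of the observed statistic, at df = 1
print(chi2.sf(4.0706, df=1))  # ~0.0436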


Now, whether these results are reliable depends on the assumptions of the chi-square test.

Let's look at the assumptions for this test:

1) Independence of Observations

In this context, independence means that one book sale should not influence another. The same customer should not be counted multiple times, and observations should not be paired or repeated.

2) Data must be categorical counts.

3) Expected Frequencies Should Not Be Too Small

All expected cell counts should generally be at least 5 (see the quick check after this list).

4) Random Sampling

The sample should represent the population.
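A minimal sketch of that expected-count check for our table:

import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[320, 180], [350, 150]])
_, _, _, expected = chi2_contingency(observed, correction=False)

# Assumption 3: every expected cell count should be at least 5
print(expected.min())         # 165.0, comfortably above 5
print((expected >= 5).all())  # True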


Because all the assumptions are satisfied and the p-value (≈ 0.044) is below 0.05, we reject the null hypothesis and conclude that cover type and sales are statistically associated.


At this point, you may be confused about something.

We spent a lot of time focusing on one cell, for example the low-cost books that were sold.

We calculated its deviation, standardized it, and used that to understand how the chi-square statistic is formed.

But what about the other cells? What about high-cost books, or the unsold ones?

The essential thing to understand is that in a 2×2 table, all four cells are connected. Once the row totals and column totals are fixed, the table has only one degree of freedom.

This means the counts cannot vary independently. If one cell increases, the other cells automatically adjust to keep the totals consistent.

As we discussed earlier, we can consider all possible tables with the same margins as points in a four-dimensional space.

However, because of the constraints imposed by the fixed totals, those points do not spread out in every direction. Instead, they lie along a single straight line.

Every deviation from independence moves the table only along that one direction.

So, when one cell deviates by, say, +15 from its expected value, the other cells are automatically determined by the structure of the table.

The entire table shifts together. The deviation is not about just one number; it represents the movement of the entire system.

When we compute the chi-square statistic, we subtract the expected count from the observed count for all cells and standardize each deviation.

But in a 2×2 table, those deviations are tied together. They move as one coordinated structure.

This means examining one cell is enough to understand how far the entire table has moved away from independence, and also why the statistic follows a chi-square distribution with one degree of freedom.


Learning never ends, and there is still much more to explore about the chi-square test.

I hope this article has given you a clear understanding of what the chi-square test actually does.

In another blog, we will discuss what happens when the assumptions are not met and why the chi-square test may fail in those situations.

There has been a small pause in my time series series. I noticed that a few topics deserved more clarity and careful thinking, so I decided to slow down instead of pushing forward. I'll return to it soon with explanations that feel more complete and intuitive.

If you enjoyed this article, you can explore more of my writing on Medium and LinkedIn.

Thanks for reading!
