# Chi Square Test of Independence

## Chi square distribution method

In this post we will look at the Chi Square test of independence, a method of hypothesis testing.

The Chi Square test (also called the Chi Square Calculation) is used to perform hypothesis testing on categorical variables.

It determines whether or not there is a statistically significant relationship between two nominal (categorical) variables.

### When to use the Chi Square Test?

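For example, during the Covid-19 pandemic there were news reports that females were less likely than males to be affected by the novel coronavirus.

Now, you would like to perform a hypothesis test and check whether the claim holds.

Since sex (male or female) is a categorical variable here, you would use the Chi Square Calculation (or the Chi Square Test) to perform your hypothesis testing. Strictly speaking, comparing the observed counts of a single categorical variable against expected counts, as we do in this example, is the goodness-of-fit form of the test; the Chi Square statistic itself is computed the same way in the test of independence.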

### How do we do the Chi Square Calculation?

First, we define our null and alternative hypotheses as follows:

H0 (Null Hypothesis): Males and females are equally likely to be affected by the virus.
H1 (Alternative Hypothesis): Males and females are NOT equally likely to be affected by the virus.

### Chi Square Calculation

Now that our null and alternative hypotheses are defined, we start by assuming the null hypothesis is true.
We decide to take a sample of, let’s say, 1000 people who were affected by Covid-19.
Since our null hypothesis (assumed true) says that males and females are equally likely to be affected (a 50% chance that an affected person is male and a 50% chance that the person is female), we would expect 500 of the 1000 people in the sample to be male and the remaining 500 to be female. This also assumes that males and females are about equally represented in the population being sampled.
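As a small illustration, the expected counts are just the sample size multiplied by the proportion the null hypothesis assigns to each category. The sketch below uses a hypothetical helper name, `expected_counts`, which is our own and not part of any library.

```python
def expected_counts(sample_size, null_proportions):
    """Return the count we would expect in each category if H0 were true.

    `null_proportions` maps each category to the proportion assumed under
    the null hypothesis (illustrative helper, not a library function).
    """
    return {
        category: sample_size * proportion
        for category, proportion in null_proportions.items()
    }


# Under H0, an affected person is equally likely to be male or female.
expected = expected_counts(1000, {"Male": 0.5, "Female": 0.5})
print(expected)  # {'Male': 500.0, 'Female': 500.0}
```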

This is what we expect under the null hypothesis (not a factual reality).

Now, after taking the sample, we find that out of the 1000 people affected by Covid-19, 620 turned out to be male and the remaining 380 turned out to be female.

Let’s tabulate this data as follows:

| Category | Expected | Observed |
| --- | --- | --- |
| Male | 500 | 620 |
| Female | 500 | 380 |

Now, the Chi Square Formula or the Chi Square Equation is given as:

χ^2 = Σ [ (O - E)^2 / E ]

In other words, for each category you find the difference between the observed value (O) and the expected value (E), square this difference, and divide by the expected value.
You compute this value for each category (male and female in our example) and take the sum.
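The formula translates almost directly into code. Here is a minimal sketch in plain Python; the function name `chi_square_statistic` and the three made-up counts in the usage lines are our own, chosen only to show the mechanics.

```python
def chi_square_statistic(observed, expected):
    """Sum (O - E)^2 / E over all categories.

    `observed` and `expected` are sequences of counts in the same
    category order (illustrative helper, not a library function).
    """
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Made-up counts for three categories, just to exercise the function:
# (10-20)^2/20 + (20-20)^2/20 + (30-20)^2/20 = 5 + 0 + 5
print(chi_square_statistic([10, 20, 30], [20, 20, 20]))  # 10.0
```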

### Chi Square Calculation for our example

Male: (620 - 500 )^2 / 500
= 14400 / 500
= 28.8

Female: (380-500)^2 / 500
= 14400 / 500
= 28.8

Summation:
28.8 + 28.8 = 57.6

Therefore, our Chi Square statistic comes out as 57.6.
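If you would like to check this arithmetic yourself, the short sketch below redoes the per-category calculation for our male/female example. The variable names are our own.

```python
# Observed and expected counts from our sample of 1000 people.
observed = {"Male": 620, "Female": 380}
expected = {"Male": 500, "Female": 500}

# (O - E)^2 / E for each category.
contributions = {
    category: (observed[category] - expected[category]) ** 2 / expected[category]
    for category in observed
}

# The Chi Square statistic is the sum of the per-category values.
chi_square = sum(contributions.values())

print(contributions)  # {'Male': 28.8, 'Female': 28.8}
print(chi_square)     # 57.6
```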

To make our decision we need two more quantities.

One is the degrees of freedom and the other is the significance level.

The degrees of freedom here is the number of categories - 1.
Since we had only 2 categories, in our example it is 2 - 1 = 1.
(For a test of independence on a contingency table, it would be (rows - 1) × (columns - 1).)

Let’s take the significance level as 5%.

Now, using the degrees of freedom and the significance level, we look up the Chi Square critical value in the Chi Square distribution table given below.
The first column lists the degrees of freedom and the first row lists the significance levels for which we want to find the critical value. In our example (1 degree of freedom, 5% significance level), the critical value comes out as 3.84.
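For the special case of 1 degree of freedom, the critical value can also be reproduced without a table, because a Chi Square variable with 1 degree of freedom is the square of a standard normal variable. The sketch below relies on that fact and on Python's standard-library `statistics.NormalDist`; the function name is our own, and it does not work for other degrees of freedom.

```python
from statistics import NormalDist


def chi_square_critical_value_df1(alpha):
    """Critical value of the Chi Square distribution with 1 degree of freedom.

    Works ONLY for 1 degree of freedom: the upper-tail critical value at
    significance level `alpha` is the square of the standard normal
    quantile at 1 - alpha / 2 (illustrative helper, not a library function).
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z ** 2


critical_value = chi_square_critical_value_df1(0.05)
print(round(critical_value, 2))  # 3.84
```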

This critical value is then compared with the Chi Square value we calculated earlier.
If our calculated Chi Square is greater than the critical value, then we can say that the observed frequencies of the categories (males and females in our example) are significantly different from the expected frequencies, and thus we can reject our null hypothesis.

Since our Chi Square calculation came out as 57.6, which is far greater than the critical value of 3.84, we can reject our null hypothesis and conclude that males and females are not equally likely to get affected by the virus.
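The same decision can be sketched in code. Besides comparing against the critical value, the snippet also computes a p-value, which is an addition of ours and not something the steps above require. It uses the fact that, for 1 degree of freedom only, the upper-tail probability equals `erfc(sqrt(x / 2))`; the function name is hypothetical.

```python
import math


def chi_square_p_value_df1(statistic):
    """Upper-tail probability for a Chi Square statistic with 1 degree of freedom.

    Valid ONLY for 1 degree of freedom, where P(X > x) = erfc(sqrt(x / 2))
    (illustrative helper, not a library function).
    """
    return math.erfc(math.sqrt(statistic / 2))


chi_square = 57.6      # statistic from our example
critical_value = 3.84  # 1 degree of freedom, 5% significance level

# Reject H0 when the statistic exceeds the critical value.
reject_null = chi_square > critical_value
p_value = chi_square_p_value_df1(chi_square)

print(reject_null)     # True
print(p_value < 0.05)  # True
```

Both checks lead to the same decision here: the statistic is above the critical value, and the p-value is far below the 5% significance level.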