Chi-Square Test: Test for Independence and Test for Goodness of Fit
The Chi-Square test is a non-parametric statistical test used to check for an association between categorical variables or to test how well a sample fits an expected distribution. There are two main types of Chi-Square tests:
- Chi-Square Test for Independence
- Chi-Square Test for Goodness of Fit
In this blog, we will explore both types, their purposes, and how to use them.
1. Chi-Square Test for Independence
What is it?
The Chi-Square Test for Independence is used to determine whether there is a significant association between two categorical variables. It tests if the distribution of one variable is independent of the other.
For example, you might use this test to check if there is an association between gender and purchasing preferences in a survey.
Hypotheses:
- Null Hypothesis (H₀): The two categorical variables are independent (no association).
- Alternative Hypothesis (H₁): The two categorical variables are dependent (there is an association).
Formula:
The Chi-Square statistic (χ²) is calculated as:
$$\chi^2 = \sum \frac{(O - E)^2}{E}$$
Where:
- O = Observed frequency (the actual data)
- E = Expected frequency (what we would expect if the variables were independent)
Steps to Perform a Chi-Square Test for Independence:
- Create a Contingency Table: This table displays the frequencies of the two categorical variables in question.
- Calculate Expected Frequencies: Using the formula $E = \frac{\text{row total} \times \text{column total}}{\text{grand total}}$, find the expected frequency for each cell in the table.
- Calculate the Chi-Square Statistic: Sum up $\frac{(O - E)^2}{E}$ for each cell.
- Determine the Degrees of Freedom (df): $df = (\text{number of rows} - 1) \times (\text{number of columns} - 1)$
- Compare the Calculated χ² with Critical Value: Look up the critical value from the Chi-Square distribution table using your degrees of freedom and chosen significance level (e.g., 0.05).
- Conclusion:
  - If $\chi^2_{calculated} > \chi^2_{critical}$, reject the null hypothesis (the data indicate an association).
  - If $\chi^2_{calculated} \leq \chi^2_{critical}$, fail to reject the null hypothesis (there is not enough evidence of an association).
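The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function name `chi_square_independence` and the 2×3 table of counts are made up for this sketch, and the critical value is taken from a standard Chi-Square table rather than computed.

```python
def chi_square_independence(table):
    """Return (chi2, df, expected) for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    # Expected count under independence: row total * column total / grand total
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

    # Sum (O - E)^2 / E over every cell of the table
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )

    # df = (number of rows - 1) * (number of columns - 1)
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df, expected


# Hypothetical counts, invented purely to exercise the function
observed = [[10, 20, 30],
            [20, 20, 20]]
chi2, df, expected = chi_square_independence(observed)

# 5.991 is the tabulated critical value for df = 2 at the 0.05 level
critical_value = 5.991
reject_null = chi2 > critical_value
print(f"chi2 = {chi2:.3f}, df = {df}, reject H0: {reject_null}")
```

For this made-up table the statistic is about 5.33 with 2 degrees of freedom, which is below 5.991, so the null hypothesis of independence is not rejected at the 0.05 level.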
Example:
Let’s say we want to test if there’s an association between Gender (Male/Female) and Preference (Product A/Product B).
Gender | Product A | Product B | Row Total |
---|---|---|---|
Male | 30 | 50 | 80 |
Female | 70 | 50 | 120 |
Column Total | 100 | 100 | 200 |
- Expected Frequency for Male and Product A: $E = \frac{80 \times 100}{200} = 40$
- Perform the same calculation for all cells, then calculate the Chi-Square statistic and compare it to the critical value.
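Carrying the calculation through for this table gives the following worked sketch. The 0.05 significance level is an assumption, chosen to match the level mentioned in the steps.

$$
E_{\text{Male},A} = E_{\text{Male},B} = \frac{80 \times 100}{200} = 40, \qquad
E_{\text{Female},A} = E_{\text{Female},B} = \frac{120 \times 100}{200} = 60
$$

$$
\chi^2 = \frac{(30 - 40)^2}{40} + \frac{(50 - 40)^2}{40} + \frac{(70 - 60)^2}{60} + \frac{(50 - 60)^2}{60}
= 2.5 + 2.5 + 1.667 + 1.667 \approx 8.33
$$

With $df = (2 - 1) \times (2 - 1) = 1$, the tabulated critical value at the 0.05 level is 3.841. Since $8.33 > 3.841$, we reject the null hypothesis: in this sample, gender and product preference appear to be associated.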
2. Chi-Square Test for Goodness of Fit
What is it?
The Chi-Square Test for Goodness of Fit is used to determine how well a sample fits a theoretical or expected distribution. It tests if the observed distribution of a single categorical variable differs from an expected distribution.
For example, you might use this test to check if a die is fair by comparing the observed roll frequencies with what you would expect from a fair die.
Hypotheses:
- Null Hypothesis (H₀): The observed data fits the expected distribution.
- Alternative Hypothesis (H₁): The observed data does not fit the expected distribution.
Formula:
The formula for the Chi-Square statistic is the same as for the test of independence:
$$\chi^2 = \sum \frac{(O - E)^2}{E}$$
Steps to Perform a Chi-Square Test for Goodness of Fit:
- Define Expected Frequencies: Derive them from the theoretical distribution (e.g., for a fair die, every face should appear equally often).
- Calculate Observed Frequencies: Collect the actual observed counts for each category.
- Calculate the Chi-Square Statistic: Use the same formula $\frac{(O - E)^2}{E}$.
- Determine Degrees of Freedom (df): $df = (\text{number of categories} - 1)$
- Compare χ² to Critical Value: Look up the critical value based on the degrees of freedom and significance level.
- Conclusion:
  - If $\chi^2_{calculated} > \chi^2_{critical}$, reject the null hypothesis (the observed distribution does not fit the expected distribution).
  - If $\chi^2_{calculated} \leq \chi^2_{critical}$, fail to reject the null hypothesis (the observed data are consistent with the expected distribution).
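The goodness-of-fit steps can be sketched the same way in plain Python. Treat this as an illustration under stated assumptions: the function name `chi_square_goodness_of_fit`, the hypothesized proportions, and the sample counts are all invented for this sketch, and the critical value is read from a standard Chi-Square table.

```python
def chi_square_goodness_of_fit(observed, expected_proportions=None):
    """Return (chi2, df) for observed category counts.

    If expected_proportions is omitted, every category is assumed equally likely.
    """
    n = sum(observed)
    k = len(observed)
    if expected_proportions is None:
        expected_proportions = [1 / k] * k
    if abs(sum(expected_proportions) - 1.0) > 1e-9:
        raise ValueError("expected proportions must sum to 1")

    # Expected count per category = sample size * hypothesized proportion
    expected = [n * p for p in expected_proportions]

    # Sum (O - E)^2 / E over the categories
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

    # df = number of categories - 1
    df = k - 1
    return chi2, df


# Hypothetical sample of 100 observations tested against made-up proportions
observed = [45, 35, 20]
chi2, df = chi_square_goodness_of_fit(observed, [0.5, 0.3, 0.2])

# 5.991 is the tabulated critical value for df = 2 at the 0.05 level
critical_value = 5.991
reject_null = chi2 > critical_value
print(f"chi2 = {chi2:.3f}, df = {df}, reject H0: {reject_null}")
```

Here the statistic is about 1.33 with 2 degrees of freedom, well below 5.991, so this invented sample is consistent with the hypothesized proportions.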
Example:
Let’s say we roll a die 60 times and get the following results:
Face | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|
Obs. | 12 | 9 | 11 | 10 | 8 | 10 |
For a fair die, each face should appear 10 times (since $60/6 = 10$).
- Calculate $\chi^2 = \sum \frac{(O - E)^2}{E}$
- Degrees of freedom = 5 (since there are 6 categories, and $df = 6 - 1$)
- Compare the calculated χ² to the critical value.
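Worked through for the rolls above, the calculation looks like this. The 0.05 significance level is an assumption, matching the level used as an example earlier.

$$
\chi^2 = \frac{(12 - 10)^2 + (9 - 10)^2 + (11 - 10)^2 + (10 - 10)^2 + (8 - 10)^2 + (10 - 10)^2}{10}
= \frac{4 + 1 + 1 + 0 + 4 + 0}{10} = 1.0
$$

With $df = 5$, the tabulated critical value at the 0.05 level is 11.070. Since $1.0 < 11.070$, we fail to reject the null hypothesis: these 60 rolls give no evidence that the die is unfair.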
Key Differences Between the Two Tests
- Chi-Square Test for Independence: Tests if there is an association between two categorical variables.
- Chi-Square Test for Goodness of Fit: Tests if the observed distribution of one categorical variable fits an expected distribution.
Conclusion
The Chi-Square test is a widely used tool for analyzing relationships between categorical variables and for comparing observed counts to an expected distribution. Whether you're testing for independence between two variables or checking how well your data fit a model, the procedure is the same: work out the expected frequencies, sum (O − E)²/E over all cells or categories, and compare the result with the critical value for the appropriate degrees of freedom.