Chi-Square Test

Abhilash Jose  - Data Scientist | Data Analyst

Chi-Square Test: Test for Independence and Test for Goodness of Fit

The Chi-Square test is a non-parametric statistical test used to determine relationships between categorical variables or to test how well a sample fits an expected distribution. There are two main types of Chi-Square tests:

  1. Chi-Square Test for Independence
  2. Chi-Square Test for Goodness of Fit

In this blog, we will explore both types, their purposes, and how to use them.


1. Chi-Square Test for Independence

What is it?

The Chi-Square Test for Independence is used to determine whether there is a significant association between two categorical variables. It tests if the distribution of one variable is independent of the other.

For example, you might use this test to check if there is an association between gender and purchasing preferences in a survey.

Hypotheses:

  • Null Hypothesis (H₀): The two categorical variables are independent (no association).
  • Alternative Hypothesis (H₁): The two categorical variables are dependent (there is an association).

Formula:

The Chi-Square statistic (χ²) is calculated as: $\chi^2 = \sum \frac{(O - E)^2}{E}$

Where:

  • O = Observed frequency (the actual data)
  • E = Expected frequency (what we would expect if the variables were independent)

Steps to Perform a Chi-Square Test for Independence:

  1. Create a Contingency Table: This table displays the frequencies of the two categorical variables in question.
  2. Calculate Expected Frequencies: Using the formula $E = \frac{\text{row total} \times \text{column total}}{\text{grand total}}$, find the expected frequency for each cell in the table.
  3. Calculate the Chi-Square Statistic: Sum up $\frac{(O - E)^2}{E}$ over all cells.
  4. Determine the Degrees of Freedom (df): $df = (\text{number of rows} - 1) \times (\text{number of columns} - 1)$
  5. Compare the Calculated χ² with Critical Value: Look up the critical value from the Chi-Square distribution table using your degrees of freedom and chosen significance level (e.g., 0.05).
  6. Conclusion:
    • If $\chi^2_{\text{calculated}} > \chi^2_{\text{critical}}$, reject the null hypothesis (there is evidence of an association).
    • If $\chi^2_{\text{calculated}} \leq \chi^2_{\text{critical}}$, fail to reject the null hypothesis (there is not enough evidence of an association).
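
The six steps above can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the function name `chi_square_independence` and the `CRITICAL_VALUES_05` lookup are names made up for this example, the critical values are the standard table entries for a 0.05 significance level, and the 2x3 table at the bottom is invented purely to exercise the code.

```python
# Critical values of the chi-square distribution at the 0.05 significance
# level, copied from a standard table (degrees of freedom -> critical value).
CRITICAL_VALUES_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070}


def chi_square_independence(table):
    """Apply the six steps to a contingency table given as a list of rows."""
    # Step 1: the contingency table is the input; derive its totals.
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    # Step 2: expected frequency = row total * column total / grand total.
    expected = [
        [r * c / grand_total for c in col_totals]
        for r in row_totals
    ]

    # Step 3: sum (O - E)^2 / E over every cell.
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )

    # Step 4: degrees of freedom = (rows - 1) * (columns - 1).
    df = (len(row_totals) - 1) * (len(col_totals) - 1)

    # Steps 5 and 6: compare with the tabulated critical value and decide.
    critical = CRITICAL_VALUES_05[df]
    reject_null = statistic > critical
    return statistic, df, expected, reject_null


# Hypothetical 2x3 table, invented only to exercise the function.
stat, df, expected, reject = chi_square_independence([[20, 30, 50], [30, 30, 40]])
print(f"chi2 = {stat:.3f}, df = {df}, reject H0: {reject}")
```

For this made-up table the statistic is about 3.11 with 2 degrees of freedom, which is below the 0.05 critical value of 5.991, so the null hypothesis of independence is not rejected. The lookup only covers 1 to 5 degrees of freedom; larger tables would need more entries from a Chi-Square table.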

Example:

Let’s say we want to test if there’s an association between Gender (Male/Female) and Preference (Product A/Product B).

|              | Product A | Product B | Row Total |
|--------------|-----------|-----------|-----------|
| Male         | 30        | 50        | 80        |
| Female       | 70        | 50        | 120       |
| Column Total | 100       | 100       | 200       |

  • Expected Frequency for Male and Product A: $E = \frac{80 \times 100}{200} = 40$
  • Perform the same calculation for all cells, then calculate the Chi-Square statistic and compare it to the critical value.
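
Carrying the example through to the end gives the following worked calculation. It assumes a 0.05 significance level and applies no continuity correction; neither choice is stated in the example, so treat both as assumptions.

$$E_{\text{Male},A} = E_{\text{Male},B} = \frac{80 \times 100}{200} = 40, \qquad E_{\text{Female},A} = E_{\text{Female},B} = \frac{120 \times 100}{200} = 60$$

$$\chi^2 = \frac{(30 - 40)^2}{40} + \frac{(50 - 40)^2}{40} + \frac{(70 - 60)^2}{60} + \frac{(50 - 60)^2}{60} = 2.5 + 2.5 + 1.667 + 1.667 \approx 8.33$$

$$df = (2 - 1) \times (2 - 1) = 1$$

The tabulated critical value for $df = 1$ at the 0.05 level is 3.841. Since $8.33 > 3.841$, we reject the null hypothesis: under these assumptions, the data suggest an association between Gender and Preference.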


2. Chi-Square Test for Goodness of Fit

What is it?

The Chi-Square Test for Goodness of Fit is used to determine how well a sample fits a theoretical or expected distribution. It tests if the observed distribution of a single categorical variable differs from an expected distribution.

For example, you might use this test to check if a die is fair by comparing the observed roll frequencies with what you would expect from a fair die.

Hypotheses:

  • Null Hypothesis (H₀): The observed data fits the expected distribution.
  • Alternative Hypothesis (H₁): The observed data does not fit the expected distribution.

Formula:

The formula for the Chi-Square statistic is the same as for the test of independence: $\chi^2 = \sum \frac{(O - E)^2}{E}$

Steps to Perform a Chi-Square Test for Goodness of Fit:

  1. Define Expected Frequencies: Based on the theoretical distribution (e.g., for a fair die, all faces should appear equally).
  2. Calculate Observed Frequencies: Collect the actual observed counts for each category.
  3. Calculate the Chi-Square Statistic: Use the same formula, $\chi^2 = \sum \frac{(O - E)^2}{E}$, summing over all categories.
  4. Determine Degrees of Freedom (df): $df = \text{number of categories} - 1$
  5. Compare χ² to Critical Value: Look up the critical value based on the degrees of freedom and significance level.
  6. Conclusion:
    • If $\chi^2_{\text{calculated}} > \chi^2_{\text{critical}}$, reject the null hypothesis (the observed distribution does not fit the expected distribution).
    • If $\chi^2_{\text{calculated}} \leq \chi^2_{\text{critical}}$, fail to reject the null hypothesis (there is not enough evidence that the observed distribution differs from the expected distribution).
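
The goodness-of-fit steps can be sketched the same way in plain Python. As before, this is only an illustration under stated assumptions: `chi_square_goodness_of_fit` and `CRITICAL_VALUES_05` are hypothetical names, the critical values are the standard table entries for a 0.05 significance level, and the counts and proportions in the usage lines are invented.

```python
# Critical values of the chi-square distribution at the 0.05 significance
# level, copied from a standard table (degrees of freedom -> critical value).
CRITICAL_VALUES_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070}


def chi_square_goodness_of_fit(observed, expected_proportions):
    """Compare observed category counts with a theoretical distribution."""
    total = sum(observed)

    # Step 1: expected frequency = theoretical proportion * sample size.
    expected = [p * total for p in expected_proportions]

    # Steps 2 and 3: the observed counts are the input; sum (O - E)^2 / E.
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

    # Step 4: degrees of freedom = number of categories - 1.
    df = len(observed) - 1

    # Steps 5 and 6: compare with the tabulated critical value and decide.
    reject_null = statistic > CRITICAL_VALUES_05[df]
    return statistic, df, reject_null


# Hypothetical data: 100 observations across three categories that are
# expected to occur in proportions 50% / 25% / 25%.
gof_stat, gof_df, gof_reject = chi_square_goodness_of_fit(
    [70, 20, 10], [0.5, 0.25, 0.25]
)
print(f"chi2 = {gof_stat:.2f}, df = {gof_df}, reject H0: {gof_reject}")
```

With these invented counts the statistic is 18.0 with 2 degrees of freedom, well above the critical value of 5.991, so the null hypothesis would be rejected. Passing proportions rather than expected counts keeps the expected total equal to the observed total, which the test requires.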

Example:

Let’s say we roll a die 60 times and get the following results:

| Face     | 1  | 2 | 3  | 4  | 5 | 6  |
|----------|----|---|----|----|---|----|
| Observed | 12 | 9 | 11 | 10 | 8 | 10 |

For a fair die, each face is expected to appear 10 times (since $60/6 = 10$).

  • Calculate $\chi^2 = \sum \frac{(O - E)^2}{E}$
  • Degrees of freedom = 5 (since there are 6 categories, and $df = 6 - 1 = 5$)
  • Compare the calculated χ² to the critical value.
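
Worked through with the counts from the table, and assuming a 0.05 significance level (the example does not name one), the calculation looks like this:

$$\chi^2 = \frac{(12 - 10)^2 + (9 - 10)^2 + (11 - 10)^2 + (10 - 10)^2 + (8 - 10)^2 + (10 - 10)^2}{10} = \frac{4 + 1 + 1 + 0 + 4 + 0}{10} = 1.0$$

The tabulated critical value for $df = 5$ at the 0.05 level is 11.070. Since $1.0 \leq 11.070$, we fail to reject the null hypothesis: these 60 rolls give no evidence that the die is unfair.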

Key Differences Between the Two Tests

  • Chi-Square Test for Independence: Tests if there is an association between two categorical variables.
  • Chi-Square Test for Goodness of Fit: Tests if the observed distribution of one categorical variable fits an expected distribution.

Conclusion

The Chi-Square Test is a robust statistical tool for analyzing relationships between categorical data or comparing observed data to expected distributions. Whether you’re testing for independence between two variables or for how well your data fits a model, the Chi-Square test provides critical insights into your data, making it an essential tool in the data scientist’s toolkit.

Abhilash Jose is a data scientist and data analyst from Kerala, India. He specializes in data analysis and is well-known for his expertise in areas such as machine learning and statistical modeling. Abhilash is recognized as a top freelance data scientist in India, with a focus on extracting meaningful insights from data to drive informed decision-making. His skills encompass a wide range of techniques, including data mining, predictive modeling, and data visualization.