Key Terms in Statistics

Abhilash Jose - Data Science Specialist

Descriptive Statistics

  1. Population: The entire group you want to study.
    Example: All the students in a university.
  2. Sample: A smaller group selected from the population.
    Example: 100 students randomly chosen from the university.
  3. Parameter: A numerical value that describes a characteristic of a population.
    Example: The average height of all students in the university (e.g., 5.7 feet).
  4. Statistic: A numerical value that describes a characteristic of a sample.
    Example: The average height of the 100 students in the sample (e.g., 5.5 feet).
  5. Mean (μ or x̄): The average of a set of numbers.
    Example: For the numbers 2, 4, 6, the mean is (2 + 4 + 6) / 3 = 4.
  6. Trimmed Mean: The mean calculated after removing a specified percentage of the smallest and largest values.
    Example: For the dataset 1, 2, 3, 4, 100, a 20% trimmed mean removes 20% of the values from each end (here, 1 and 100), giving (2 + 3 + 4) / 3 = 3 (see the code sketch after this list).
  7. Median: The middle number in a sorted list of numbers.
    Example: In the list 1, 3, 3, 6, 7, 8, 9, the median is 6.
    • Odd count: the median is the single middle value, as in the example above.
    • Even count: the median is the average of the two middle values (e.g., in 1, 2, 3, 4, the median is (2 + 3) / 2 = 2.5).
  8. Mode: The number that appears most frequently in a set.
    Example: In the list 1, 2, 2, 3, 4, the mode is 2.
  9. Standard Deviation (σ or s): A measure of how spread out the numbers are from the mean.
    Example: If the heights of students are close to the average, the standard deviation is small; if they are very different, it’s large.
  10. Variance: The average of the squared differences from the mean, another way to measure spread.
    Example: If the heights are 5, 6, and 7 feet, the variance helps us understand how much these heights differ from the average height.
  11. Coefficient of Variation (CV): The ratio of the standard deviation to the mean, expressed as a percentage.
    Example: If the average test score is 80 and the standard deviation is 10, the CV is (10/80) × 100 = 12.5%.
  12. Percentiles and Quartiles: Percentiles divide data into 100 equal parts; quartiles divide it into four.
    • Percentile: If you score in the 90th percentile on a test, you did better than 90% of test-takers.
    • Quartiles: The first quartile (Q1) is the value below which 25% of the data falls.
  13. Skewness: A measure of the asymmetry of a distribution.
    • Positive Skew: Most data points are low, with a few high outliers (e.g., incomes).
    • Negative Skew: Most data points are high, with a few low outliers.
    • Symmetric: The data is evenly distributed around the mean.
  14. Kurtosis: A measure of the “tailedness” of a distribution.
    • Leptokurtic: More outliers than a normal distribution.
    • Mesokurtic: Similar to a normal distribution.
    • Platykurtic: Fewer outliers than a normal distribution.
  15. Outliers: Data points that are significantly different from the rest.
    Example: In test scores of 70, 75, 80, and 30, the score of 30 is an outlier.
    • Detection Methods:
      • IQR Rule: An outlier is any value below Q1 – 1.5 × IQR or above Q3 + 1.5 × IQR.
      • Z-Scores: A score that is more than 3 standard deviations away from the mean is typically considered an outlier.
  16. Histogram: A graphical representation of the distribution of numerical data, using bars to show frequency.
    Example: A histogram of test scores might show how many students scored in each score range.
  17. Box Plot (Box-and-Whisker Plot): A graphical display of data that shows the median, quartiles, and outliers.
    Example: A box plot can quickly show the spread and center of student scores, highlighting any outliers.
  18. Frequency Measures
    • Frequency Distribution: A summary of how often each value occurs in a dataset.
      Example: In a survey of pet ownership, you might find that 20 people have dogs, 15 have cats, and 5 have birds.
    • Cumulative Frequency: A running total of frequencies, showing how many observations fall below a certain value.
      Example: If 10 students score below 60 and another 15 score between 60 and 70, the cumulative frequency for scores below 70 is 25 (10 + 15).
    • Relative Frequency: The proportion of the total number of observations that falls within a particular category, expressed as a percentage or fraction.
      Example: If 30 out of 100 students score above 75, the relative frequency is 30/100 = 0.30 or 30%.
  19. Distribution: The overall pattern describing how data values are spread across their range.
    • Normal Distribution: A symmetric, bell-shaped distribution centered on the mean.
    • Standard Normal Distribution: A normal distribution rescaled to have mean 0 and standard deviation 1.
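
To make these measures concrete, here is a minimal Python sketch using NumPy and SciPy; the scores array is invented for illustration. It computes the mean, trimmed mean, median, mode, standard deviation, variance, coefficient of variation, quartiles, skewness, kurtosis, and IQR-rule outliers defined above.

```python
import numpy as np
from scipy import stats

# Invented test scores for ten students
scores = np.array([55, 61, 64, 68, 70, 70, 72, 75, 81, 98])

mean = scores.mean()
trimmed = stats.trim_mean(scores, proportiontocut=0.1)  # drop 10% from each end
median = np.median(scores)
mode = stats.mode(scores, keepdims=False).mode           # SciPy >= 1.9
std = scores.std(ddof=1)                                 # sample standard deviation (s)
variance = scores.var(ddof=1)                            # sample variance (s^2)
cv = std / mean * 100                                    # coefficient of variation, in %
q1, q3 = np.percentile(scores, [25, 75])                 # first and third quartiles
iqr = q3 - q1
skewness = stats.skew(scores)
kurt = stats.kurtosis(scores)                            # excess kurtosis; 0 for a normal curve

# IQR rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] as outliers
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = scores[(scores < low) | (scores > high)]

print(f"mean={mean:.1f}, trimmed mean={trimmed:.1f}, median={median}, mode={mode}")
print(f"s={std:.2f}, s^2={variance:.2f}, CV={cv:.1f}%")
print(f"skewness={skewness:.2f}, kurtosis={kurt:.2f}, outliers={outliers}")
```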

Inferential Statistics

Hypothesis Testing

  1. Null Hypothesis (H₀): The statement that there is no effect or no difference, which we aim to test.
    Example: “There is no difference in test scores between two teaching methods.”
  2. Alternative Hypothesis (H₁): The statement that there is an effect or a difference.
    Example: “There is a difference in test scores between two teaching methods.”
  3. p-value: The probability of obtaining results at least as extreme as the observed results, assuming the null hypothesis is true.
    Example: A p-value of 0.03 means there is a 3% chance of observing the data if the null hypothesis is true.
  4. Confidence Interval: A range of values used to estimate a population parameter, indicating the degree of uncertainty.
    Example: A 95% confidence interval for the average height might be (5.5 feet, 5.9 feet), meaning we are 95% confident the true average height falls within this range.
  5. Confidence Level: The percentage of times the confidence interval would contain the true population parameter if you repeated the experiment many times.
    Example: A 95% confidence level means if we took many samples, 95% of the calculated intervals would contain the true population mean.
  6. Error in Hypothesis
    • Type I Error (False Positive): Rejecting the null hypothesis when it is actually true.
      Example: Concluding that a new drug is effective when it actually is not.
    • Type II Error (False Negative): Failing to reject the null hypothesis when it is actually false.
      Example: Concluding that a new drug is not effective when it actually is.
  7. Power of a Test: The probability of correctly rejecting a false null hypothesis.
    Example: A power of 0.8 means there is an 80% chance of detecting an effect if there is one.
  8. Effect Size: A measure of the strength of a relationship or the magnitude of an effect.
    • Cohen’s d: Measures the difference between two group means in terms of standard deviation.
    • Pearson’s r: Measures the strength and direction of the linear relationship between two variables.
  9. Degrees of Freedom (df): The number of independent values or quantities that can vary in an analysis without violating any given constraints.
    Example: In a one-sample t-test with 10 observations, df = 10 – 1 = 9.
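
The following sketch illustrates several of these ideas with SciPy, assuming two invented samples (method_a and method_b stand in for scores under two teaching methods): a two-sample t-test yields a p-value, stats.t.interval gives a 95% confidence interval, and Cohen's d is computed by hand.

```python
import numpy as np
from scipy import stats

# Invented samples: scores under two teaching methods (30 students each)
rng = np.random.default_rng(0)
method_a = rng.normal(loc=72, scale=8, size=30)
method_b = rng.normal(loc=76, scale=8, size=30)

# H0: the two methods produce the same mean score; H1: they differ.
t_stat, p_value = stats.ttest_ind(method_a, method_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # reject H0 if p < alpha (e.g., 0.05)

# 95% confidence interval for method A's mean score
ci_low, ci_high = stats.t.interval(0.95, df=len(method_a) - 1,
                                   loc=method_a.mean(), scale=stats.sem(method_a))
print(f"95% CI for method A mean: ({ci_low:.2f}, {ci_high:.2f})")

# Cohen's d: difference between the group means in pooled-standard-deviation units
pooled_sd = np.sqrt((method_a.var(ddof=1) + method_b.var(ddof=1)) / 2)
cohens_d = (method_b.mean() - method_a.mean()) / pooled_sd
print(f"Cohen's d = {cohens_d:.2f}")
```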

Statistical Tests

  1. t-test: A test used to compare the means of two groups.
    Example: Comparing the test scores of students taught by two different methods.
  2. z-test: A test used to determine if there is a significant difference between sample and population means when the population variance is known or the sample size is large.
    Example: Testing if a sample mean height of 5.6 feet significantly differs from a population mean of 5.7 feet.
  3. ANOVA (Analysis of Variance): A statistical method used to compare means among three or more groups.
    Example: Testing if students from three different schools have different average test scores.
  4. Chi-Square Test: A test used to determine if there is a significant association between categorical variables.
    Example: Testing if there is a relationship between gender and preference for a type of music.
  5. F-Test: A test used to compare the variances of two populations.
    Example: Testing if two different teaching methods have different levels of variability in test scores.
  6. Levene’s Test: A test to assess the equality of variances for a variable calculated for two or more groups.
    Example: Checking if the variances of test scores among different classes are equal.
  7. Shapiro-Wilk Test: A test used to check if data follows a normal distribution.
    Example: Testing if a set of student scores is normally distributed before applying parametric tests.
  8. Post Hoc Test: Tests conducted after ANOVA to find out which specific group means are different.
    Example: If ANOVA shows significant differences among three groups, a post hoc test will identify which groups differ.
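
Here is a minimal sketch showing how most of these tests are invoked through scipy.stats; all group data and contingency counts are invented for illustration, and Tukey's HSD stands in for the post hoc step.

```python
import numpy as np
from scipy import stats

# Invented test scores from three schools (40 students each)
rng = np.random.default_rng(1)
school1 = rng.normal(70, 10, 40)
school2 = rng.normal(74, 10, 40)
school3 = rng.normal(78, 10, 40)

# t-test: do two group means differ?
print(stats.ttest_ind(school1, school2))

# ANOVA: do the means of three or more groups differ?
print(stats.f_oneway(school1, school2, school3))

# Levene's test: are the group variances equal?
print(stats.levene(school1, school2, school3))

# Shapiro-Wilk: is a sample consistent with a normal distribution?
print(stats.shapiro(school1))

# Chi-square test of independence on a made-up 2x2 contingency table
# (rows: gender, columns: music preference)
table = np.array([[30, 20], [15, 35]])
chi2, p, dof, expected = stats.chi2_contingency(table)
print(f"chi-square = {chi2:.2f}, p = {p:.4f}")

# Post hoc comparison (Tukey's HSD) after a significant ANOVA (SciPy >= 1.8)
print(stats.tukey_hsd(school1, school2, school3))
```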

Regression and Correlation

  1. Regression: A statistical method used to understand the relationship between dependent and independent variables.
    Example: Predicting a student’s final grade based on their study hours using a regression line.
  2. Correlation: A measure of the strength and direction of the relationship between two variables.
    • Positive Correlation: As one variable increases, the other also increases (e.g., more study hours lead to higher grades).
    • Negative Correlation: As one variable increases, the other decreases (e.g., more absences lead to lower grades).
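
A short sketch of both ideas with SciPy, using invented study_hours and grades arrays: scipy.stats.linregress fits a simple regression line, and scipy.stats.pearsonr measures the strength and direction of the linear relationship.

```python
import numpy as np
from scipy import stats

# Invented data: hours studied and final grades for eight students
study_hours = np.array([1, 2, 3, 4, 5, 6, 7, 8])
grades = np.array([52, 55, 61, 64, 70, 74, 79, 83])

# Simple linear regression: grade ~ slope * hours + intercept
result = stats.linregress(study_hours, grades)
print(f"grade = {result.slope:.2f} * hours + {result.intercept:.2f}")

# Pearson's r: near +1 indicates a strong positive correlation
r, p = stats.pearsonr(study_hours, grades)
print(f"r = {r:.3f} (p = {p:.4f})")

# Predict the grade for a student who studies 6.5 hours
print(f"predicted grade: {result.slope * 6.5 + result.intercept:.1f}")
```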
