Data Cleaning and Hypothesis Testing Notes

Population Notation Refresher

  • When discussing samples, we use sample notation.
  • For populations, we use population notation.
  • Sample Statistics are used to estimate Population Parameters:
    • Mean: $\bar{y}$ (Sample Mean) vs. $\beta$ (Population Mean; many texts write $\mu$)
    • Median: $\tilde{y}$ (Sample Median) vs. $\tilde{\beta}$ (Population Median)
    • Standard Deviation: $s$ (Sample SD) vs. $\tau$ (Population SD; many texts write $\sigma$)
    • Variance: $s^2$ (Sample Variance) vs. $\tau^2$ (Population Variance)
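As a small illustration of the sample statistics listed above, the sketch below computes them with Python's standard `statistics` module. The data values and variable names are made up for illustration and are not from these notes.

```python
import statistics

# Hypothetical sample of five measurements (illustrative values only).
sample = [4.0, 7.0, 5.0, 9.0, 5.0]

y_bar = statistics.mean(sample)          # sample mean
y_tilde = statistics.median(sample)      # sample median
s = statistics.stdev(sample)             # sample SD (divides by n - 1)
s_squared = statistics.variance(sample)  # sample variance (divides by n - 1)

print(y_bar, y_tilde, s, s_squared)
```

The population parameters are usually unknown, which is why these sample values serve as estimates. If the list really were the whole population, `statistics.pstdev` and `statistics.pvariance` (which divide by n) would be the matching calls.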

Hypothesis Testing Steps Refresher

  1. Hypotheses: Define the null ($H_0$) and alternative ($H_1$) hypotheses.
  2. Assumptions: Verify the assumptions of the test.
  3. Test Statistic: Calculate the test statistic.
  4. p-value: Determine the p-value from the test statistic distribution.
  5. Decision:
    • If p < 0.05, reject $H_0$.
    • If p ≥ 0.05, do not reject $H_0$.
  6. Conclusion: Summarize findings in relation to the population.

Significant Results

  • A significance level of 5% (α = 0.05):
    • If p < 0.05: Reject $H_0$, indicating significant evidence against the null hypothesis.

Non-Significant Results

  • A significance level of 5% (α = 0.05):
    • If p ≥ 0.05: Fail to reject $H_0$. The evidence against the null hypothesis is insufficient; this does not show that $H_0$ is true.
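The decision rule can be written as a tiny helper. This is only a sketch: the function name `decide` and its return strings are hypothetical, and the default `alpha=0.05` follows the 5% significance level used in these notes.

```python
def decide(p_value, alpha=0.05):
    """Return the test decision for a p-value at significance level alpha."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must be between 0 and 1")
    # Reject only when p is strictly below alpha; otherwise fail to reject.
    return "reject H0" if p_value < alpha else "do not reject H0"


print(decide(0.03))  # below alpha
print(decide(0.20))  # not below alpha
```

Keeping `alpha` as a parameter makes the cut-off explicit, so the same helper works if a different significance level is chosen before the data are analysed.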

Data Cleaning

  • Graphical exploration and numerical summaries (counts, means, standard deviations) are vital for checking that data are valid before analysis.
  • Problems in data can include:
    • Incomplete: Missing values.
    • Noisy: Presence of outliers or errors (e.g., Salary = "-10").
    • Inconsistent: Discrepancies in data codes or formats (e.g., Age = "42", Birthday = "03/07/1997").

Data Preparation Steps

  1. Choose variables for analysis.
  2. Check data for correctness.
  3. Correct errors and address missing or extreme values.
  4. Format data appropriately (units, data types).
  5. Create new variables from existing data.
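Steps 2 to 4 might look like the sketch below, which uses plain Python on a few made-up records. The field names, the values, and the choice to treat an impossible salary as missing are all assumptions for illustration; a real analysis may call for a different rule.

```python
# Hypothetical raw records; None marks a missing value.
raw = [
    {"id": 1, "salary": "52000"},
    {"id": 2, "salary": "-10"},   # impossible value (noisy)
    {"id": 3, "salary": None},    # missing value (incomplete)
    {"id": 4, "salary": "61000"},
]


def clean_salary(value):
    """Convert a raw salary to float; return None if missing or invalid."""
    if value is None:
        return None
    try:
        number = float(value)  # step 4: store as a numeric type
    except ValueError:
        return None            # unparseable text is treated as missing
    # Step 3: a negative salary is an error, so flag it as missing here.
    return number if number >= 0 else None


cleaned = [{"id": r["id"], "salary": clean_salary(r["salary"])} for r in raw]
n_missing = sum(1 for r in cleaned if r["salary"] is None)
valid = [r["salary"] for r in cleaned if r["salary"] is not None]

print(n_missing, valid)
```

Counting the values that end up missing is worth reporting alongside any results, since it shows how much of the data the cleaning rules discarded.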
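
Missing Data Types

  1. Missing at Random: Missingness occurs by chance and is unrelated to the value that would have been recorded.
  2. Missing Not at Random: Missingness is systematically related to the unobserved values themselves.
    • Examples:
      • Battery dead = Missing at random (the failure has nothing to do with the value being measured)
      • Broken scale that fails to record certain weights (e.g., heavy items) = Missing not at random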

Outliers

  • An outlier is a data point that differs markedly from the rest of the data.
  • Remove an outlier only if it is judged to be faulty (e.g., a recording error); keep it if it is a valid measurement.
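These notes do not prescribe a rule for spotting outliers. One common convention, used here purely as an assumption, flags points more than 1.5 interquartile ranges outside the quartiles. The sketch only flags candidates; whether to remove them still depends on whether they are faulty.

```python
import statistics

# Hypothetical measurements with one suspicious value.
data = [10, 12, 11, 13, 12, 11, 95]

# Quartiles from the standard library (default "exclusive" method).
q1, _, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Assumed convention: fences at 1.5 * IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in data if x < lower_fence or x > upper_fence]
print(outliers)
```

A flagged value such as 95 would then be checked against the data source before any decision to drop or correct it.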

Central Limit Theorem (CLT)

  • The CLT states that the mean of a random sample has an approximately normal sampling distribution when the sample is large enough.
  • Formula: $\frac{\bar{X} - \beta}{\tau / \sqrt{n}}$ is approximately standard normal for large enough samples.
  • A sample size of $n \geq 25$ is generally sufficient; if the population is already normal, the result holds for any $n$.
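A quick simulation can make the CLT concrete. The sketch below assumes a skewed population (exponential with mean 1 and SD 1, chosen only for illustration) and checks that means of samples of size 25 centre on the population mean with spread close to the population SD divided by $\sqrt{n}$.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable


def sample_mean(n):
    """Mean of n draws from an exponential population (mean 1, SD 1)."""
    return statistics.mean([random.expovariate(1.0) for _ in range(n)])


n = 25
means = [sample_mean(n) for _ in range(5000)]

center = statistics.mean(means)   # expected to be near the population mean, 1
spread = statistics.stdev(means)  # expected to be near 1 / sqrt(25) = 0.2

print(round(center, 3), round(spread, 3))
```

A histogram of `means` would look roughly bell-shaped even though the individual draws are strongly right-skewed.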

One-Sample t-Test

  • Hypotheses:
    • Null ($H_0$): $\beta = \beta_0$
    • Alternative ($H_1$): $\beta \neq \beta_0$
  • Assumptions:
    1. Numeric scores.
    2. Independent observations.
    3. Approximately normally distributed data if n < 25.
  • Test Statistic: $t = \frac{\bar{y} - \beta_0}{s / \sqrt{n}}$ with $df = n - 1$
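The one-sample statistic divides the gap between the sample mean and the hypothesized mean by the standard error, $s / \sqrt{n}$. The sketch below computes it with the standard library only. The function name and data are hypothetical, and the p-value is left out because it needs a t distribution, which a statistics library (for example SciPy's `scipy.stats.ttest_1samp`) would supply.

```python
import math
import statistics


def one_sample_t(sample, beta_0):
    """Return (t, df) for H0: population mean equals beta_0."""
    n = len(sample)
    y_bar = statistics.mean(sample)
    s = statistics.stdev(sample)        # sample SD (n - 1 denominator)
    standard_error = s / math.sqrt(n)
    t = (y_bar - beta_0) / standard_error
    return t, n - 1


# Hypothetical data: mean 6, SD 2, n = 5, tested against beta_0 = 5.
t, df = one_sample_t([4.0, 7.0, 5.0, 9.0, 5.0], beta_0=5.0)
print(f"t = {t:.3f}, df = {df}")  # t = 1.118, df = 4
```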

Two-Sample t-Test

  • Hypotheses:
    • Null: $H_0: \beta_1 = \beta_2$
    • Alternative: $H_1: \beta_1 \neq \beta_2$
  • Assumptions include independent observations, normal distribution, and equal variances.
  • Test Statistic: $t = \frac{\bar{y}_1 - \bar{y}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$ with $df = n_1 + n_2 - 2$, where $s_p$ is the pooled standard deviation, $s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$
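For the equal-variance two-sample test, the two sample variances are pooled (weighted by their degrees of freedom) before forming the standard error of the difference in means. A minimal sketch follows, with a hypothetical function name and made-up data.

```python
import math
import statistics


def pooled_t(sample_1, sample_2):
    """Return (t, df) for H0: the two population means are equal."""
    n1, n2 = len(sample_1), len(sample_2)
    var_1 = statistics.variance(sample_1)
    var_2 = statistics.variance(sample_2)
    df = n1 + n2 - 2
    # Pooled SD: variances weighted by their degrees of freedom.
    s_p = math.sqrt(((n1 - 1) * var_1 + (n2 - 1) * var_2) / df)
    standard_error = s_p * math.sqrt(1 / n1 + 1 / n2)
    t = (statistics.mean(sample_1) - statistics.mean(sample_2)) / standard_error
    return t, df


# Hypothetical groups: means 7 and 4, both with variance 4.
t, df = pooled_t([5.0, 7.0, 9.0], [2.0, 4.0, 6.0])
print(f"t = {t:.3f}, df = {df}")  # t = 1.837, df = 4
```

This version relies on the equal-variances assumption listed above; when that assumption is doubtful, a different (unpooled) form of the test is normally used instead.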

Paired t-Test

  • Tests the mean difference between two measurements taken on the same subjects (one sample, measured twice).
  • Hypotheses:
    • Null: $H_0: \beta_d = 0$
    • Alternative: $H_1: \beta_d \neq 0$
  • Test Statistic: $t = \frac{\bar{y}_d - 0}{s_d / \sqrt{n_d}}$ with $df = n_d - 1$, where 0 is the value of $\beta_d$ under $H_0$
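The paired test reduces to a one-sample test on the per-subject differences. The sketch below makes that explicit; the function name and the before/after values are hypothetical.

```python
import math
import statistics


def paired_t(before, after):
    """Return (t, df) for H0: the mean paired difference is 0."""
    if len(before) != len(after):
        raise ValueError("paired samples must have the same length")
    # One difference per subject: after minus before.
    diffs = [a - b for b, a in zip(before, after)]
    n_d = len(diffs)
    y_bar_d = statistics.mean(diffs)
    s_d = statistics.stdev(diffs)
    t = (y_bar_d - 0) / (s_d / math.sqrt(n_d))
    return t, n_d - 1


# Hypothetical measurements on four subjects; differences are 1, 2, 3, 2.
t, df = paired_t([10.0, 12.0, 9.0, 11.0], [11.0, 14.0, 12.0, 13.0])
print(f"t = {t:.3f}, df = {df}")  # t = 4.899, df = 3
```

Working on the differences removes subject-to-subject variation, which is why the pairing must be preserved when the data are cleaned.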

Chi-Square Goodness of Fit

  • Used for a single categorical variable to compare the observed proportions across categories with hypothesized proportions.
  • Hypotheses:
    • Null ($H_0$): Proportions are as specified (e.g., $p_{\text{Monday}} = 0.2$, and likewise for each of the other weekdays).
    • Alternative ($H_1$): Not all proportions are as specified in $H_0$.
  • Test Statistic: Calculated using:
    $\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$, where $O_i$ is the observed count and $E_i$ the expected count under $H_0$ for category $i$.
  • Conclusion: The p-value comes from the chi-square distribution with $k - 1$ degrees of freedom ($k$ = number of categories) and is used to decide whether to reject the null hypothesis.
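The goodness-of-fit statistic can be computed directly from the observed counts and the proportions stated in the null hypothesis. The sketch below uses made-up weekday counts with a hypothesized proportion of 0.2 per day; the function name is hypothetical.

```python
import math


def chi_square_gof(observed, proportions):
    """Return (chi_square, df) for observed counts vs. H0 proportions."""
    if len(observed) != len(proportions):
        raise ValueError("need one hypothesized proportion per category")
    if not math.isclose(sum(proportions), 1.0):
        raise ValueError("hypothesized proportions must sum to 1")
    n = sum(observed)
    expected = [n * p for p in proportions]  # E_i = n * p_i under H0
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, len(observed) - 1


# Hypothetical counts for Monday to Friday (100 observations in total).
observed = [30, 14, 22, 18, 16]
stat, df = chi_square_gof(observed, [0.2] * 5)
print(f"chi-square = {stat:.1f}, df = {df}")  # chi-square = 8.0, df = 4
```

For these made-up counts the statistic, 8.0, is below 9.488, the 5% critical value of the chi-square distribution with 4 degrees of freedom, so $H_0$ would not be rejected at the 5% level.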