Data Cleaning and Hypothesis Testing Notes

Population Notation Refresher

Samples and populations use different notation: sample statistics are written with Latin letters, population parameters with Greek letters.
Sample Statistics are used to estimate Population Parameters:
- Mean: $\bar{y}$ (Sample Mean) vs. $\beta$ (Population Mean)
- Median: $\tilde{y}$ (Sample Median) vs. $\tilde{\beta}$ (Population Median)
- Standard Deviation: $s$ (Sample SD) vs. $\tau$ (Population SD)
- Variance: $s^2$ (Sample Variance) vs. $\tau^2$ (Population Variance)
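
As a small illustration of the sample side of this notation, the sketch below computes the four sample statistics with Python's standard `statistics` module. The data values are made up for the example; the variable names simply mirror the symbols above.

```python
import statistics

# Hypothetical sample of five measurements (illustrative values only).
sample = [4.0, 7.0, 5.0, 9.0, 5.0]

y_bar = statistics.mean(sample)          # sample mean, estimates beta
y_tilde = statistics.median(sample)      # sample median, estimates beta-tilde
s = statistics.stdev(sample)             # sample SD (divides by n - 1), estimates tau
s_squared = statistics.variance(sample)  # sample variance (divides by n - 1), estimates tau^2

print(y_bar, y_tilde, s, s_squared)
```

Note that `stdev` and `variance` use the $n - 1$ divisor appropriate for samples; `pstdev` and `pvariance` are the population versions.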

Hypothesis Testing Steps Refresher

Hypotheses: Define the null ( $H_0$ ) and alternative ( $H_1$ ) hypotheses.
Assumptions: Verify the assumptions of the test.
Test Statistic: Calculate the test statistic.
p-value: Determine the p-value from the test statistic distribution.
Decision:
- If p < 0.05, reject $H_0$.
- If p ≥ 0.05, do not reject $H_0$.
Conclusion: Summarize findings in relation to the population.
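
The decision step can be sketched as a small helper. The function name `decide` and its return strings are hypothetical, chosen only for this illustration; the 0.05 default matches the significance level used in these notes.

```python
def decide(p_value, alpha=0.05):
    """Return the test decision at significance level alpha."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie in [0, 1]")
    # Reject only when p is strictly below alpha; otherwise do not reject.
    return "reject H0" if p_value < alpha else "do not reject H0"


print(decide(0.03))
print(decide(0.20))
```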

Significant Results

A significance level of 5% (α = 0.05):
- If p < 0.05: Reject $H_0$ , indicating significant evidence against the null hypothesis.

Non-Significant Results

A significance level of 5% (α = 0.05):
- If p ≥ 0.05: Fail to reject $H_0$, indicating insufficient evidence against the null hypothesis (the result is inconclusive; it does not show that $H_0$ is true).

Data Cleaning

Graphical exploration and numerical summaries (counts, means, standard deviations) are vital for checking that the data are valid.
Problems in data can include:
- Incomplete: Missing values.
- Noisy: Presence of outliers or errors (e.g., Salary = "-10").
- Inconsistent: Discrepancies in data codes or formats (e.g., Age = "42", Birthday = "03/07/1997").
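
A minimal sketch of how these three problem types might be flagged in code. The records, field names, and the `find_problems` helper are all hypothetical; a real dataset would need rules specific to its own variables.

```python
# Hypothetical records; field names are illustrative, not from a real dataset.
records = [
    {"id": 1, "salary": 52000, "age": 42},
    {"id": 2, "salary": None, "age": 35},     # incomplete: missing salary
    {"id": 3, "salary": -10, "age": 29},      # noisy: impossible negative salary
    {"id": 4, "salary": 61000, "age": "42"},  # inconsistent: age stored as text
]


def find_problems(rows):
    """Return (id, description) pairs for values that look problematic."""
    problems = []
    for row in rows:
        for field in ("salary", "age"):
            value = row[field]
            if value is None:
                problems.append((row["id"], f"incomplete: {field} is missing"))
            elif not isinstance(value, (int, float)):
                problems.append((row["id"], f"inconsistent: {field} is not numeric"))
            elif value < 0:
                problems.append((row["id"], f"noisy: {field} is negative"))
    return problems


problems = find_problems(records)
for record_id, description in problems:
    print(record_id, description)
```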

Data Preparation Steps

Choose variables for analysis.
Check data for correctness.
Correct errors and address missing or extreme values.
Format data appropriately (units, data types).
Create new variables from existing data.
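
The preparation steps above can be sketched for a single record as follows. The variables (`height_cm`, `weight_kg`, `bmi`) and the `prepare` helper are assumptions made for the example, and dropping an unusable record is only one possible way to handle it.

```python
def prepare(row):
    """Return a cleaned copy of one record, or None if it cannot be used."""
    # Choose variables and check correctness: both must parse as numbers.
    try:
        height_cm = float(row["height_cm"])
        weight_kg = float(row["weight_kg"])
    except (KeyError, TypeError, ValueError):
        return None  # missing or non-numeric value
    if height_cm <= 0 or weight_kg <= 0:
        return None  # impossible measurement

    # Format: convert units from centimetres to metres.
    height_m = height_cm / 100.0

    # Create a new variable from existing ones.
    return {
        "height_m": height_m,
        "weight_kg": weight_kg,
        "bmi": weight_kg / height_m ** 2,
    }


print(prepare({"height_cm": "180", "weight_kg": 81}))
print(prepare({"height_cm": None, "weight_kg": 70}))
```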

Missing Data Types

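Missing at Random: Missingness does not depend on the unobserved values themselves.
Missing Not at Random: Missingness is systematically related to the unobserved values.
- Examples:
  - Battery dead = Missing at random (the failure has nothing to do with the value that would have been recorded)
  - Broken scale = Missing not at random (for example, a scale that fails above a certain weight loses exactly the heavy readings)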

Outliers

An outlier is a data point that differs markedly from the rest of the data.
Remove an outlier only if it is deemed faulty (e.g., a recording or measurement error); keep it if it represents a valid measurement.
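
These notes do not prescribe a detection rule. As one possible approach, the sketch below flags candidates with the common 1.5 × IQR rule, which is an assumption introduced here rather than part of the notes. Flagged points still need a judgement about whether they are faulty or valid.

```python
import statistics


def flag_outliers(data, k=1.5):
    """Return values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(data, n=4)  # quartiles
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < lower or x > upper]


# Hypothetical measurements with one suspicious value.
measurements = [10, 12, 11, 13, 12, 11, 95]
outliers = flag_outliers(measurements)
print(outliers)
```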

Central Limit Theorem (CLT)

The CLT states that the sampling distribution of the mean of a random sample is approximately normal when the sample is large enough, whatever the shape of the population (provided its variance is finite).
Formula: $\frac{\bar{X} - \beta}{\tau / \sqrt{n}}$ is approximately standard normal for large enough samples.
A sample size of $n \geq 25$ is generally sufficient; if the population is already normal, the result holds for any sample size.
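
A quick simulation can make the CLT concrete. The sketch below assumes a skewed population (exponential with rate 1, so $\beta = 1$ and $\tau = 1$), draws many samples of size 25, and compares the spread of the sample means with $\tau / \sqrt{n}$. The population, seed, and number of repetitions are arbitrary choices for illustration.

```python
import math
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

n = 25              # size of each sample
num_samples = 5000  # number of repeated samples

# Skewed population: exponential with rate 1, so beta = 1 and tau = 1.
beta, tau = 1.0, 1.0

# Mean of each simulated sample.
sample_means = [
    statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
    for _ in range(num_samples)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
theoretical_sd = tau / math.sqrt(n)  # standard error predicted by the CLT

print("mean of sample means:", mean_of_means, "vs beta =", beta)
print("SD of sample means:", sd_of_means, "vs tau/sqrt(n) =", theoretical_sd)
```

A histogram of `sample_means` should look roughly bell-shaped even though the individual observations are strongly skewed.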

One-Sample t-Test

Hypotheses:
- Null ( $H_0$ ): $\beta = \beta_0$
- Alternative ( $H_1$ ): $\beta \neq \beta_0$
Assumptions:
1. Numeric scores.
2. Independent observations.
3. Approximately normally distributed data if n < 25.
Test Statistic: $t = \frac{\bar{y} - \beta_0}{s / \sqrt{n}}$ with $df = n - 1$
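
The one-sample statistic can be computed directly from its definition, as in the sketch below. The helper name `one_sample_t` and the data are hypothetical. The p-value would then come from a t distribution with $n - 1$ degrees of freedom, which the standard library does not provide, so that step is left out here.

```python
import math
import statistics


def one_sample_t(sample, beta_0):
    """Return (t, df) for testing H0: beta = beta_0."""
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two observations")
    y_bar = statistics.mean(sample)
    s = statistics.stdev(sample)  # sample SD (n - 1 divisor)
    if s == 0:
        raise ValueError("sample SD is zero; t is undefined")
    t = (y_bar - beta_0) / (s / math.sqrt(n))
    return t, n - 1


# Hypothetical data, tested against a hypothesised mean of 5.
t_stat, df = one_sample_t([4.0, 7.0, 5.0, 9.0, 5.0], beta_0=5.0)
print(t_stat, df)
```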

Two-Sample t-Test

Hypotheses:
- Null: $H_0: \beta_1 = \beta_2$
- Alternative: $H_1: \beta_1 \neq \beta_2$
Assumptions include independent observations, normal distribution, and equal variances.
Test Statistic: $t = \frac{\bar{y}_1 - \bar{y}_2}{s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$ with $df = n_1 + n_2 - 2$, where $s_p$ is the pooled standard deviation.
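
A sketch of the pooled (equal-variance) statistic follows. The notes do not spell out $s_p$; the code assumes the usual pooled form $s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$. The helper name and data are hypothetical.

```python
import math
import statistics


def pooled_two_sample_t(sample_1, sample_2):
    """Return (t, df) for H0: beta_1 = beta_2, assuming equal variances."""
    n1, n2 = len(sample_1), len(sample_2)
    if n1 < 2 or n2 < 2:
        raise ValueError("each sample needs at least two observations")
    var_1 = statistics.variance(sample_1)
    var_2 = statistics.variance(sample_2)
    df = n1 + n2 - 2
    # Pooled SD: weighted average of the two sample variances.
    s_p = math.sqrt(((n1 - 1) * var_1 + (n2 - 1) * var_2) / df)
    diff = statistics.mean(sample_1) - statistics.mean(sample_2)
    t = diff / (s_p * math.sqrt(1 / n1 + 1 / n2))
    return t, df


# Hypothetical groups.
t_stat, df = pooled_two_sample_t([4.0, 7.0, 5.0, 9.0, 5.0], [2.0, 4.0, 3.0])
print(t_stat, df)
```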

Paired t-Test

Measures differences in one sample across two measurements.
Hypotheses:
- Null: $H_0: \beta_d = 0$
- Alternative: $H_1: \beta_d \neq 0$
Test Statistic: $t = \frac{\bar{y}_d - \beta_d}{s_d / \sqrt{n_d}}$ with $df = n_d - 1$, where $\beta_d = 0$ under $H_0$.
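
Because the paired test is a one-sample test on the differences, it can be sketched by first forming each pair's difference. The helper name `paired_t` and the before/after values are hypothetical.

```python
import math
import statistics


def paired_t(first, second):
    """Return (t, df) for H0: beta_d = 0, with d = second - first."""
    if len(first) != len(second):
        raise ValueError("paired measurements must have equal length")
    diffs = [b - a for a, b in zip(first, second)]
    n_d = len(diffs)
    if n_d < 2:
        raise ValueError("need at least two pairs")
    y_bar_d = statistics.mean(diffs)
    s_d = statistics.stdev(diffs)
    if s_d == 0:
        raise ValueError("differences have zero SD; t is undefined")
    t = (y_bar_d - 0) / (s_d / math.sqrt(n_d))  # beta_d = 0 under H0
    return t, n_d - 1


# Hypothetical measurements on the same four subjects at two times.
before = [10.0, 12.0, 9.0, 11.0]
after = [12.0, 15.0, 10.0, 13.0]
t_stat, df = paired_t(before, after)
print(t_stat, df)
```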

Chi-Square Goodness of Fit

Used for a single categorical variable to compare proportions across categories.
Hypotheses:
- Null ( $H_0$ ): Proportions are as defined (e.g., $p_{\text{Monday}} = 0.2$, and likewise for each weekday).
- Alternative ( $H_1$ ): Not all proportions are as specified in $H_0$.
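Test Statistic: Calculated using:
$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$
where $O_i$ is the observed count and $E_i$ the expected count in category $i$.
Conclusion: Compare the statistic to a chi-square distribution with $k - 1$ degrees of freedom (for $k$ categories) to obtain the p-value and evaluate the null hypothesis.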
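
The statistic can be sketched for the weekday example as follows. The observed counts are made up, the helper name is hypothetical, and each expected count is taken to be the total count times the hypothesised proportion. Only the statistic and degrees of freedom are computed; the p-value would come from a chi-square distribution, which the standard library does not provide.

```python
import math


def chi_square_gof(observed, proportions):
    """Return (chi_square, df) for a goodness-of-fit test."""
    if len(observed) != len(proportions):
        raise ValueError("need one hypothesised proportion per category")
    if not math.isclose(sum(proportions), 1.0):
        raise ValueError("hypothesised proportions must sum to 1")
    total = sum(observed)
    # Expected count in each category under H0.
    expected = [p * total for p in proportions]
    chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return chi_square, len(observed) - 1


# Hypothetical counts for Monday to Friday, with H0: each proportion is 0.2.
observed_counts = [18, 22, 20, 25, 15]
chi_square, df = chi_square_gof(observed_counts, [0.2] * 5)
print(chi_square, df)
```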