Data Cleaning and Hypothesis Testing Notes
Population Notation Refresher
- When discussing samples, we use sample notation.
- For populations, we use population notation.
- Sample Statistics are used to estimate Population Parameters:
- Mean: $\bar{y}$ (Sample Mean) vs. $\beta$ (Population Mean)
- Median: $\tilde{y}$ (Sample Median) vs. $\tilde{\beta}$ (Population Median)
- Standard Deviation: $s$ (Sample SD) vs. $\tau$ (Population SD)
- Variance: $s^2$ (Sample Variance) vs. $\tau^2$ (Population Variance)
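As a small illustration of the sample side of this notation, the sketch below computes the four sample statistics with Python's standard `statistics` module. The data values are made up for the example. The population parameters are unknown in practice, which is why these estimates are needed.

```python
import statistics

# Hypothetical sample of six measurements (illustrative values only).
sample = [4, 8, 6, 5, 3, 7]

y_bar = statistics.mean(sample)       # sample mean, estimates the population mean
y_tilde = statistics.median(sample)   # sample median, estimates the population median
s = statistics.stdev(sample)          # sample SD (divides by n - 1)
s2 = statistics.variance(sample)      # sample variance (divides by n - 1)

print(y_bar, y_tilde, s, s2)
```

Note that `statistics.stdev` and `statistics.variance` are the sample versions; the module's `pstdev` and `pvariance` divide by n and apply only when the data are the whole population.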
Hypothesis Testing Steps Refresher
- Hypotheses: Define the null ($H_0$) and alternative ($H_1$) hypotheses.
- Assumptions: Verify the assumptions of the test.
- Test Statistic: Calculate the test statistic.
- p-value: Determine the p-value from the test statistic distribution.
- Decision:
- If $p < 0.05$, reject $H_0$.
- If $p \not< 0.05$ (that is, $p \geq 0.05$), do not reject $H_0$.
- Conclusion: Summarize findings in relation to the population.
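The decision step can be sketched as a tiny function. The name `decide` and its return strings are illustrative choices, not part of any library; the default threshold of 0.05 follows these notes.

```python
def decide(p_value, alpha=0.05):
    """Return the test decision for a p-value at significance level alpha."""
    if p_value < alpha:
        return "reject H0"
    # p is not below alpha (this includes p exactly equal to alpha)
    return "do not reject H0"


print(decide(0.03))
print(decide(0.20))
```

A p-value exactly equal to the threshold falls in the "do not reject" branch, matching the rule as stated in the steps.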
Significant Results
- A significance level of 5% (α = 0.05):
- If p < 0.05: Reject H0, indicating significant evidence against the null hypothesis.
Non-Significant Results
- A significance level of 5% (α = 0.05):
- If p ≥ 0.05: Fail to reject H0. The evidence against the null hypothesis is insufficient, which does not show that H0 is true.
Data Cleaning
- Graphical exploration and numerical summaries (counts, means, standard deviations) are vital for checking that the data are valid.
- Problems in data can include:
- Incomplete: Missing values.
- Noisy: Presence of outliers or errors (e.g., Salary = "-10").
- Inconsistent: Discrepancies in data codes or formats (e.g., Age = "42", Birthday = "03/07/1997").
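The three problem types can be checked mechanically. The following is a minimal sketch on made-up records; the field names, the reference year of 2024, and the one-year tolerance are all assumptions chosen for the illustration.

```python
REFERENCE_YEAR = 2024  # assumed "current" year for the age check

# Hypothetical records, one for each problem type.
records = [
    {"id": 1, "salary": 52000, "age": 42, "birthday": "03/07/1997"},
    {"id": 2, "salary": -10, "age": 30, "birthday": "15/05/1994"},
    {"id": 3, "salary": None, "age": 25, "birthday": "01/01/1999"},
]


def find_problems(record):
    """Return a list of problem labels found in one record."""
    problems = []
    # Incomplete: any field has no value.
    if any(value is None for value in record.values()):
        problems.append("incomplete")
    # Noisy: a salary cannot be negative.
    salary = record.get("salary")
    if salary is not None and salary < 0:
        problems.append("noisy")
    # Inconsistent: stated age disagrees with the birth year.
    age, birthday = record.get("age"), record.get("birthday")
    if age is not None and birthday is not None:
        birth_year = int(birthday.split("/")[-1])
        if abs((REFERENCE_YEAR - birth_year) - age) > 1:
            problems.append("inconsistent")
    return problems


report = {record["id"]: find_problems(record) for record in records}
print(report)
```

Only the year of the birthday is used, so the check does not depend on whether the date is written day-first or month-first.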
Data Preparation Steps
- Choose variables for analysis.
- Check data for correctness.
- Correct errors and address missing or extreme values.
- Format data appropriately (units, data types).
- Create new variables from existing data.
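The last two preparation steps (formatting and creating new variables) might look like the sketch below. The rows, the column names, and the choice of body mass index as the derived variable are hypothetical and serve only as an illustration.

```python
# Hypothetical raw rows, with numbers stored as text (as after a spreadsheet export).
raw_rows = [
    {"height_cm": "170", "weight_kg": "65.0"},
    {"height_cm": "182", "weight_kg": "81.5"},
]

prepared = []
for row in raw_rows:
    height_m = float(row["height_cm"]) / 100  # fix the data type, convert cm to m
    weight_kg = float(row["weight_kg"])       # fix the data type
    bmi = weight_kg / height_m ** 2           # new variable derived from existing ones
    prepared.append({"height_m": height_m, "weight_kg": weight_kg, "bmi": bmi})

for row in prepared:
    print(row)
```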
Missing Data Types
- Missing at Random: Missingness occurs by chance and is unrelated to the values that would have been recorded.
- Missing Not at Random: Missingness is systematically related to the unobserved values themselves.
- Examples:
- Battery dead = Missing at random
- Broken scale = Missing not at random
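Why the distinction matters can be seen in a small simulation. This is a sketch under invented assumptions: the weights are drawn from a normal distribution with mean 70 and SD 10, random missingness drops each value with probability 0.3, and the "broken scale" is modeled as failing for every weight above 80.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical complete data: 1000 weights.
values = [random.gauss(70, 10) for _ in range(1000)]

# Missing at random: each value is lost with probability 0.3, whatever its size.
observed_random = [v for v in values if random.random() > 0.3]

# Missing not at random: every value above 80 is lost.
observed_systematic = [v for v in values if v <= 80]

full_mean = statistics.mean(values)
random_mean = statistics.mean(observed_random)
systematic_mean = statistics.mean(observed_systematic)

print(full_mean, random_mean, systematic_mean)
```

The mean of the randomly thinned data stays close to the full mean, while the mean under systematic missingness is pulled downward because only large values were lost.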
Outliers
- An outlier is a data point that differs markedly from the rest of the data.
- Remove an outlier if it is deemed faulty (e.g., a recording error); keep it if it represents a valid measurement.
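One common way to flag candidate outliers for review is the 1.5 × IQR rule. The notes do not prescribe a detection rule, so this choice is an assumption, and the data are made up. Flagging is only the first step: whether to remove a flagged point still depends on whether it is faulty or valid.

```python
import statistics

# Hypothetical measurements with one suspicious value.
data = [52, 55, 49, 51, 53, 50, 54, 120]

# Quartiles from the standard library (default "exclusive" method).
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points outside the fences are flagged for inspection, not deleted automatically.
outliers = [x for x in data if x < lower_fence or x > upper_fence]
print(outliers)
```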
Central Limit Theorem (CLT)
- The CLT states that the mean of a random sample has an approximately normal sampling distribution.
- Formula: $\frac{\bar{X} - \beta}{\tau / \sqrt{n}}$ is approximately standard normal for large enough samples.
- A sample size of at least 25 ($n \geq 25$) is generally sufficient; if the population is already normal, the result holds for any $n$.
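The CLT can be checked empirically with a short simulation. The sketch below assumes an exponential population with rate 1 (mean 1, SD 1, clearly skewed) and draws 2000 samples of size 25; both numbers are arbitrary choices for the illustration.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

n = 25             # size of each sample
replicates = 2000  # number of samples drawn

# Draw many samples from the skewed population and record each sample mean.
sample_means = [
    statistics.mean([random.expovariate(1.0) for _ in range(n)])
    for _ in range(replicates)
]

mean_of_means = statistics.mean(sample_means)
sd_of_means = statistics.stdev(sample_means)
theoretical_sd = 1.0 / math.sqrt(n)  # population SD divided by sqrt(n)

print(mean_of_means, sd_of_means, theoretical_sd)
```

The sample means should center near the population mean of 1 with a spread near 1/√25 = 0.2, even though the individual observations are far from normal.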
One-Sample t-Test
- Hypotheses:
- Null ($H_0$): $\beta = \beta_0$
- Alternative ($H_1$): $\beta \neq \beta_0$
- Assumptions:
- Numeric scores.
- Independent observations.
- Approximately normally distributed data if n < 25.
- Test Statistic: $t = \frac{\bar{y} - \beta_0}{s / \sqrt{n}}$ with $df = n - 1$
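A minimal sketch of the one-sample test statistic, using only the standard library. The data and the hypothesized mean of 4 are invented for the example. Turning the statistic into a p-value requires the t distribution with n − 1 degrees of freedom, which the standard library does not provide; a table or a statistics package would be needed for that step.

```python
import math
import statistics

# Hypothetical sample and hypothesized population mean.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
hypothesized_mean = 4

n = len(sample)
y_bar = statistics.mean(sample)   # sample mean
s = statistics.stdev(sample)      # sample SD

standard_error = s / math.sqrt(n)
t_stat = (y_bar - hypothesized_mean) / standard_error
df = n - 1

print(t_stat, df)
```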
Two-Sample t-Test
- Hypotheses:
- Null: $H_0: \beta_1 = \beta_2$
- Alternative: $H_1: \beta_1 \neq \beta_2$
- Assumptions include independent observations, normal distribution, and equal variances.
- Test Statistic: $t = \frac{\bar{y}_1 - \bar{y}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$ with $df = n_1 + n_2 - 2$, where $s_p$ is the pooled standard deviation
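The pooled two-sample statistic can be sketched as follows. The two groups are invented for the example, and the pooled standard deviation is computed with the usual weighted average of the two sample variances, which relies on the equal-variance assumption.

```python
import math
import statistics

# Hypothetical measurements from two independent groups.
group_1 = [1, 2, 3, 4, 5]
group_2 = [2, 4, 6, 8, 10]

n1, n2 = len(group_1), len(group_2)
mean_1, mean_2 = statistics.mean(group_1), statistics.mean(group_2)
var_1, var_2 = statistics.variance(group_1), statistics.variance(group_2)

# Pooled SD: sample variances weighted by their degrees of freedom.
df = n1 + n2 - 2
s_p = math.sqrt(((n1 - 1) * var_1 + (n2 - 1) * var_2) / df)

t_stat = (mean_1 - mean_2) / (s_p * math.sqrt(1 / n1 + 1 / n2))

print(t_stat, df)
```

If SciPy is available, `scipy.stats.ttest_ind` with its default `equal_var=True` computes the same pooled statistic together with a p-value.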
Paired t-Test
- Measures differences in one sample across two measurements.
- Hypotheses:
- Null: $H_0: \beta_d = 0$
- Alternative: $H_1: \beta_d \neq 0$
- Test Statistic: $t = \frac{\bar{y}_d - \beta_d}{s_d / \sqrt{n_d}}$, where $\beta_d = 0$ under $H_0$
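The paired test reduces to a one-sample test on the differences, as this sketch shows. The before and after values are made up, and the difference is taken as after minus before, which is a convention chosen for the example.

```python
import math
import statistics

# Hypothetical paired measurements on the same five subjects.
before = [10, 12, 9, 11, 13]
after = [12, 14, 10, 14, 13]

# One difference per subject (after minus before).
differences = [a - b for a, b in zip(after, before)]

n_d = len(differences)
d_bar = statistics.mean(differences)   # mean difference
s_d = statistics.stdev(differences)    # SD of the differences

# Under the null hypothesis the population mean difference is 0.
t_stat = (d_bar - 0) / (s_d / math.sqrt(n_d))
df = n_d - 1

print(differences, t_stat, df)
```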
Chi-Square Goodness of Fit
- Used for a single categorical variable to compare proportions across categories.
- Hypotheses:
- Null ($H_0$): Proportions are as specified (e.g., $p_{\text{Monday}} = 0.2$, and likewise for every other weekday).
- Alternative ($H_1$): Not all proportions are as specified in $H_0$.
- Test Statistic: $\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$, where $O_i$ and $E_i$ are the observed and expected counts in category $i$.
- Conclusion: Based on the p-value, which is evaluated using the test statistic and its degrees of freedom.
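The weekday example can be worked through in a few lines. The observed counts below are invented, and the degrees of freedom are taken as the number of categories minus one, which is the usual choice when all proportions are fully specified in the null hypothesis. Instead of a p-value, the sketch compares the statistic with 9.488, the tabulated 5% critical value of the chi-square distribution with 4 degrees of freedom.

```python
# Hypothetical counts of events on Monday through Friday.
observed = [30, 14, 34, 45, 27]
null_proportions = [0.2, 0.2, 0.2, 0.2, 0.2]

total = sum(observed)
expected = [p * total for p in null_proportions]  # expected count per category

# Sum of (O - E)^2 / E over all categories.
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1

CRITICAL_VALUE_5_PERCENT = 9.488  # chi-square table, df = 4, alpha = 0.05
reject_null = chi_square > CRITICAL_VALUE_5_PERCENT

print(chi_square, df, reject_null)
```

If SciPy is available, `scipy.stats.chisquare` returns the same statistic along with an exact p-value.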