Data Cleaning and Hypothesis Testing Notes
Sample and Population Notation
When analyzing data, it's crucial to distinguish between sample and population notation:
Sample Notation: Used when referring to a subset of the population.
Population Notation: Used when referring to the entire group of interest.
Sample statistics are used to estimate population parameters.
| Statistic | Sample Notation | Population Notation |
|---|---|---|
| Mean | $\bar{x}$ | $\mu$ |
| Median | | |
| Standard Deviation | $s$ | $\sigma$ |
| Variance | $s^2$ | $\sigma^2$ |
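As a small illustration of the sample statistics in the table, the sketch below uses Python's standard `statistics` module. The data values are made up for the example. Note that `stdev` and `variance` divide by n - 1 (sample versions), while `pstdev` and `pvariance` divide by n (population versions).

```python
import statistics

# Hypothetical sample of five measurements (illustrative values only).
sample = [4.0, 7.0, 8.0, 9.0, 12.0]

x_bar = statistics.mean(sample)          # sample mean
median = statistics.median(sample)       # sample median
s = statistics.stdev(sample)             # sample standard deviation (divides by n - 1)
s_squared = statistics.variance(sample)  # sample variance (divides by n - 1)

# For contrast: treating the same values as a whole population divides by n.
sigma = statistics.pstdev(sample)
sigma_squared = statistics.pvariance(sample)

print(f"mean={x_bar}, median={median}, s={s:.3f}, s^2={s_squared}")
print(f"sigma={sigma:.3f}, sigma^2={sigma_squared}")
```

The sample versions are the ones used when a statistic is meant to estimate the corresponding population parameter.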
Hypothesis Testing Steps
(H) Hypotheses: State the null hypothesis ($H_0$) and the alternative hypothesis ($H_1$) in terms of the population parameter of interest.
(A) Assumptions: Verify that the underlying assumptions of the statistical test are met.
(T) Test Statistic: Calculate the appropriate test statistic.
(P) p-value: Determine the p-value from the distribution of the test statistic.
(D) Decision: Compare the p-value to the significance level ($\alpha$). If the p-value is less than $\alpha$ (typically 0.05), reject the null hypothesis. Otherwise, do not reject the null hypothesis.
(C) Conclusion: Provide a conclusion in the context of the original research question, referring back to the target population.
Significant Result
Using a significance level of 5% ($\alpha = 0.05$):
If p-value < 0.05, reject $H_0$. This indicates there is statistically significant evidence against the null hypothesis.
Non-Significant Result
Using a significance level of 5% ($\alpha = 0.05$):
If p-value ≥ 0.05, fail to reject $H_0$. This indicates there is no statistically significant evidence against the null hypothesis, so the result is inconclusive.
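The decision rule for the two cases can be sketched as a small Python function. The name `decide` and the returned strings are illustrative choices, not something defined in these notes.

```python
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Return the test decision for a p-value at significance level alpha."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("a p-value must lie between 0 and 1")
    if p_value < alpha:
        return "reject H0"
    # Not rejecting is not the same as accepting H0: the result is inconclusive.
    return "fail to reject H0"


print(decide(0.03))
print(decide(0.20))
```

A p-value exactly equal to the significance level falls in the "fail to reject" branch, matching the rule "reject only if the p-value is less than the significance level".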
Data Cleaning
Graphical explorations of variables are helpful.
Numerical summaries (counts, means, standard deviations) enhance understanding.
Valid data is essential for business insights and informed decision-making.
Types of Problems:
Incomplete: Missing values in variables.
Noisy: Errors or outliers in the data (e.g., Salary=“-10”).
Inconsistent: Discrepancies in codes or names (e.g., Age=“42”, Birthday=“03/07/1997”).
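The three problem types can be checked programmatically. The sketch below is one possible approach under stated assumptions: the record layout, the `AS_OF` reference date, the helper names, and the reading of 03/07/1997 as 3 July 1997 are all hypothetical choices made for illustration.

```python
from datetime import date

# Hypothetical records; None marks a missing value.
records = [
    {"id": 1, "salary": 52000, "age": 42, "birthday": date(1997, 7, 3)},
    {"id": 2, "salary": -10, "age": 30, "birthday": date(1994, 1, 15)},
    {"id": 3, "salary": None, "age": 27, "birthday": date(1997, 5, 20)},
]

# Assumed date on which the ages were recorded.
AS_OF = date(2024, 6, 1)


def age_on(birthday: date, as_of: date) -> int:
    """Whole years between birthday and as_of."""
    had_birthday = (as_of.month, as_of.day) >= (birthday.month, birthday.day)
    return as_of.year - birthday.year - (0 if had_birthday else 1)


def find_problems(record: dict) -> list:
    """Label a record as incomplete, noisy, and/or inconsistent."""
    problems = []
    if any(value is None for value in record.values()):
        problems.append("incomplete")    # missing value
    if record["salary"] is not None and record["salary"] < 0:
        problems.append("noisy")         # impossible value
    if record["age"] is not None and record["birthday"] is not None:
        if record["age"] != age_on(record["birthday"], AS_OF):
            problems.append("inconsistent")  # age contradicts birthday
    return problems


report = {r["id"]: find_problems(r) for r in records}
print(report)
```

The checks only label records. Whether to correct, impute, or drop a flagged record remains a separate decision.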
How to Prepare the Data
Choosing variables for analysis.
Checking data for correctness.
Correcting errors, handling missing values, and addressing extreme values.
Deciding data format (unit of measurement, data type, number of categories).
Creating new variables from existing ones.
Missing Data
Missing at Random: The reason for missing data is random and not related to the unobserved data.
Missing Not at Random: The missing data is systematically related to unobserved factors or events not measured by the researcher.
Examples
Missing at Random: A weighing scale that ran out of batteries, or one that only works some of the time when placed on carpet.
Missing Not at Random: A broken weighing scale.
Outliers
An outlier is an observation that lies an abnormal distance from other values. Outliers should only be removed if there is a valid reason to believe the data is faulty. Keep outliers if they represent reliable measurements.
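The notes do not fix a numerical definition of "abnormal distance". One common convention, assumed here purely for illustration, is the 1.5 × IQR rule: flag values more than 1.5 interquartile ranges below the first quartile or above the third. The function name and the data are hypothetical.

```python
import statistics


def flag_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical body weights in kg; 640 looks like a data-entry error for 64.0.
weights_kg = [61, 64, 66, 68, 70, 71, 73, 75, 640]
print(flag_outliers(weights_kg))
```

The function only flags candidates. In line with the advice above, a flagged value should be removed only when there is a valid reason to believe it is faulty.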
Central Limit Theorem (CLT)
The CLT is useful when dealing with non-normally distributed variables. According to the Central Limit Theorem, the sampling distribution of the sample mean $\bar{x}$ will be approximately normal.
Formal Definition
The mean of a random sample has a sampling distribution whose shape can be approximated by a Normal distribution. The larger the sample, the better the approximation will be.
Sample Size
If the original population is approximately normal, a sample size of at least 25 is generally considered large enough to assume an approximate Normal distribution for sample means.
If the original population has a Normal distribution, then sample means will follow an exact Normal distribution, regardless of sample size.
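A quick simulation can make the CLT concrete. The sketch below draws many samples of size 25 from a skewed population (an exponential distribution with mean 1 and standard deviation 1, chosen only for illustration) and checks that the sample means centre on the population mean with spread close to $\sigma / \sqrt{n}$.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

n = 25              # size of each sample
num_samples = 5000  # number of samples drawn

# Skewed population: exponential with mean 1 and standard deviation 1.
sample_means = [
    statistics.mean([random.expovariate(1.0) for _ in range(n)])
    for _ in range(num_samples)
]

mean_of_means = statistics.mean(sample_means)
sd_of_means = statistics.stdev(sample_means)

print(f"mean of sample means: {mean_of_means:.3f} (population mean is 1)")
print(f"sd of sample means:   {sd_of_means:.3f} "
      f"(sigma / sqrt(n) = {1 / math.sqrt(n):.3f})")
```

Plotting a histogram of `sample_means` would show a roughly bell-shaped curve even though the individual observations are strongly right-skewed.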
One-Sample t-Test
A one-sample t-test is used when comparing the mean of a single sample to a hypothesized population mean.
Null Hypothesis ($H_0$): $\mu = \mu_0$ (There is no difference between the population mean and the hypothesized mean.)
Alternative Hypothesis ($H_1$): $\mu \neq \mu_0$ (There is a difference between the population mean and the hypothesized mean.)
Assumptions
Scores are numeric
Observations are independent
Observations are approximately normally distributed (not too skewed)
If the original population is not too far from Normal, an n of at least 25 will be ‘large enough’ to assume an approximate Normal distribution for sample means (i.e. the CLT will apply).
If the original population is a Normal distribution, then sample means will follow an exact Normal distribution, regardless of sample size (n) (i.e. we don’t need the CLT).
Hypothesis Testing Steps
(H) Hypotheses: State the null and the alternative hypotheses in terms of the parameter of interest.
(A) Assumptions: Check the underlying assumptions of the test.
(T) Test Statistic: Calculate the test statistic.
(P) p-value: Obtain the p-value for the test from the distribution of the test statistic.
(D) Decision: If the p-value is less than 0.05 (the significance level), reject the null hypothesis. If the p-value is not less than 0.05, do not reject the null hypothesis.
(C) Conclusion: Write a conclusion to the original research question in terms of the target population.
Test Statistics
The test statistic for a one-sample t-test is:
$$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$$
where $\bar{x}$ is the sample mean, $\mu_0$ is the hypothesized population mean, $s$ is the sample standard deviation, and $n$ is the sample size. This statistic follows a Student's t-distribution with $n - 1$ degrees of freedom.
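The calculation can be sketched in Python using only the standard library. The function names and the data are hypothetical. The p-value is obtained by numerically integrating the t density with Simpson's rule, which is a stand-in for the lookup a statistics package or table would provide, not a method prescribed by these notes.

```python
import math
import statistics


def t_pdf(t, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    # math.gamma overflows for very large df; fine for typical small samples.
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)


def two_sided_p(t_stat, df, steps=2000):
    """Two-sided p-value via Simpson's rule on the density over [0, |t|]."""
    upper = abs(t_stat)
    if upper == 0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3           # P(0 < T < |t|)
    return max(0.0, 1 - 2 * area)  # probability in both tails


def one_sample_t(sample, mu_0):
    """Return (t statistic, degrees of freedom, two-sided p-value)."""
    n = len(sample)
    x_bar = statistics.mean(sample)
    s = statistics.stdev(sample)  # sample standard deviation
    t_stat = (x_bar - mu_0) / (s / math.sqrt(n))
    df = n - 1
    return t_stat, df, two_sided_p(t_stat, df)


# Hypothetical measurements tested against a hypothesized mean of 50.
sample = [51.2, 49.8, 52.5, 50.9, 53.1, 51.7, 50.4, 52.0]
t_stat, df, p = one_sample_t(sample, mu_0=50)
print(f"t = {t_stat:.3f}, df = {df}, p = {p:.4f}")
```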
Two-Sample t-Test
A two-sample t-test is used to compare the means of two independent populations.
Null Hypothesis ($H_0$): $\mu_1 = \mu_2$ (There is no difference between the two population means.)
Alternative Hypothesis ($H_1$): $\mu_1 \neq \mu_2$ (There is a difference between the two population means.)
Assumptions
For a two-sample t-test, the following assumptions should be met:
The observations are independent.
The observations are approximately normally distributed (not too skewed).
The variances of the two independent groups are equal.
Test Statistics (Equal Variance)
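The test statistic for a two-sample t-test, assuming equal variances, is:
$$t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$$
where $\bar{x}_1$ and $\bar{x}_2$ are the sample means, $n_1$ and $n_2$ are the sample sizes, and $s_p$ is the pooled standard deviation. The degrees of freedom are calculated as $n_1 + n_2 - 2$.
The pooled standard deviation is calculated as:
$$s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$$
where $s_1^2$ and $s_2^2$ are the sample variances.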
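A minimal Python sketch of the pooled calculation follows. The function name and the two small groups are made up for illustration, and the code assumes the equal-variance condition has already been checked.

```python
import math
import statistics


def pooled_two_sample_t(group_1, group_2):
    """Return (t statistic, pooled SD, degrees of freedom), assuming equal variances."""
    n_1, n_2 = len(group_1), len(group_2)
    var_1 = statistics.variance(group_1)  # sample variance of group 1
    var_2 = statistics.variance(group_2)  # sample variance of group 2
    df = n_1 + n_2 - 2
    # Pooled SD: weighted average of the two variances, then square root.
    s_p = math.sqrt(((n_1 - 1) * var_1 + (n_2 - 1) * var_2) / df)
    t_stat = (statistics.mean(group_1) - statistics.mean(group_2)) / (
        s_p * math.sqrt(1 / n_1 + 1 / n_2)
    )
    return t_stat, s_p, df


# Hypothetical measurements from two independent groups.
group_a = [2.0, 4.0, 6.0]
group_b = [1.0, 2.0, 3.0]
t_stat, s_p, df = pooled_two_sample_t(group_a, group_b)
print(f"t = {t_stat:.3f}, pooled SD = {s_p:.3f}, df = {df}")
```

The p-value would then come from a t-distribution with the returned degrees of freedom.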
Paired t-Test
The paired t-test is used when there are two measurements for each participant. This is effectively a one sample t-test on the differences between matched pairs.
Hypotheses
$H_0$: $\mu_d = 0$ (The mean difference is zero.)
$H_1$: $\mu_d \neq 0$ (The mean difference is not zero.)
The test statistic for the paired t-test is:
$$t = \frac{\bar{d} - \mu_d}{s_d / \sqrt{n}}$$
where $\bar{d}$ is the mean of the differences, $\mu_d$ is the hypothesized mean difference (usually 0), $s_d$ is the standard deviation of the differences, and $n$ is the number of differences. The degrees of freedom are $n - 1$.
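Because the paired test is a one-sample test on the differences, the code reduces to computing those differences first. The sketch below uses hypothetical before/after scores, and the function name is an illustrative choice.

```python
import math
import statistics


def paired_t(before, after, mu_d=0.0):
    """Return (t statistic, degrees of freedom) for matched pairs."""
    if len(before) != len(after):
        raise ValueError("each participant needs exactly two measurements")
    diffs = [a - b for b, a in zip(before, after)]  # after minus before
    n = len(diffs)
    d_bar = statistics.mean(diffs)  # mean of the differences
    s_d = statistics.stdev(diffs)   # sample SD of the differences
    t_stat = (d_bar - mu_d) / (s_d / math.sqrt(n))
    return t_stat, n - 1


# Hypothetical scores for four participants measured twice.
before = [10.0, 12.0, 9.0, 11.0]
after = [12.0, 13.0, 12.0, 13.0]
t_stat, df = paired_t(before, after)
print(f"t = {t_stat:.3f}, df = {df}")
```

The sign of the statistic depends on the direction of subtraction, so the direction (here, after minus before) should be stated alongside the result.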
Chi-Square Goodness-of-Fit Test
This test is used to determine if the observed proportions for a single categorical variable differ significantly from expected proportions.
It applies when there is a single categorical variable (nominal or ordinal).
The research question or hypothesis compares the proportion of observations in each category to a hypothesized distribution (either an even split or some other specified breakdown).
Example
Research Question: Are patterns of absenteeism the same on all weekdays?
Hypotheses
$p_1$ = proportion sick on Mondays in the population.
$p_2$ = proportion sick on Tuesdays in the population.
…
$p_5$ = proportion sick on Fridays in the population.
The null hypothesis must account for each weekday and their hypothesized proportions must add to one.
i.e., $H_0$: The proportion sick is the same on each weekday ($p_1 = p_2 = p_3 = p_4 = p_5 = 0.2$).
$H_1$: Not all the proportions are as stated in the null hypothesis.
Expected Values
Expected sick leave = $E_i = n \times p_i$ = sample size × hypothesized probability in each group
Assumptions
The test is only valid if all expected counts are at least 5.
Test Statistic
The test statistic is calculated by
$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$$
where $O_i$ is the observed frequency and $E_i$ is the expected frequency for each category.
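The expected counts, the assumption check, and the test statistic can be put together in a short Python sketch. The function name and the weekday counts are invented for illustration and are not data from the absenteeism example.

```python
def chi_square_gof(observed, hypothesized_props):
    """Return (test statistic, expected counts, degrees of freedom)."""
    if abs(sum(hypothesized_props) - 1.0) > 1e-9:
        raise ValueError("hypothesized proportions must add to one")
    n = sum(observed)  # sample size
    expected = [n * p for p in hypothesized_props]
    if min(expected) < 5:
        raise ValueError("every expected count should be at least 5")
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1  # number of categories minus 1
    return statistic, expected, df


# Hypothetical sick-leave counts for Monday to Friday (illustrative only).
observed = [30, 15, 15, 15, 25]
statistic, expected, df = chi_square_gof(observed, [0.2] * 5)
print(f"expected = {expected}, chi-square = {statistic:.2f}, df = {df}")
```

With 100 observations and an even split, each expected count is 20, so the assumption about expected counts is comfortably met in this example.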
P-Value
Degrees of freedom = number of categories – 1
Excel Code: =CHISQ.DIST.RT([Test Statistic], [Degrees of Freedom])
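For readers working outside Excel, the same right-tail probability can be computed in plain Python. The sketch below uses the finite-series expressions for the chi-square tail with integer degrees of freedom. The function name is a hypothetical choice, and the implementation is meant as an illustration rather than a replacement for a statistics library.

```python
import math


def chi_square_right_tail(x, df):
    """P(X > x) for a chi-square variable with positive integer df."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum over r = 0 .. df/2 - 1 of (x/2)^r / r!
        term, total = 1.0, 1.0
        for r in range(1, df // 2):
            term *= (x / 2) / r
            total += term
        return math.exp(-x / 2) * total
    # Odd df: two-sided normal tail plus a finite correction series.
    root = math.sqrt(x)
    tail = math.erfc(root / math.sqrt(2))  # equals 2 * P(Z > sqrt(x))
    density = math.exp(-x / 2) / math.sqrt(2 * math.pi)
    term, total = 0.0, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        # term_r = sqrt(x)^(2r - 1) / (1 * 3 * ... * (2r - 1))
        term = root if r == 1 else term * x / (2 * r - 1)
        total += term
    return tail + 2 * density * total


# Example: a test statistic of 10 with 4 degrees of freedom (hypothetical values).
p_value = chi_square_right_tail(10.0, 4)
print(f"p-value = {p_value:.4f}")
```

The arguments are in the same order as in the Excel formula: the test statistic first, then the degrees of freedom.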
Conclusion
If the p-value < 0.05: Reject $H_0$. There is significant evidence that the proportions are not as claimed in the null hypothesis.
If the p-value ≥ 0.05: Do not reject $H_0$. There is no significant evidence that the proportions differ from those claimed in the null hypothesis.