Data Cleaning and Hypothesis Testing Notes

Population Notation

When analyzing data, it's crucial to distinguish between sample and population notation:

Sample Notation: Used when referring to a subset of the population.
Population Notation: Used when referring to the entire group of interest.

Sample statistics are used to estimate population parameters.

Statistic	Sample Notation	Population Notation
Mean	$\bar{y}$	$\mu$
Median	$\tilde{y}$	$\tilde{\mu}$
Standard Deviation	$s$	$\sigma$
Variance	$s^2$	$\sigma^2$
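
As a small illustration of the sample column in the table above, the sketch below computes each sample statistic with Python's standard `statistics` module. The data values are made up for the example.

```python
import statistics

# Hypothetical sample of 8 measurements (illustrative values only).
sample = [2, 4, 4, 4, 5, 5, 7, 9]

y_bar = statistics.mean(sample)           # sample mean, estimates mu
y_tilde = statistics.median(sample)       # sample median, estimates the population median
s = statistics.stdev(sample)              # sample standard deviation (n - 1 divisor), estimates sigma
s_squared = statistics.variance(sample)   # sample variance (n - 1 divisor), estimates sigma^2

print(y_bar, y_tilde, s, s_squared)
```

Note that `stdev` and `variance` use the n - 1 divisor appropriate for a sample; `pstdev` and `pvariance` would be the choice only if the data were the whole population.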

Hypothesis Testing Steps

(H) Hypotheses: State the null hypothesis ( $H_0$ ) and the alternative hypothesis ( $H_1$ ) in terms of the population parameter of interest.
(A) Assumptions: Verify that the underlying assumptions of the statistical test are met.
(T) Test Statistic: Calculate the appropriate test statistic.
(P) p-value: Determine the p-value from the distribution of the test statistic.
(D) Decision: Compare the p-value to the significance level ( $\alpha$ ). If the p-value is less than $\alpha$ (typically 0.05), reject the null hypothesis. Otherwise, do not reject the null hypothesis.
(C) Conclusion: Provide a conclusion in the context of the original research question, referring back to the target population.

Significant Result

Using a significance level of 5% ( $\alpha = 0.05$ ):

If p-value < 0.05, reject $H_0$. This indicates there is statistically significant evidence against the null hypothesis.

Non-Significant Result

Using a significance level of 5% ( $\alpha = 0.05$ ):

If p-value ≥ 0.05, fail to reject $H_0$. This indicates there is no statistically significant evidence against the null hypothesis. The result is inconclusive: it does not show that $H_0$ is true.
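
The decision rule can be written as a short function. This is only a sketch: the name `decide` and the returned strings are choices made for this example.

```python
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Return the decision for a hypothesis test at significance level alpha."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie between 0 and 1")
    if p_value < alpha:
        return "reject H0"
    # A p-value equal to alpha is not "less than alpha", so H0 is not rejected.
    return "do not reject H0"


print(decide(0.012))  # reject H0
print(decide(0.30))   # do not reject H0
```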

Data Cleaning

Graphical explorations of variables are helpful.
Numerical summaries (counts, means, standard deviations) enhance understanding.
Valid data is essential for business insights and informed decision-making.

Types of Problems:

Incomplete: Missing values in variables.
Noisy: Errors or outliers in the data (e.g., Salary=“-10”).
Inconsistent: Discrepancies in codes or names (e.g., Age=“42” Birthday=“03/07/1997”).
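
The three problem types can be checked for programmatically. The sketch below uses made-up records and field names, and assumes birthdays are written as DD/MM/YYYY and that ages were recorded on a fixed reference date; both are assumptions for illustration only.

```python
from datetime import date, datetime

# Hypothetical records; field names and values are illustrative only.
records = [
    {"id": 1, "salary": 52000, "age": 26, "birthday": "03/07/1997"},
    {"id": 2, "salary": None, "age": 35, "birthday": "15/02/1988"},   # incomplete
    {"id": 3, "salary": -10, "age": 41, "birthday": "20/11/1982"},    # noisy
    {"id": 4, "salary": 61000, "age": 42, "birthday": "03/07/1997"},  # inconsistent
]

REFERENCE_DATE = date(2024, 1, 1)  # assumed date on which ages were recorded


def age_on(birthday_text, on_date):
    """Age in whole years, assuming the birthday is written as DD/MM/YYYY."""
    born = datetime.strptime(birthday_text, "%d/%m/%Y").date()
    had_birthday = (on_date.month, on_date.day) >= (born.month, born.day)
    return on_date.year - born.year - (0 if had_birthday else 1)


def find_problems(record):
    """Return the list of problem types found in one record."""
    problems = []
    if any(value is None for value in record.values()):
        problems.append("incomplete")    # a variable has no value
    if record["salary"] is not None and record["salary"] < 0:
        problems.append("noisy")         # impossible value
    if record["age"] != age_on(record["birthday"], REFERENCE_DATE):
        problems.append("inconsistent")  # age and birthday disagree
    return problems


report = {r["id"]: find_problems(r) for r in records}
print(report)
```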

How to prepare

Choosing variables for analysis.
Checking data for correctness.
Correcting errors, handling missing values, and addressing extreme values.
Deciding data format (unit of measurement, data type, number of categories).
Creating new variables from existing ones.

Missing Data

Missing at Random: The reason for missing data is random and not related to the unobserved data.
Missing Not at Random: The missing data is systematically related to unobserved factors or events not measured by the researcher.

Examples

Missing at Random: The weighing scale ran out of batteries, or it works only some of the time when placed on carpet.
Missing Not at Random: A broken weighing scale.

Outliers

An outlier is an observation that lies an abnormal distance from other values. Outliers should only be removed if there is a valid reason to believe the data is faulty. Keep outliers if they represent reliable measurements.
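
The notes do not say how to decide what counts as an "abnormal distance". One common convention, used here purely as an example, is to flag values more than 1.5 × IQR beyond the quartiles. The sketch flags candidates for review rather than deleting them, in line with the advice above; the data values are made up.

```python
import statistics

# Hypothetical measurements; 95 is the suspicious value.
data = [10, 12, 11, 13, 12, 11, 14, 12, 95]

# Quartiles from the standard library (default "exclusive" method).
q1, _, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Flag, do not remove: each flagged value still needs a reason before it is dropped.
flagged = [x for x in data if x < lower_fence or x > upper_fence]
print(flagged)  # [95]
```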

Central Limit Theorem (CLT)

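The CLT is useful when dealing with non-normally distributed variables. According to the Central Limit Theorem, for a sufficiently large sample size $n$, the standardized sample mean $\frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$ will be approximately standard Normal, where $\mu$ and $\sigma$ are the population mean and standard deviation.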

Formal Definition

The mean of a random sample has a sampling distribution whose shape can be approximated by a Normal distribution. The larger the sample, the better the approximation will be.

Sample Size

If the original population is approximately normal, a sample size of at least 25 is generally considered large enough to assume an approximate Normal distribution for sample means.
If the original population has a Normal distribution, then sample means will follow an exact Normal distribution, regardless of sample size.
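
A quick simulation can make the CLT concrete. The sketch below draws repeated samples from a strongly skewed population (an exponential distribution, chosen only for illustration) and checks that the sample means centre on $\mu$ with spread close to $\sigma/\sqrt{n}$. The sample size, number of repetitions, and seed are arbitrary choices.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

n = 30              # size of each sample
num_samples = 2000  # number of repeated samples

# Exponential population with rate 1: right-skewed, mean 1, standard deviation 1.
sample_means = [
    statistics.mean(random.expovariate(1.0) for _ in range(n))
    for _ in range(num_samples)
]

mean_of_means = statistics.mean(sample_means)
sd_of_means = statistics.stdev(sample_means)
theoretical_sd = 1 / n ** 0.5  # sigma / sqrt(n) with sigma = 1

print(mean_of_means)   # should be close to mu = 1
print(sd_of_means)     # should be close to theoretical_sd
print(theoretical_sd)
```

Plotting a histogram of `sample_means` would show a roughly bell-shaped curve even though the individual observations are far from Normal.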

One-Sample t-Test

A one-sample t-test is used when comparing the mean of a single sample to a hypothesized population mean.

Null Hypothesis ( $H_0$ ): $\mu = \mu_0$ (There is no difference between the population mean and the hypothesized mean.)
Alternative Hypothesis ( $H_1$ ): $\mu \neq \mu_0$ (There is a difference between the population mean and the hypothesized mean.)

Assumptions

Scores are numeric
Observations are independent
Observations are approximately normally distributed (not too skewed)
- If the original population is not too far from Normal, an n of at least 25 will be ‘large enough’ to assume an approximate Normal distribution for sample means (i.e. the CLT will apply).
- If the original population is a Normal distribution, then sample means will follow an exact Normal distribution, regardless of sample size (n) (i.e. we don’t need the CLT).

Hypothesis Testing Steps

(H) Hypotheses: State the null and the alternative hypotheses in terms of the parameter of interest.
(A) Assumptions: Check the underlying assumptions of the test.
(T) Test Statistic: Calculate the test statistic.
(P) p-value: Obtain the p-value for the test from the distribution of the test statistic.
(D) Decision: If the p-value is less than 0.05 (the significance level), reject the null hypothesis. If the p-value is not less than 0.05, do not reject the null hypothesis.
(C) Conclusion: Write a conclusion to the original research question in terms of the target population.

Test Statistics

The test statistic for a one-sample t-test is:

$t = \frac{\bar{y} - \mu_0}{\frac{s}{\sqrt{n}}}$

where $\bar{y}$ is the sample mean, $\mu_0$ is the hypothesized population mean, $s$ is the sample standard deviation, and $n$ is the sample size. Under the null hypothesis, this statistic follows a Student's t-distribution with $n - 1$ degrees of freedom.
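
As a worked example of the one-sample test statistic, the sketch below uses a made-up sample and a hypothesized mean of 4. The function name `one_sample_t` is a choice made for this example; the p-value would then be read from a t-distribution with the returned degrees of freedom.

```python
import math
import statistics


def one_sample_t(sample, mu0):
    """Return (t, df) for a one-sample t-test of H0: mu = mu0."""
    n = len(sample)
    y_bar = statistics.mean(sample)  # sample mean
    s = statistics.stdev(sample)     # sample standard deviation (n - 1 divisor)
    t = (y_bar - mu0) / (s / math.sqrt(n))
    return t, n - 1


# Hypothetical data: mean 5, sample variance 32/7.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
t, df = one_sample_t(sample, mu0=4)
print(round(t, 3), df)  # 1.323 7
```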

Two-Sample t-Test

A two-sample t-test is used to compare the means of two independent populations.

Null Hypothesis ( $H_0$ ): $\mu_1 = \mu_2$ (There is no difference between the two population means.)
Alternative Hypothesis ( $H_1$ ): $\mu_1 \neq \mu_2$ (There is a difference between the two population means.)

Assumptions

For a two-sample t-test, the following assumptions should be met:

The observations are independent.
The observations are approximately normally distributed (not too skewed).
The variances of the two independent groups are equal.

Hypothesis Testing Steps

(H) Hypotheses: State the null and the alternative hypotheses in terms of the parameter of interest.
(A) Assumptions: Check the underlying assumptions of the test.
(T) Test Statistic: Calculate the test statistic.
(P) p-value: Obtain the p-value for the test from the distribution of the test statistic.
(D) Decision: If the p-value is less than 0.05 (the significance level), reject the null hypothesis. If the p-value is not less than 0.05, do not reject the null hypothesis.
(C) Conclusion: Write a conclusion to the original research question in terms of the target population.

Test Statistics (Equal Variance)

The test statistic for a two-sample t-test, assuming equal variances, is:

$t = \frac{\bar{y}_1 - \bar{y}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$

where $\bar{y}_1$ and $\bar{y}_2$ are the sample means, $n_1$ and $n_2$ are the sample sizes, and $s_p$ is the pooled standard deviation. The degrees of freedom are calculated as $df = n_1 + n_2 - 2$.

The pooled standard deviation is calculated as:

$s_p = \sqrt{\frac{s_1^2(n_1 - 1) + s_2^2(n_2 - 1)}{n_1 + n_2 - 2}}$

where $s_1^2$ and $s_2^2$ are the sample variances.
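
The pooled standard deviation and the equal-variance test statistic can be computed together. The sketch below uses two made-up groups with identical sample variances, so the equal-variance assumption is plausible; the function name `pooled_two_sample_t` is a choice made for this example.

```python
import math
import statistics


def pooled_two_sample_t(sample1, sample2):
    """Return (t, df, sp) for a two-sample t-test assuming equal variances."""
    n1, n2 = len(sample1), len(sample2)
    var1 = statistics.variance(sample1)  # sample variance of group 1
    var2 = statistics.variance(sample2)  # sample variance of group 2
    df = n1 + n2 - 2
    # Pooled standard deviation: variances weighted by their degrees of freedom.
    sp = math.sqrt((var1 * (n1 - 1) + var2 * (n2 - 1)) / df)
    diff = statistics.mean(sample1) - statistics.mean(sample2)
    t = diff / (sp * math.sqrt(1 / n1 + 1 / n2))
    return t, df, sp


# Hypothetical groups: means 3 and 5, both with sample variance 2.5.
group1 = [1, 2, 3, 4, 5]
group2 = [3, 4, 5, 6, 7]
t, df, sp = pooled_two_sample_t(group1, group2)
print(round(t, 3), df, round(sp, 3))  # -2.0 8 1.581
```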

Paired t-Test

The paired t-test is used when there are two measurements for each participant. It is effectively a one-sample t-test on the differences between matched pairs.

Hypotheses

$H_0: \mu_d = 0$ (The mean difference is zero.)
$H_1: \mu_d \neq 0$ (The mean difference is not zero.)

Hypothesis Testing Steps

(H) Hypotheses: State the null and the alternative hypotheses in terms of the parameter of interest.
(A) Assumptions: Check the underlying assumptions of the test.
(T) Test Statistic: Calculate the test statistic.
(P) p-value: Obtain the p-value for the test from the distribution of the test statistic.
(D) Decision: If the p-value is less than 0.05 (the significance level), reject the null hypothesis. If the p-value is not less than 0.05, do not reject the null hypothesis.
(C) Conclusion: Write a conclusion to the original research question in terms of the target population.

The test statistic for the paired t-test is:

$t = \frac{\bar{y}_d - \mu_d}{\frac{s_d}{\sqrt{n_d}}}$

where $\bar{y}_d$ is the mean of the differences, $\mu_d$ is the hypothesized mean difference (usually 0), $s_d$ is the standard deviation of the differences, and $n_d$ is the number of differences. The degrees of freedom are $n_d - 1$.
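
Because the paired test is a one-sample test on the differences, the calculation reduces to a few lines. The before/after values below are made up, and the hypothesized mean difference is taken to be 0.

```python
import math
import statistics

# Hypothetical paired measurements for four participants.
before = [10, 12, 9, 11]
after = [12, 14, 10, 13]

# One difference per participant (after minus before).
differences = [a - b for a, b in zip(after, before)]

n_d = len(differences)
y_bar_d = statistics.mean(differences)  # mean of the differences
s_d = statistics.stdev(differences)     # standard deviation of the differences
mu_d = 0                                # hypothesized mean difference

t = (y_bar_d - mu_d) / (s_d / math.sqrt(n_d))
df = n_d - 1
print(differences, round(t, 3), df)  # [2, 2, 1, 2] 7.0 3
```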

Chi-Square Goodness-of-Fit Test

This test is used to determine if the observed proportions for a single categorical variable differ significantly from expected proportions.

Single categorical variable (nominal or ordinal).
The research question or hypothesis compares the proportion of observations in each category to a specified breakdown (either an even split or some other set of proportions).

Example

Research Question: Are patterns of absenteeism the same on all weekdays?

Hypotheses

$p_{Monday}$ = proportion sick on Mondays in the population.
$p_{Tuesday}$ = proportion sick on Tuesdays in the population.
…
$p_{Friday}$ = proportion sick on Fridays in the population.

The null hypothesis must account for each weekday and their hypothesized proportions must add to one.

$H_0: p_{Mon} = 0.2, p_{Tues} = 0.2, p_{Wed} = 0.2, p_{Thurs} = 0.2, p_{Fri} = 0.2$

i.e., $H_0$ : The proportion sick is the same on each weekday.

$H_1$ : Not all the proportions are as stated in the null hypothesis.

Expected Values

Expected sick leave count for day $i$: $E_i = n \times p_i$ = sample size × hypothesized proportion for that day

Assumptions

The test is only valid if all expected counts are $\geq$ 5.
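
The expected counts and the validity check can be sketched as follows. The sample size of 100 is a made-up value; the hypothesized proportions are the even split from the weekday example.

```python
import math

# Hypothesized proportions under H0 (even split across weekdays).
hypothesized = {"Mon": 0.2, "Tues": 0.2, "Wed": 0.2, "Thurs": 0.2, "Fri": 0.2}
n = 100  # hypothetical total number of sick days observed

# The hypothesized proportions must add to one.
if not math.isclose(sum(hypothesized.values()), 1.0):
    raise ValueError("hypothesized proportions must sum to 1")

# Expected count in each category: E_i = n * p_i.
expected = {day: n * p for day, p in hypothesized.items()}

# The test is only valid if every expected count is at least 5.
assumption_met = all(count >= 5 for count in expected.values())
print(expected, assumption_met)
```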

Test Statistic

The test statistic is calculated by

$\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}$

where $O_i$ is the observed frequency and $E_i$ is the expected frequency for each category.
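
A worked example of the test statistic, using made-up observed counts for the weekday example with 100 sick days in total and an expected count of 20 per day:

```python
# Hypothetical observed sick-day counts (total 100).
observed = {"Mon": 30, "Tues": 14, "Wed": 16, "Thurs": 18, "Fri": 22}

# Expected counts under H0: 100 * 0.2 = 20 for each weekday.
expected = {day: 20.0 for day in observed}

# Sum of (O_i - E_i)^2 / E_i over all categories.
chi_square = sum(
    (observed[day] - expected[day]) ** 2 / expected[day] for day in observed
)
print(round(chi_square, 3))  # 8.0
```

Here the contributions are 5.0, 1.8, 0.8, 0.2 and 0.2, so most of the statistic comes from Monday.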

P-Value

Degrees of freedom = number of categories – 1

Excel Code: =CHISQ.DIST.RT([Test Statistic], [Degrees of Freedom])
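
Outside Excel, the same right-tail probability can be computed directly when the degrees of freedom are even, because the chi-square tail then has a simple closed form: $e^{-x/2} \sum_{k=0}^{df/2 - 1} \frac{(x/2)^k}{k!}$. The sketch below covers only that case (the weekday example has 5 categories, so df = 4); a statistics library would be needed for odd df. The function name and the test statistic of 8.0 are assumptions for illustration.

```python
import math


def chi_square_sf_even_df(x, df):
    """Right-tail probability P(X >= x) for a chi-square variable with even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only holds for positive even df")
    half = x / 2
    term, total = 1.0, 0.0
    for k in range(df // 2):
        total += term          # add (x/2)^k / k!
        term *= half / (k + 1)  # next term of the series
    return math.exp(-half) * total


# Hypothetical test statistic of 8.0 with 5 categories, so df = 5 - 1 = 4.
p_value = chi_square_sf_even_df(8.0, df=4)
print(round(p_value, 4))  # 0.0916
```

With these numbers the p-value is above 0.05, so the null hypothesis would not be rejected.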

Conclusion

If the p-value < 0.05: Reject $H_0$. There is evidence that the proportions are not all as stated in the null hypothesis.
If the p-value ≥ 0.05: Do not reject $H_0$. There is no evidence that the proportions differ from those stated in the null hypothesis.