Hypothesis Testing: Key Concepts and R Implementation (STTN327, Unit 04)

  • Purpose: test whether population parameters (means, proportions) are consistent with hypothesized values, or whether they differ between groups.

  • Types covered:

    • Mean of 1 population (μ)

    • Means of 2 independent populations (μ1, μ2)

    • Means of 2 dependent (paired) populations (μD)

    • Non-parametric tests (Wilcoxon)

    • Chi-squared tests (Goodness-of-fit and Independence)

  • R focus: primary functions and typical options; how to interpret outputs and check assumptions.


1) Hypothesis tests for the mean of 1 population (µ)

  • Basic one-sample t-test in R:

    • Function: t.test(X)

    • Null hypothesis: H0: μ = μ0 (default μ0 = 0 if not specified)

    • Alternatives: alternative = "less", "greater", or "two.sided"

    • Specifying a hypothesized mean: mu = μ0

    • Notes: Type "?t.test" for more options.

  • Key formula (one-sample):

    • Test statistic: t = \frac{\bar{X} - \mu_0}{s/\sqrt{n}}

    • Degrees of freedom: df = n - 1

    • Interpretation: the p-value assesses the evidence against H0: μ = μ0 in favor of the specified alternative.

  • Example: Test if population mean body temperature is > 36 using 5 observations.

    • Data: temp <- c(33, 38, 37, 39, 36)

    • R command: t.test(temp, alternative = "greater", mu = 36)

    • Output (interpreted):

    • t = 0.58277, df = 4, p-value = 0.2957

    • Alternative: true mean is greater than 36

    • 95% CI: (34.40513, Inf)

    • Sample mean: 36.6

    • Takeaway: with this small sample, there is not enough evidence to conclude that the population mean exceeds 36 at the 5% level.
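The example above can be run end to end; a minimal sketch (the inline comments restate the slide output):

```r
# One-sided, one-sample t-test: H0: mu = 36 vs HA: mu > 36
temp <- c(33, 38, 37, 39, 36)
res <- t.test(temp, alternative = "greater", mu = 36)

round(unname(res$statistic), 5)  # t = 0.58277
unname(res$parameter)            # df = 4
round(res$p.value, 4)            # p-value = 0.2957
mean(temp)                       # sample mean = 36.6
```

Since p = 0.2957 > 0.05, H0 is not rejected, matching the takeaway above.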


2) Testing for normality (prerequisite for t-tests)

  • Why test normality?

    • The t-test assumes that the underlying population is approximately normal (especially relevant for small samples).

    • If the normality assumption is violated, p-values from t.test may not be trustworthy; consider non-parametric methods.

  • Graphical tests (to assess normality):

    • Normal QQ-plot (qqnorm + qqline)

    • Box-plot (less definitive for normality)

    • Histogram with kernel density and overlaid normal density

  • Example (graphical):

    • Data: temp <- c(33, 38, 37, 39, 36)

    • Commands:

    • QQ-plot: qqnorm(temp); qqline(temp)

    • Box-plot: boxplot(temp)

    • Histogram with overlay: hist(temp, freq = FALSE)

      • Add density plot: d <- density(temp); lines(d, lty = 2)

      • Overlay normal density: lines(d$x, dnorm(d$x, mean(temp), sd(temp)))

  • Formal test: Shapiro-Wilk test

    • Command: shapiro.test(X)

    • Null: H0: X comes from a normally distributed population; alternative: not normal.

    • Example: temp <- c(33, 38, 37, 39, 36)

    • Output: W = 0.9427, p-value = 0.6853

    • Interpretation: fail to reject normality (in this example).

  • Conclusion about normality:

    • If data are from a normal population, p-values from t.test can be trusted.

    • If not normal, consider non-parametric methods (e.g., Wilcoxon tests) instead of t-tests.
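The graphical and formal checks in this section can be combined into one short script (reusing the temperature data from Section 1; the quoted W and p-value are the slide's output for these data):

```r
temp <- c(33, 38, 37, 39, 36)

# Graphical checks: QQ-plot, box-plot, histogram with density overlays
qqnorm(temp); qqline(temp)
boxplot(temp)
hist(temp, freq = FALSE)
d <- density(temp)                            # kernel density estimate
lines(d, lty = 2)
lines(d$x, dnorm(d$x, mean(temp), sd(temp)))  # fitted normal density

# Formal check: Shapiro-Wilk (H0: data come from a normal population)
sw <- shapiro.test(temp)
sw$statistic  # W = 0.9427
sw$p.value    # p-value = 0.6853: no evidence against normality here
```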


3) Hypothesis tests for 2 independent populations (μ1 vs μ2)

  • Null and alternative:

    • H0: μ1 = μ2 vs HA: μ1 ≠ μ2 (two-sided; can be one-sided with "less" or "greater").

  • Basic two-sample t-test in R:

    • Command variants:

    • t.test(X1, X2) where X1 and X2 are the two samples

    • Or: t.test(X ~ grp) where X is the pooled data and grp is a grouping variable

    • Options:

    • alternative = "less", "greater", or "two.sided"

    • var.equal = TRUE/FALSE (assume equal variances or not; Welch's t-test if FALSE)

    • Help: Type "?t.test" for more options.

  • Assumptions to check before the t-test (two independent samples):

    • Normality of the pooled data (or each group) – assess by pooling centered and scaled data to test normality:

    • Center and scale each sample:

      • X^*_i = \frac{X_i - \bar{X}}{s_X} for group 1,

      • Y^*_j = \frac{Y_j - \bar{Y}}{s_Y} for group 2

      • \bar{X} = \frac{1}{n_1}\sum X_i, \quad s_X^2 = \frac{1}{n_1-1}\sum (X_i - \bar{X})^2

      • \bar{Y} = \frac{1}{n_2}\sum Y_j, \quad s_Y^2 = \frac{1}{n_2-1}\sum (Y_j - \bar{Y})^2

    • Pool the standardized values and test normality on the pooled sample (standardizing removes differences in group means and variances, so a single normality test can be applied to all observations under H0).

    • Equal variances: test H0: σ1² = σ2² vs HA: σ1² ≠ σ2²

    • Command: var.test(X1, X2) or var.test(X ~ grp)

  • Example: Body temperatures (Men vs Women)

    • Data:

    • tempM <- c(37, 39, 36, 34, 35) (Men)

    • tempF <- c(33, 35, 33, 34) (Women)

    • Normality check (pooled after centering and scaling):

    • tempM.s = (tempM - mean(tempM)) / sd(tempM)

    • tempF.s = (tempF - mean(tempF)) / sd(tempF)

    • pooled <- c(tempM.s, tempF.s)

    • shapiro.test(pooled) # Example: W = 0.9140, p-value = 0.345 (illustrative)

    • Equal variances: var.test(tempM, tempF) # Example statistic: F = 4.0364, p-value = 0.281

    • Two-sample t-test (assuming equal variances):

    • Command: t.test(tempF, tempM, alternative = "less", var.equal = TRUE)

    • Output: t = -2.3066, df = 7, p-value = 0.02723; means: tempF ≈ 33.75, tempM ≈ 36.20

    • Interpretation: evidence that mean temperature of women is less than that of men (at typical significance levels).
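The whole Section 3 workflow can be put together as one runnable sketch (inline comments restate the slide output):

```r
tempM <- c(37, 39, 36, 34, 35)  # Men
tempF <- c(33, 35, 33, 34)      # Women

# 1) Normality: standardize each group, pool, then test
tempM.s <- (tempM - mean(tempM)) / sd(tempM)
tempF.s <- (tempF - mean(tempF)) / sd(tempF)
shapiro.test(c(tempM.s, tempF.s))

# 2) Equal variances: F-test (H0: the two population variances are equal)
var.test(tempM, tempF)

# 3) Pooled two-sample t-test, HA: mean of women < mean of men
res <- t.test(tempF, tempM, alternative = "less", var.equal = TRUE)
round(unname(res$statistic), 4)  # t = -2.3066
unname(res$parameter)            # df = 7
round(res$p.value, 5)            # p-value = 0.02723
```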


4) Hypothesis tests for 2 dependent (paired) populations (μD)

  • Setup: test if mean difference between paired samples is zero.

  • Null/Alternative:

    • H0: μD = 0 vs HA: μD ≠ 0 (two-sided), or one-sided depending on the hypothesis.

  • Basic paired t-test in R:

    • Command variants:

    • t.test(X1, X2, paired = TRUE)

    • Or: t.test(X ~ grp, paired = TRUE) where grp encodes the pairing

    • Or: define differences Xd = X2 - X1 and run t.test(Xd)

    • The paired t-test statistic:

    • Let differences d_i = X_{2i} - X_{1i}, with mean \bar{d} and SD s_d; then
      t = \frac{\bar{d}}{s_d/\sqrt{n}} with df = n - 1

  • Example: Body temperature before vs after sleeping (n = 5)

    • Data:

    • Before: tempBef = c(38, 39, 36, 34, 35)

    • After: tempAft = c(36, 39, 35, 35, 35)

    • Tests (any of the three approaches will work):

    • t.test(tempAft, tempBef, paired = TRUE, alternative = "less")

    • t.test(tempAft - tempBef, alternative = "less")

    • Or: poolTemp <- c(tempAft, tempBef); grp <- rep(1:2, each = 5); t.test(poolTemp ~ grp, paired = TRUE, alternative = "less")

    • Common output:

    • Paired t-test; t = -0.7845, df = 4, p-value = 0.2383

    • Alternative: true difference in means is less than 0

    • 95% CI: (-Inf, 0.6870)

    • Mean difference: -0.4

  • Interpretation: with this data, there is not enough evidence that after sleeping temperatures are lower than before (at 5% level).
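A short sketch confirming that the paired formulations agree (comments restate the slide output):

```r
tempBef <- c(38, 39, 36, 34, 35)
tempAft <- c(36, 39, 35, 35, 35)

r1 <- t.test(tempAft, tempBef, paired = TRUE, alternative = "less")
r2 <- t.test(tempAft - tempBef, alternative = "less")  # same test on differences

round(unname(r1$statistic), 4)  # t = -0.7845
unname(r1$parameter)            # df = 4
round(r1$p.value, 4)            # p-value = 0.2383
mean(tempAft - tempBef)         # mean difference = -0.4
```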


5) Non-parametric hypothesis tests (Wilcoxon tests)

  • When to use: when normality assumptions are questionable or sample sizes are small.

  • Basic Wilcoxon tests in R:

    • One-sample: wilcox.test(X)

    • Two-sample: wilcox.test(X1, X2) or wilcox.test(X ~ grp)

  • Relationship to t-test: Wilcoxon tests are called and interpreted much like t.test, but they concern the median (not the mean) and are robust to non-normal data.

  • Help: Type "?wilcox.test" for options.

  • Example exercises (from slides):

    • Repeat the exercise from last week using Wilcoxon tests on data in taste.txt:

    • Test whether mean score of Green pudding is greater than 35.

    • Test whether Green and Brown puddings differ in mean score.

  • Note: Wilcoxon tests can be a drop-in replacement for t-tests in some analyses, but interpret medians/rank-based results rather than means.
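taste.txt is not reproduced in these notes, so the sketch below reuses the earlier temperature data purely to illustrate the one- and two-sample calls:

```r
temp  <- c(33, 38, 37, 39, 36)
tempM <- c(37, 39, 36, 34, 35)
tempF <- c(33, 35, 33, 34)

# One-sample: is the location (median) greater than 36?
w1 <- wilcox.test(temp, mu = 36, alternative = "greater")

# Two-sample (Mann-Whitney): is the women's location below the men's?
w2 <- wilcox.test(tempF, tempM, alternative = "less")

w1$p.value; w2$p.value
# Ties/zero differences in such tiny samples trigger a warning that the exact
# p-value cannot be computed; R falls back to a normal approximation.
```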


6) Chi-squared tests in R: Goodness-of-fit and Independence

  • Purpose: compare observed frequencies to expected frequencies under a specified distribution (goodness-of-fit) or test independence between two categorical variables (two-way tables).

  • Basic one-way chi-squared test (goodness-of-fit):

    • Function: chisq.test(X) where X is a vector of observed frequencies, or chisq.test(table(X)) from raw data.

    • You can specify the expected probabilities with chisq.test(X, p = p_vec), where p_vec contains the expected probabilities summing to 1.

    • Null: H0: the observed frequencies are consistent with the expected frequencies E_i for all categories (the data follow the specified distribution).

    • Test statistic: \chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}

  • Example (eye colors):

    • Observed: ObsFreqs <- c(17, 65, 18) for Green, Blue, Brown

    • Expected (under some distribution): ExpFreqs <- c(25, 50, 25)

    • Compute ExpFreqs.p = ExpFreqs / sum(ExpFreqs) to convert to probabilities, then run:

    • chisq.test(ObsFreqs, p = ExpFreqs.p)

    • Output example: X-squared = 9.02, df = 2, p-value = 0.011
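The eye-colour example is fully reproducible (comments restate the slide output):

```r
ObsFreqs <- c(17, 65, 18)               # observed: Green, Blue, Brown
ExpFreqs <- c(25, 50, 25)               # expected counts under H0
ExpFreqs.p <- ExpFreqs / sum(ExpFreqs)  # convert counts to probabilities

res <- chisq.test(ObsFreqs, p = ExpFreqs.p)
round(unname(res$statistic), 2)  # X-squared = 9.02
unname(res$parameter)            # df = 2
round(res$p.value, 3)            # p-value = 0.011
res$expected                     # 25 50 25: matches ExpFreqs
```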

  • Example: survey data with possible warning about approximation

    • Data: read in survey.csv and construct a table with Smoking vs Exercise; run:

    • tbl <- table(survey$Smoke, survey$Exer)

    • chisq.test(tbl)

    • Possible warning: "Chi-squared approximation may be incorrect" if some expected counts < 5.

    • Remedy: combine categories to increase expected counts (e.g., merge None/Some exercise into a single category).

    • Example adjustment: combo <- tbl[,"None"] + tbl[,"Some"]; newtbl <- cbind(tbl[,"Freq"], combo); chisq.test(newtbl)
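Assuming the slides' survey.csv matches the survey data frame shipped with the MASS package (an assumption; the CSV itself is not reproduced here), the workflow looks like:

```r
library(MASS)  # provides the 'survey' data frame (Smoke and Exer columns)

tbl <- table(survey$Smoke, survey$Exer)
chisq.test(tbl)  # warns: some expected counts are below 5

# Remedy: merge the sparse "None" and "Some" exercise columns
newtbl <- cbind(tbl[, "Freq"], tbl[, "None"] + tbl[, "Some"])
chisq.test(newtbl)
```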

  • Two-way chi-squared test (tests of independence):

    • Raw data approach:

    • E.g., two factors Hair and Eyes; build vectors: Hair <- factor(…), Eyes <- factor(…)

    • Run: chisq.test(Hair, Eyes) or convert to a table first:

      • tbl <- table(Hair, Eyes)

      • chisq.test(tbl)

    • Output example (Hair vs Eyes): X-squared = 1.9658, df = 4, p-value = 0.7421

    • Tabulated data approach (matrix input):

    • ObsFreqMat <- matrix(c(6,8,4,7,6,6,4,6,7), ncol=3, byrow=TRUE)

    • chisq.test(ObsFreqMat)

    • Output: X-squared = 1.9658, df = 4, p-value = 0.7421
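The tabulated-data example can be verified directly (comments restate the slide output):

```r
# 3x3 contingency table entered row by row
ObsFreqMat <- matrix(c(6, 8, 4,
                       7, 6, 6,
                       4, 6, 7), ncol = 3, byrow = TRUE)

res <- chisq.test(ObsFreqMat)
round(unname(res$statistic), 4)  # X-squared = 1.9658
unname(res$parameter)            # df = (3-1)*(3-1) = 4
round(res$p.value, 4)            # p-value = 0.7421
round(res$expected, 2)           # expected counts under independence
```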

  • Warnings and caveats: chi-squared approximation may be incorrect when expected counts are small (<5). Remedies include data aggregation or alternative tests (e.g., Fisher’s exact test) not covered in slides but common in practice.
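A minimal illustration of that remedy, using a hypothetical 2x2 table (not from the slides):

```r
# Tiny table: every expected count is 2, far below the usual threshold of 5
m <- matrix(c(3, 1,
              1, 3), ncol = 2, byrow = TRUE)

chisq.test(m)           # warns that the approximation may be incorrect
fisher.test(m)$p.value  # exact p-value; no large-sample approximation needed
```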

  • Additional data/examples referenced in exercises:

    • Example 1: heads.csv (fair coin check) – test whether data are consistent with p = 1/2.

    • Example 2: survey on Opinion x Personnel type – test independence between two categorical variables.

    • Example 3: Mathematics achievement by sex (Open University 1983) – test for relationship between sex and achievement; also asks to implement a custom chisq.test-like function.

    • Example 4: Red vs Blue colors in sports outcomes – test whether red wins are 50/50 across sports; test distribution similarity across sports.


7) Worked datasets and homework/exercises mentioned in slides

  • taste.txt: used for multiple hypothesis-testing exercises (1-sample and 2-sample Wilcoxon/t-test variants). Tasks include:

    • Is Brown pudding mean score higher than Green pudding mean score?

    • Check all assumptions graphically and formally.

  • iceRICEp423.txt: test whether method B heat of fusion is significantly lower than method A; check all assumptions.

  • fishmercuryRICEp451.txt: compare mercury levels between Selective Reduction vs Permanganate methods; also check subset where both > 0.4.

  • heads.csv: used in an example for coin-toss fairness (Youden dataset) with 9207 heads and 8743 tails grouped in fives; tests for fairness (p = 0.5).

  • red-blue.xls: dataset about wearing color and contest outcomes; used for hypothesis testing of color effects on winning probability and comparing across sports.

  • survey.csv: smoking and exercise data; used to illustrate chi-squared test for independence and to demonstrate warning messages and aggregation.

  • Tips and resources:

    • When chi-squared warnings occur, consider combining rows/columns or using alternative tests.

    • The eFundi resource and online tutorials cited in slides provide examples and code variations.


8) Quick reference: key outputs and interpretation

  • One-sample t-test: t-statistic, df = n - 1, p-value, 95% CI for the mean, and sample mean. Use to test whether the population mean equals a hypothesized value.

  • Two-sample t-test: compare two means; equal vs unequal variances (var.equal option). Look at t-statistic, df, p-value, and confidence interval for the difference of means.

  • Paired t-test: tests mean difference in paired observations; use when data are naturally matched.

  • Shapiro-Wilk: W statistic and p-value; used to assess normality. A small p-value suggests non-normality.

  • Wilcoxon tests: non-parametric alternatives to t-tests (test medians/ranks rather than means).

  • Chi-squared: assess goodness-of-fit or independence.

    • Goodness-of-fit: compare observed frequencies to expected under a specified distribution.

    • Independence: test whether two categorical variables are independent in a contingency table.

  • Warnings: small expected frequencies (<5) undermine chi-squared approximations; remedy by combining categories or using exact tests.


9) Key formulas recap (LaTeX)

  • One-sample t-statistic:
    t=Xˉμ0s/ndf=n1t = \frac{\bar{X} - \mu_0}{s/\sqrt{n}}\quad df = n-1

  • Two-sample t-statistic (equal variances):
    s_p^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}, \qquad t = \frac{\bar{X}_1 - \bar{X}_2}{s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}

  • Paired t-statistic:
    d_i = X_{2i} - X_{1i}, \quad \bar{d} = \frac{1}{n}\sum d_i, \quad s_d^2 = \frac{1}{n-1}\sum (d_i - \bar{d})^2
    t = \frac{\bar{d}}{s_d/\sqrt{n}}, \quad df = n - 1

  • Chi-squared (goodness-of-fit or independence):
    \chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}, \quad df = k - 1 \text{ (or } df = (r-1)(c-1) \text{ for contingency tables)}

  • Normality: Shapiro-Wilk test statistic W with p-value; used to assess normality assumption for t-tests.


// End of notes. Use these sections and formulas as a comprehensive study aid for Hypothesis Testing in the unit.