Hypothesis Testing: Key Concepts and R Implementation (STTN327, Unit 04)
Purpose: test whether population parameters (means, proportions) align with hypothesized values or relationships between groups.
Types covered:
Mean of 1 population (μ)
Means of 2 independent populations (μ1, μ2)
Means of 2 dependent (paired) populations (μD)
Non-parametric tests (Wilcoxon)
Chi-squared tests (Goodness-of-fit and Independence)
R focus: primary functions and typical options; how to interpret outputs and check assumptions.
1) Hypothesis tests for the mean of 1 population (µ)
Basic one-sample t-test in R:
Function: t.test(X)
Null hypothesis: H_0: \mu = \mu_0 (default \mu_0 = 0 if mu is not specified)
Alternatives: alternative = "two.sided" (default), "less", or "greater"
Specifying a hypothesized mean: mu = m0, e.g., t.test(X, mu = 36)
Notes: Type "?t.test" for more options.
Key formula (one-sample):
Test statistic: t = (\bar{x} - \mu_0) / (s / \sqrt{n})
Degrees of freedom: df = n - 1
Interpretation: the p-value assesses evidence against H_0: \mu = \mu_0 in favor of the specified alternative.
Example: Test if population mean body temperature is > 36 using 5 observations.
Data: temp <- c(33, 38, 37, 39, 36)
R command: t.test(temp, alternative = "greater", mu = 36)
Output (interpreted):
t = 0.58277, df = 4, p-value = 0.2957
Alternative: true mean is greater than 36
95% CI: (34.40513, Inf)
Sample mean: 36.6
Takeaway: with this small sample, there is not enough evidence to conclude that the population mean exceeds 36 at the 5% level.
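The arithmetic behind this output can be checked by hand. A minimal sketch (same data as above) computing the t statistic and one-sided p-value directly from the formula:

```r
# One-sample t-test computed from first principles, mirroring
# t.test(temp, alternative = "greater", mu = 36)
temp <- c(33, 38, 37, 39, 36)
mu0  <- 36
n    <- length(temp)

t_stat <- (mean(temp) - mu0) / (sd(temp) / sqrt(n))   # test statistic
p_val  <- pt(t_stat, df = n - 1, lower.tail = FALSE)  # one-sided: "greater"

t_stat  # ~0.58277, matching the t.test() output above
p_val   # ~0.2957
```

Since p > 0.05, the manual computation reproduces the t.test() conclusion.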
2) Testing for normality (prerequisite for t-tests)
Why test normality?
The t-test assumes that the underlying population is approximately normal (especially relevant for small samples).
If the normality assumption is violated, p-values from t.test may not be trustworthy; consider non-parametric methods.
Graphical tests (to assess normality):
Normal QQ-plot (qqnorm + qqline)
Box-plot (less definitive for normality)
Histogram with kernel density and overlaid normal density
Example (graphical):
Data: temp <- c(33, 38, 37, 39, 36)
Commands:
QQ-plot: qqnorm(temp); qqline(temp)
Box-plot: boxplot(temp)
Histogram with overlay: hist(temp, freq = FALSE) (freq = FALSE puts the histogram on a density scale so the overlays match)
Add density plot: d <- density(temp); lines(d, lty = 2)
Overlay normal density: lines(d$x, dnorm(d$x, mean(temp), sd(temp)))
Formal test: Shapiro-Wilk test
Command: shapiro.test(X)
Null: the data come from a normally distributed population; Alternative: the population is not normal.
Example: temp <- c(33, 38, 37, 39, 36)
Output: W = 0.9427, p-value = 0.6853
Interpretation: fail to reject normality (in this example).
Conclusion about normality:
If data are from a normal population, p-values from t.test can be trusted.
If not normal, consider non-parametric methods (e.g., Wilcoxon tests) instead of t-tests.
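A short sketch of the formal check on the example data (the W and p values quoted here are those reported on the slides):

```r
# Shapiro-Wilk normality check on the example data from Section 2
temp <- c(33, 38, 37, 39, 36)
sw <- shapiro.test(temp)

sw$statistic  # W ~ 0.9427 (as reported on the slides)
sw$p.value    # ~0.6853 -> fail to reject normality at the 5% level
```

With p > 0.05 we have no evidence against normality, so the t-test p-values for this sample can be trusted.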
3) Hypothesis tests for 2 independent populations (μ1 vs μ2)
Null and alternative:
H_0: \mu_1 = \mu_2 vs H_1: \mu_1 \neq \mu_2 (two-sided; can be one-sided with "less" or "greater").
Basic two-sample t-test in R:
Command variants:
t.test(X1, X2), where X1 and X2 are the two samples
Or: t.test(X ~ grp), where X is the pooled data and grp is a grouping variable
Options:
alternative = "less", "greater", or "two.sided"
var.equal = TRUE/FALSE (assume equal variances or not; Welch's t-test if FALSE, the default)
Help: Type "?t.test" for more options.
Assumptions to check before the t-test (two independent samples):
Normality of the pooled data (or each group) – assess by pooling centered and scaled data:
Center and scale each sample:
Z_1 = (X_1 - \bar{x}_1) / s_1 for group 1,
Z_2 = (X_2 - \bar{x}_2) / s_2 for group 2
Pool the standardized values and test normality on the pooled sample (standardizing puts both groups on a common scale, so a single, larger sample can be tested for normality).
Equal variances: test H_0: \sigma_1^2 = \sigma_2^2 vs H_1: \sigma_1^2 \neq \sigma_2^2
Command: var.test(X1, X2) or var.test(X ~ grp)
Example: Body temperatures (Men vs Women)
Data:
tempM <- c(37, 39, 36, 34, 35) (Men)
tempF <- c(33, 35, 33, 34) (Women)
Normality check (pooled after centering and scaling):
tempM.s = (tempM - mean(tempM)) / sd(tempM)
tempF.s = (tempF - mean(tempF)) / sd(tempF)
pooled <- c(tempM.s, tempF.s)
shapiro.test(pooled) # Example: W = 0.9140, p-value = 0.345 (illustrative)
Equal variances: var.test(tempM, tempF) # Example statistic: F = 4.0364, p-value = 0.281 (F = s_M^2 / s_F^2, so the larger-variance sample is listed first here)
Two-sample t-test (assuming equal variances):
Command: t.test(tempF, tempM, alternative = "less", var.equal = TRUE)
Output: t = -2.3066, df = 7, p-value = 0.02723; means: tempF ≈ 33.75, tempM ≈ 36.20
Interpretation: evidence that mean temperature of women is less than that of men (at typical significance levels).
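The pooled (equal-variance) statistic can also be computed by hand as a check; a minimal sketch with the same data:

```r
# Pooled two-sample t statistic computed by hand, mirroring
# t.test(tempF, tempM, alternative = "less", var.equal = TRUE)
tempM <- c(37, 39, 36, 34, 35)
tempF <- c(33, 35, 33, 34)
nM <- length(tempM); nF <- length(tempF)

# Pooled variance: weighted average of the two sample variances
sp2 <- ((nF - 1) * var(tempF) + (nM - 1) * var(tempM)) / (nF + nM - 2)
t_stat <- (mean(tempF) - mean(tempM)) / sqrt(sp2 * (1 / nF + 1 / nM))
p_val  <- pt(t_stat, df = nF + nM - 2)   # one-sided: "less"

t_stat  # ~ -2.3066, matching the t.test() output
p_val   # ~ 0.0272
```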
4) Hypothesis tests for 2 dependent (paired) populations (μD)
Setup: test if mean difference between paired samples is zero.
Null/Alternative:
H_0: \mu_D = 0 vs H_1: \mu_D \neq 0 (two-sided), or one-sided depending on the hypothesis.
Basic paired t-test in R:
Command variants:
t.test(X1, X2, paired = TRUE)
Or: t.test(X ~ grp, paired = TRUE), where grp encodes the pairing
Or: define the differences d <- X1 - X2 and run t.test(d)
The paired t-test statistic:
Let the differences be d_i = x_{1i} - x_{2i}, with mean \bar{d} and standard deviation s_d; then
t = \bar{d} / (s_d / \sqrt{n}) with df = n - 1
Example: Body temperature before vs after sleeping (n = 5)
Data:
Before: tempBef = c(38, 39, 36, 34, 35)
After: tempAft = c(36, 39, 35, 35, 35)
Tests (any of the three approaches will work):
t.test(tempAft, tempBef, paired = TRUE, alternative = "less")
Or: d <- tempAft - tempBef; t.test(d, alternative = "less")
Or: poolTemp <- c(tempAft, tempBef); grp <- rep(1:2, each = 5); t.test(poolTemp ~ grp, paired = TRUE, alternative = "less")
Common output:
Paired t-test; t = -0.7845, df = 4, p-value = 0.2383
Alternative: true difference in means is less than 0
95% CI: (-Inf, 0.6870)
Mean difference: -0.4
Interpretation: with this data, there is not enough evidence that after sleeping temperatures are lower than before (at 5% level).
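The paired statistic reduces to a one-sample t-test on the differences; a minimal sketch with the same data:

```r
# Paired t statistic from the differences, mirroring
# t.test(tempAft, tempBef, paired = TRUE, alternative = "less")
tempBef <- c(38, 39, 36, 34, 35)
tempAft <- c(36, 39, 35, 35, 35)
d <- tempAft - tempBef     # per-subject differences
n <- length(d)

t_stat <- mean(d) / (sd(d) / sqrt(n))   # ~ -0.7845
p_val  <- pt(t_stat, df = n - 1)        # one-sided: "less", ~0.2383
```

With p well above 0.05, the manual computation agrees with the output quoted above.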
5) Non-parametric hypothesis tests (Wilcoxon tests)
When to use: when normality assumptions are questionable or sample sizes are small.
Basic Wilcoxon tests in R:
One-sample: wilcox.test(X, mu = m0, alternative = ...)
Two-sample: wilcox.test(X1, X2) or wilcox.test(X ~ grp)
Relationship to t-test: Wilcoxon tests are used and interpreted much like t.test, but they test hypotheses about the median (not the mean) and are robust to non-normal data.
Help: Type "?wilcox.test" for options.
Example exercises (from slides):
Repeat the exercise from last week using Wilcoxon tests on data in taste.txt:
Test whether mean score of Green pudding is greater than 35.
Test whether Green and Brown puddings differ in mean score.
Note: Wilcoxon tests can be a drop-in replacement for t-tests in some analyses, but interpret medians/rank-based results rather than means.
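Since taste.txt is not reproduced in these notes, an illustrative run on the Section 3 body-temperature data shows the drop-in usage (note that ties in the data make R fall back to a normal approximation, with a warning that the exact p-value cannot be computed):

```r
# Illustrative rank-sum (Mann-Whitney) test, the non-parametric
# counterpart of the two-sample t-test from Section 3
tempM <- c(37, 39, 36, 34, 35)
tempF <- c(33, 35, 33, 34)

# suppressWarnings() hides the tie warning; the approximate p-value
# is still returned
wt <- suppressWarnings(wilcox.test(tempF, tempM, alternative = "less"))
wt$p.value   # small p -> evidence that women's median temperature is lower
```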
6) Chi-squared tests in R: Goodness-of-fit and Independence
Purpose: compare observed frequencies to expected frequencies under a specified distribution (goodness-of-fit) or test independence between two categorical variables (two-way tables).
Basic one-way chi-squared test (goodness-of-fit):
Function: chisq.test(X), where X is a vector of observed frequencies, or chisq.test(table(X)) from raw data.
You can specify the expected probabilities with chisq.test(X, p = probs), where probs contains the expected probabilities summing to 1.
Null: H_0: p_i = p_{i,0} for all categories (the observed frequencies follow the specified distribution).
Test statistic: \chi^2 = \sum_i (O_i - E_i)^2 / E_i, with df = k - 1 (k = number of categories)
Example (eye colors):
Observed: a vector ObsFreqs of counts for Green, Blue, Brown
Expected (under some distribution): a vector ExpFreqs
Compute ExpFreqs.p = ExpFreqs / sum(ExpFreqs) to convert to probabilities, then run: chisq.test(ObsFreqs, p = ExpFreqs.p)
Output example: X-squared = 9.02, df = 2, p-value = 0.011
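The slides' actual eye-colour counts are not reproduced above, so the pattern is sketched here with made-up counts (illustrative only; the statistic and p-value below belong to these invented numbers, not to the slide example):

```r
# Goodness-of-fit sketch with made-up counts (hypothetical data)
ObsFreqs   <- c(25, 30, 45)              # invented Green, Blue, Brown counts
ExpFreqs   <- c(1, 1, 1)                 # invented expected pattern (equal)
ExpFreqs.p <- ExpFreqs / sum(ExpFreqs)   # probabilities summing to 1

gof <- chisq.test(ObsFreqs, p = ExpFreqs.p)
gof$statistic  # 6.5 for these made-up counts (df = 2)
gof$p.value    # ~0.0388 -> reject equal proportions at the 5% level
```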
Example: survey data with possible warning about approximation
Data: read in survey.csv and construct a table with Smoking vs Exercise; run:
tbl <- table(survey$Smoke, survey$Exer)
chisq.test(tbl)
Possible warning: "Chi-squared approximation may be incorrect" if some expected counts < 5.
Remedy: combine categories to increase expected counts (e.g., merge None/Some exercise into a single category).
Example adjustment: combo <- tbl[,"None"] + tbl[,"Some"]; newtbl <- cbind(tbl[,"Freq"], combo); chisq.test(newtbl)
Two-way chi-squared test (tests of independence):
Raw data approach:
E.g., two factors Hair and Eyes; build vectors: Hair <- factor(…), Eyes <- factor(…)
Run: chisq.test(Hair, Eyes), or convert to a table first:
tbl <- table(Hair, Eyes)
chisq.test(tbl)
Output example (Hair vs Eyes): X-squared = 1.9658, df = 4, p-value = 0.7421
Tabulated data approach (matrix input):
ObsFreqMat <- matrix(c(6,8,4,7,6,6,4,6,7), ncol=3, byrow=TRUE)
chisq.test(ObsFreqMat)
Output: X-squared = 1.9658, df = 4, p-value = 0.7421
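The expected counts under independence are E_ij = (row_i total × col_j total) / N; a minimal sketch computing the statistic by hand for this matrix and checking it against chisq.test():

```r
# Chi-squared statistic for the 3x3 table computed from first principles
ObsFreqMat <- matrix(c(6, 8, 4, 7, 6, 6, 4, 6, 7), ncol = 3, byrow = TRUE)

# Expected counts under independence: E_ij = row_i * col_j / N
Exp <- outer(rowSums(ObsFreqMat), colSums(ObsFreqMat)) / sum(ObsFreqMat)
X2  <- sum((ObsFreqMat - Exp)^2 / Exp)

X2   # ~1.9658, with df = (3 - 1) * (3 - 1) = 4, matching chisq.test()
```

All expected counts here exceed 5, which is why this example runs without the approximation warning discussed below.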
Warnings and caveats: chi-squared approximation may be incorrect when expected counts are small (<5). Remedies include data aggregation or alternative tests (e.g., Fisher’s exact test) not covered in slides but common in practice.
Additional data/examples referenced in exercises:
Example 1: heads.csv (fair coin check) – test whether data are consistent with p = 1/2.
Example 2: survey on Opinion x Personnel type – test independence between two categorical variables.
Example 3: Mathematics achievement by sex (Open University 1983) – test for relationship between sex and achievement; also asks to implement a custom chisq.test-like function.
Example 4: Red vs Blue colors in sports outcomes – test whether red wins are 50/50 across sports; test distribution similarity across sports.
7) Worked datasets and homework/exercises mentioned in slides
taste.txt: used for multiple hypothesis-testing exercises (1-sample and 2-sample Wilcoxon/t-test variants). Tasks include:
Is Brown pudding mean score higher than Green pudding mean score?
Check all assumptions graphically and formally.
iceRICEp423.txt: test whether method B heat of fusion is significantly lower than method A; check all assumptions.
fishmercuryRICEp451.txt: compare mercury levels between Selective Reduction vs Permanganate methods; also check subset where both > 0.4.
heads.csv: used in an example for coin-toss fairness (Youden dataset) with 9207 heads and 8743 tails grouped in fives; tests for fairness (p = 0.5).
red-blue.xls: dataset about wearing color and contest outcomes; used for hypothesis testing of color effects on winning probability and comparing across sports.
survey.csv: smoking and exercise data; used to illustrate chi-squared test for independence and to demonstrate warning messages and aggregation.
Tips and resources:
When chi-squared warnings occur, consider combining rows/columns or using alternative tests.
The eFundi resource and online tutorials cited in slides provide examples and code variations.
8) Quick reference: key outputs and interpretation
One-sample t-test: t-statistic, df = n - 1, p-value, 95% CI for the mean, and sample mean. Use to test whether the population mean equals a hypothesized value.
Two-sample t-test: compare two means; equal vs unequal variances (var.equal option). Look at t-statistic, df, p-value, and confidence interval for the difference of means.
Paired t-test: tests mean difference in paired observations; use when data are naturally matched.
Shapiro-Wilk: W statistic and p-value; used to assess normality. A small p-value suggests non-normality.
Wilcoxon tests: non-parametric alternatives to t-tests (test medians/ranks rather than means).
Chi-squared: assess goodness-of-fit or independence.
Goodness-of-fit: compare observed frequencies to expected under a specified distribution.
Independence: test whether two categorical variables are independent in a contingency table.
Warnings: small expected frequencies (<5) undermine chi-squared approximations; remedy by combining categories or using exact tests.
9) Key formulas recap (LaTeX)
One-sample t-statistic: t = (\bar{x} - \mu_0) / (s / \sqrt{n}), df = n - 1
Two-sample t-statistic (equal variances): t = (\bar{x}_1 - \bar{x}_2) / (s_p \sqrt{1/n_1 + 1/n_2}), where s_p^2 = ((n_1 - 1) s_1^2 + (n_2 - 1) s_2^2) / (n_1 + n_2 - 2), df = n_1 + n_2 - 2
Paired t-statistic: t = \bar{d} / (s_d / \sqrt{n}), df = n - 1
Chi-squared (goodness-of-fit or independence): \chi^2 = \sum (O - E)^2 / E, with df = k - 1 (goodness-of-fit) or (r - 1)(c - 1) (independence)
Normality: Shapiro-Wilk test statistic W with p-value; used to assess normality assumption for t-tests.
// End of notes. Use these sections and formulas as a comprehensive study aid for Hypothesis Testing in the unit.