Hypothesis Testing: Key Concepts and R Implementation (STTN327, Unit 04)

  • Purpose: test whether population parameters (means, proportions) are consistent with hypothesized values, or whether they differ between groups.

  • Types covered:

    • Mean of 1 population (μ)

    • Means of 2 independent populations (μ1, μ2)

    • Means of 2 dependent (paired) populations (μD)

    • Non-parametric tests (Wilcoxon)

    • Chi-squared tests (Goodness-of-fit and Independence)

  • R focus: primary functions and typical options; how to interpret outputs and check assumptions.


1) Hypothesis tests for the mean of 1 population (µ)

  • Basic one-sample t-test in R:

    • Function: t.test(X)

    • Null hypothesis: H0: μ = μ0 (default μ0 = 0 if not specified)

    • Alternatives: alternative = "less", "greater", or "two.sided"

    • Specifying a hypothesized mean: mu = μ0

    • Notes: Type "?t.test" for more options.

  • Key formula (one-sample):

    • Test statistic: t = \frac{\bar{X} - \mu_0}{s/\sqrt{n}}

    • Degrees of freedom: df = n - 1

    • Interpretation: the p-value assesses the evidence against H0: μ = μ0 in favor of the specified alternative.

  • Example: Test if population mean body temperature is > 36 using 5 observations.

    • Data: temp <- c(33, 38, 37, 39, 36)

    • R command: t.test(temp, alternative = "greater", mu = 36)

    • Output (interpreted):

    • t = 0.58277, df = 4, p-value = 0.2957

    • Alternative: true mean is greater than 36

    • 95% CI: (34.40513, Inf)

    • Sample mean: 36.6

    • Takeaway: with this small sample, there is not enough evidence to conclude that the population mean exceeds 36 at the 5% level.
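The example above can be run end to end; a minimal sketch (the inline comments restate the slide output):

```r
# One-sided, one-sample t-test: H0: mu = 36 vs HA: mu > 36
temp <- c(33, 38, 37, 39, 36)
res <- t.test(temp, alternative = "greater", mu = 36)

round(unname(res$statistic), 5)  # t = 0.58277
unname(res$parameter)            # df = 4
round(res$p.value, 4)            # p-value = 0.2957
mean(temp)                       # sample mean = 36.6
```

Since p = 0.2957 > 0.05, H0 is not rejected, matching the takeaway above.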


2) Testing for normality (prerequisite for t-tests)

  • Why test normality?

    • The t-test assumes that the underlying population is approximately normal (especially relevant for small samples).

    • If the normality assumption is violated, p-values from t.test may not be trustworthy; consider non-parametric methods.

  • Graphical tests (to assess normality):

    • Normal QQ-plot (qqnorm + qqline)

    • Box-plot (less definitive for normality)

    • Histogram with kernel density and overlaid normal density

  • Example (graphical):

    • Data: temp <- c(33, 38, 37, 39, 36)

    • Commands:

    • QQ-plot: qqnorm(temp); qqline(temp)

    • Box-plot: boxplot(temp)

    • Histogram with overlay: hist(temp, freq = FALSE)

      • Add density plot: d <- density(temp); lines(d, lty = 2)

      • Overlay normal density: lines(d$x, dnorm(d$x, mean(temp), sd(temp)))

  • Formal test: Shapiro-Wilk test

    • Command: shapiro.test(X)

    • Null: H0: X comes from a normally distributed population; alternative: not normal.

    • Example: temp <- c(33, 38, 37, 39, 36)

    • Output: W = 0.9427, p-value = 0.6853

    • Interpretation: fail to reject normality (in this example).

  • Conclusion about normality:

    • If data are from a normal population, p-values from t.test can be trusted.

    • If not normal, consider non-parametric methods (e.g., Wilcoxon tests) instead of t-tests.
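The graphical and formal checks in this section can be combined into one short script (reusing the temperature data from Section 1; the quoted W and p-value are the slide's output for these data):

```r
temp <- c(33, 38, 37, 39, 36)

# Graphical checks: QQ-plot, box-plot, histogram with density overlays
qqnorm(temp); qqline(temp)
boxplot(temp)
hist(temp, freq = FALSE)
d <- density(temp)                            # kernel density estimate
lines(d, lty = 2)
lines(d$x, dnorm(d$x, mean(temp), sd(temp)))  # fitted normal density

# Formal check: Shapiro-Wilk (H0: data come from a normal population)
sw <- shapiro.test(temp)
sw$statistic  # W = 0.9427
sw$p.value    # p-value = 0.6853: no evidence against normality here
```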


3) Hypothesis tests for 2 independent populations (μ1 vs μ2)

  • Null and alternative:

    • H0: μ1 = μ2 vs HA: μ1 ≠ μ2 (two-sided; can be one-sided with "less" or "greater").

  • Basic two-sample t-test in R:

    • Command variants:

    • t.test(X1, X2) where X1 and X2 are the two samples

    • Or: t.test(X ~ grp) where X is the pooled data and grp is a grouping variable

    • Options:

    • alternative = "less", "greater", or "two.sided"

    • var.equal = TRUE/FALSE (assume equal variances or not; Welch's t-test if FALSE)

    • Help: Type "?t.test" for more options.

  • Assumptions to check before the t-test (two independent samples):

    • Normality of the pooled data (or each group) – assess by pooling centered and scaled data to test normality:

    • Center and scale each sample:

      • X^*_i = \frac{X_i - \bar{X}}{s_X} for group 1,

      • Y^*_j = \frac{Y_j - \bar{Y}}{s_Y} for group 2

      • \bar{X} = \frac{1}{n_1}\sum X_i, \quad s_X^2 = \frac{1}{n_1-1}\sum (X_i - \bar{X})^2

      • \bar{Y} = \frac{1}{n_2}\sum Y_j, \quad s_Y^2 = \frac{1}{n_2-1}\sum (Y_j - \bar{Y})^2

    • Pool the standardized values and test normality on the pooled sample (standardizing removes differences in group means and variances, so a single normality test can be applied to all observations under H0).

    • Equal variances: test H0: σ1² = σ2² vs HA: σ1² ≠ σ2²

    • Command: var.test(X1, X2) or var.test(X ~ grp)

  • Example: Body temperatures (Men vs Women)

    • Data:

    • tempM <- c(37, 39, 36, 34, 35) (Men)

    • tempF <- c(33, 35, 33, 34) (Women)

    • Normality check (pooled after centering and scaling):

    • tempM.s = (tempM - mean(tempM)) / sd(tempM)

    • tempF.s = (tempF - mean(tempF)) / sd(tempF)

    • pooled <- c(tempM.s, tempF.s)

    • shapiro.test(pooled) # Example: W = 0.9140, p-value = 0.345 (illustrative)

    • Equal variances: var.test(tempM, tempF) # Example statistic: F = 4.0364, p-value = 0.281

    • Two-sample t-test (assuming equal variances):

    • Command: t.test(tempF, tempM, alternative = "less", var.equal = TRUE)

    • Output: t = -2.3066, df = 7, p-value = 0.02723; means: tempF ≈ 33.75, tempM ≈ 36.20

    • Interpretation: evidence that mean temperature of women is less than that of men (at typical significance levels).
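The whole Section 3 workflow can be put together as one runnable sketch (inline comments restate the slide output):

```r
tempM <- c(37, 39, 36, 34, 35)  # Men
tempF <- c(33, 35, 33, 34)      # Women

# 1) Normality: standardize each group, pool, then test
tempM.s <- (tempM - mean(tempM)) / sd(tempM)
tempF.s <- (tempF - mean(tempF)) / sd(tempF)
shapiro.test(c(tempM.s, tempF.s))

# 2) Equal variances: F-test (H0: the two population variances are equal)
var.test(tempM, tempF)

# 3) Pooled two-sample t-test, HA: mean of women < mean of men
res <- t.test(tempF, tempM, alternative = "less", var.equal = TRUE)
round(unname(res$statistic), 4)  # t = -2.3066
unname(res$parameter)            # df = 7
round(res$p.value, 5)            # p-value = 0.02723
```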


4) Hypothesis tests for 2 dependent (paired) populations (μD)

  • Setup: test if mean difference between paired samples is zero.

  • Null/Alternative:

    • H0: μD = 0 vs HA: μD ≠ 0 (two-sided), or one-sided depending on the hypothesis.

  • Basic paired t-test in R:

    • Command variants:

    • t.test(X1, X2, paired = TRUE)

    • Or: t.test(X ~ grp, paired = TRUE) where grp encodes the pairing

    • Or: define differences Xd = X2 - X1 and run t.test(Xd)

    • The paired t-test statistic:

    • Let differences d_i = X_{2i} - X_{1i}, with mean \bar{d} and SD s_d; then
      t = \frac{\bar{d}}{s_d/\sqrt{n}} with df = n - 1

  • Example: Body temperature before vs after sleeping (n = 5)

    • Data:

    • Before: tempBef = c(38, 39, 36, 34, 35)

    • After: tempAft = c(36, 39, 35, 35, 35)

    • Tests (any of the three approaches will work):

    • t.test(tempAft, tempBef, paired = TRUE, alternative = "less")

    • t.test(tempAft - tempBef, alternative = "less")

    • Or: poolTemp <- c(tempAft, tempBef); grp <- rep(1:2, each = 5); t.test(poolTemp ~ grp, paired = TRUE, alternative = "less")

    • Common output:

    • Paired t-test; t = -0.7845, df = 4, p-value = 0.2383

    • Alternative: true difference in means is less than 0

    • 95% CI: (-Inf, 0.6870)

    • Mean difference: -0.4

  • Interpretation: with this data, there is not enough evidence that after sleeping temperatures are lower than before (at 5% level).
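A short sketch confirming that the paired formulations agree (comments restate the slide output):

```r
tempBef <- c(38, 39, 36, 34, 35)
tempAft <- c(36, 39, 35, 35, 35)

r1 <- t.test(tempAft, tempBef, paired = TRUE, alternative = "less")
r2 <- t.test(tempAft - tempBef, alternative = "less")  # same test on differences

round(unname(r1$statistic), 4)  # t = -0.7845
unname(r1$parameter)            # df = 4
round(r1$p.value, 4)            # p-value = 0.2383
mean(tempAft - tempBef)         # mean difference = -0.4
```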


5) Non-parametric hypothesis tests (Wilcoxon tests)

  • When to use: when normality assumptions are questionable or sample sizes are small.

  • Basic Wilcoxon tests in R:

    • One-sample: wilcox.test(X)

    • Two-sample: wilcox.test(X1, X2) or wilcox.test(X ~ grp)

  • Relationship to t-test: Wilcoxon tests are called and interpreted much like t.test, but they concern the median (not the mean) and are robust to non-normal data.

  • Help: Type "?wilcox.test" for options.

  • Example exercises (from slides):

    • Repeat the exercise from last week using Wilcoxon tests on data in taste.txt:

    • Test whether mean score of Green pudding is greater than 35.

    • Test whether Green and Brown puddings differ in mean score.

  • Note: Wilcoxon tests can be a drop-in replacement for t-tests in some analyses, but interpret medians/rank-based results rather than means.
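taste.txt is not reproduced in these notes, so the sketch below reuses the earlier temperature data purely to illustrate the one- and two-sample calls:

```r
temp  <- c(33, 38, 37, 39, 36)
tempM <- c(37, 39, 36, 34, 35)
tempF <- c(33, 35, 33, 34)

# One-sample: is the location (median) greater than 36?
w1 <- wilcox.test(temp, mu = 36, alternative = "greater")

# Two-sample (Mann-Whitney): is the women's location below the men's?
w2 <- wilcox.test(tempF, tempM, alternative = "less")

w1$p.value; w2$p.value
# Ties/zero differences in such tiny samples trigger a warning that the exact
# p-value cannot be computed; R falls back to a normal approximation.
```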


6) Chi-squared tests in R: Goodness-of-fit and Independence

  • Purpose: compare observed frequencies to expected frequencies under a specified distribution (goodness-of-fit) or test independence between two categorical variables (two-way tables).

  • Basic one-way chi-squared test (goodness-of-fit):

    • Function: chisq.test(X) where X is a vector of observed frequencies, or chisq.test(table(X)) from raw data.

    • You can specify the expected probabilities with chisq.test(X, p = p_vec), where p_vec contains the expected probabilities summing to 1.

    • Null: H0: the observed frequencies are consistent with the expected frequencies E_i for all categories (the data follow the specified distribution).

    • Test statistic: \chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}

  • Example (eye colors):

    • Observed: ObsFreqs <- c(17, 65, 18) for Green, Blue, Brown

    • Expected (under some distribution): ExpFreqs <- c(25, 50, 25)

    • Compute ExpFreqs.p = ExpFreqs / sum(ExpFreqs) to convert to probabilities, then run:

    • chisq.test(ObsFreqs, p = ExpFreqs.p)

    • Output example: X-squared = 9.02, df = 2, p-value = 0.011
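The eye-colour example is fully reproducible (comments restate the slide output):

```r
ObsFreqs <- c(17, 65, 18)               # observed: Green, Blue, Brown
ExpFreqs <- c(25, 50, 25)               # expected counts under H0
ExpFreqs.p <- ExpFreqs / sum(ExpFreqs)  # convert counts to probabilities

res <- chisq.test(ObsFreqs, p = ExpFreqs.p)
round(unname(res$statistic), 2)  # X-squared = 9.02
unname(res$parameter)            # df = 2
round(res$p.value, 3)            # p-value = 0.011
res$expected                     # 25 50 25: matches ExpFreqs
```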

  • Example: survey data with possible warning about approximation

    • Data: read in survey.csv and construct a table with Smoking vs Exercise; run:

    • tbl <- table(survey$Smoke, survey$Exer)

    • chisq.test(tbl)

    • Possible warning: "Chi-squared approximation may be incorrect" if some expected counts < 5.

    • Remedy: combine categories to increase expected counts (e.g., merge None/Some exercise into a single category).

    • Example adjustment: combo <- tbl[,"None"] + tbl[,"Some"]; newtbl <- cbind(tbl[,"Freq"], combo); chisq.test(newtbl)
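Assuming the slides' survey.csv matches the survey data frame shipped with the MASS package (an assumption; the CSV itself is not reproduced here), the workflow looks like:

```r
library(MASS)  # provides the 'survey' data frame (Smoke and Exer columns)

tbl <- table(survey$Smoke, survey$Exer)
chisq.test(tbl)  # warns: some expected counts are below 5

# Remedy: merge the sparse "None" and "Some" exercise columns
newtbl <- cbind(tbl[, "Freq"], tbl[, "None"] + tbl[, "Some"])
chisq.test(newtbl)
```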

  • Two-way chi-squared test (tests of independence):

    • Raw data approach:

    • E.g., two factors Hair and Eyes; build vectors: Hair <- factor(…), Eyes <- factor(…)

    • Run: chisq.test(Hair, Eyes) or convert to a table first:

      • tbl <- table(Hair, Eyes)

      • chisq.test(tbl)

    • Output example (Hair vs Eyes): X-squared = 1.9658, df = 4, p-value = 0.7421

    • Tabulated data approach (matrix input):

    • ObsFreqMat <- matrix(c(6,8,4,7,6,6,4,6,7), ncol=3, byrow=TRUE)

    • chisq.test(ObsFreqMat)

    • Output: X-squared = 1.9658, df = 4, p-value = 0.7421
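The tabulated-data example can be verified directly (comments restate the slide output):

```r
# 3x3 contingency table entered row by row
ObsFreqMat <- matrix(c(6, 8, 4,
                       7, 6, 6,
                       4, 6, 7), ncol = 3, byrow = TRUE)

res <- chisq.test(ObsFreqMat)
round(unname(res$statistic), 4)  # X-squared = 1.9658
unname(res$parameter)            # df = (3-1)*(3-1) = 4
round(res$p.value, 4)            # p-value = 0.7421
round(res$expected, 2)           # expected counts under independence
```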

  • Warnings and caveats: chi-squared approximation may be incorrect when expected counts are small (<5). Remedies include data aggregation or alternative tests (e.g., Fisher’s exact test) not covered in slides but common in practice.
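A minimal illustration of that remedy, using a hypothetical 2x2 table (not from the slides):

```r
# Tiny table: every expected count is 2, far below the usual threshold of 5
m <- matrix(c(3, 1,
              1, 3), ncol = 2, byrow = TRUE)

chisq.test(m)           # warns that the approximation may be incorrect
fisher.test(m)$p.value  # exact p-value; no large-sample approximation needed
```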

  • Additional data/examples referenced in exercises:

    • Example 1: heads.csv (fair coin check) – test whether data are consistent with p = 1/2.

    • Example 2: survey on Opinion x Personnel type – test independence between two categorical variables.

    • Example 3: Mathematics achievement by sex (Open University 1983) – test for relationship between sex and achievement; also asks to implement a custom chisq.test-like function.

    • Example 4: Red vs Blue colors in sports outcomes – test whether red wins are 50/50 across sports; test distribution similarity across sports.


7) Worked datasets and homework/exercises mentioned in slides

  • taste.txt: used for multiple hypothesis-testing exercises (1-sample and 2-sample Wilcoxon/t-test variants). Tasks include:

    • Is Brown pudding mean score higher than Green pudding mean score?

    • Check all assumptions graphically and formally.

  • iceRICEp423.txt: test whether method B heat of fusion is significantly lower than method A; check all assumptions.

  • fishmercuryRICEp451.txt: compare mercury levels between Selective Reduction vs Permanganate methods; also check subset where both > 0.4.

  • heads.csv: used in an example for coin-toss fairness (Youden dataset) with 9207 heads and 8743 tails grouped in fives; tests for fairness (p = 0.5).

  • red-blue.xls: dataset about wearing color and contest outcomes; used for hypothesis testing of color effects on winning probability and comparing across sports.

  • survey.csv: smoking and exercise data; used to illustrate chi-squared test for independence and to demonstrate warning messages and aggregation.

  • Tips and resources:

    • When chi-squared warnings occur, consider combining rows/columns or using alternative tests.

    • The eFundi resource and online tutorials cited in slides provide examples and code variations.


8) Quick reference: key outputs and interpretation

  • One-sample t-test: t-statistic, df = n - 1, p-value, 95% CI for the mean, and sample mean. Use to test whether the population mean equals a hypothesized value.

  • Two-sample t-test: compare two means; equal vs unequal variances (var.equal option). Look at t-statistic, df, p-value, and confidence interval for the difference of means.

  • Paired t-test: tests mean difference in paired observations; use when data are naturally matched.

  • Shapiro-Wilk: W statistic and p-value; used to assess normality. A small p-value suggests non-normality.

  • Wilcoxon tests: non-parametric alternatives to t-tests (test medians/ranks rather than means).

  • Chi-squared: assess goodness-of-fit or independence.

    • Goodness-of-fit: compare observed frequencies to expected under a specified distribution.

    • Independence: test whether two categorical variables are independent in a contingency table.

  • Warnings: small expected frequencies (<5) undermine chi-squared approximations; remedy by combining categories or using exact tests.


9) Key formulas recap (LaTeX)

  • One-sample t-statistic:
    t=Xˉμ0s/ndf=n1t = \frac{\bar{X} - \mu_0}{s/\sqrt{n}}\quad df = n-1

  • Two-sample t-statistic (equal variances):
    s_p^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}, \qquad t = \frac{\bar{X}_1 - \bar{X}_2}{s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}

  • Paired t-statistic:
    d_i = X_{2i} - X_{1i}, \quad \bar{d} = \frac{1}{n}\sum d_i, \quad s_d^2 = \frac{1}{n-1}\sum (d_i - \bar{d})^2
    t = \frac{\bar{d}}{s_d/\sqrt{n}}, \quad df = n - 1

  • Chi-squared (goodness-of-fit or independence):
    \chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}, \quad df = k - 1 \text{ (or } df = (r-1)(c-1) \text{ for contingency tables)}

  • Normality: Shapiro-Wilk test statistic W with p-value; used to assess normality assumption for t-tests.


// End of notes. Use these sections and formulas as a comprehensive study aid for Hypothesis Testing in the unit.