Week 5 Notes: Hypothesis Testing (NHST, z-test, t-test, paired-samples, independent-samples, and Pearson correlation)

Part 1. Introduction to Inferential Statistics

Inferential statistics in psychology center on using information from a sample to make inferences about a population. The foundational distinction is between populations and samples: a population is the entire group of interest (for example, all PSY2041 students), while a sample is a smaller subset drawn from that population. Descriptive statistics characterize the sample or population as it is, whereas inferential statistics use data from a sample to infer something about the population. The material emphasizes that probabilities govern how surprising an observed outcome should be if we assume no effect, which links directly to the concept of NHST.

Probability and the ‘surprisingness’ of events are central. The lower the probability of an event, the more surprised we should be when it happens. For instance, measuring a sample mean near the center of the distribution is less surprising than obtaining extreme values like 78 or 122, and extremely unlikely values like 68 or 142 would be very surprising. These ideas guide how we decide whether observed data are compatible with the null hypothesis.

Null Hypothesis Significance Testing (NHST) is introduced as one framework for conducting inferential statistics. We begin by stating that there will be no effect (the null hypothesis, H0) and then assess how surprised we are by the data. We use the notation H0 for the null hypothesis and H1 for the alternative hypothesis. In psychology, we typically do not say we prove or disprove a hypothesis; we either reject H0 or fail to reject H0. A classic IQ example sets H0: \mu = 100 (the mean IQ of PSY2041 students equals the population mean) and H1: \mu \neq 100 (there is a difference).

Significance and p-values are the practical tools for NHST. The p-value quantifies how surprising the observed data would be if the null hypothesis were true. P-values range from 0 to 1; smaller p-values indicate greater surprise under H0. Conventionally, p-values below .05 (p < .05) are deemed significant, and p-values above .05 are non-significant. While this threshold is conventional, it is not magical; it reflects a convention about long-run decision accuracy in many contexts. A p-value of, say, 0.0000001 is highly significant, whereas a p-value of 0.37 suggests no reason to reject H0.

The IQ example illustrates this directly. The population IQ is assumed to have mean 100 and standard deviation 15. With a sample mean of 115, sample size N = 30, and known population standard deviation, a z-test yields z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{N}} = \frac{115 - 100}{15 / \sqrt{30}} = 5.48. Under the null, z is distributed as N(0,1). A z of 5.48 corresponds to a p-value around 1 \times 10^{-7}, which is far below 0.05, so we reject H0 and conclude that PSY2041 students do not have the same average IQ as the general population. A second z-test example with z = 0.36 is reported with p = 0.37 (note that the two-tailed p for z = 0.36 is about 0.72; the reported figure is closer to a one-tailed probability, but either way p is well above .05), leading to a non-significant result and a failure to reject H0, meaning we do not have evidence that the means differ.
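The z-test arithmetic above can be sketched in a few lines of Python, using only the standard library. The two-tailed p-value under the standard normal comes from the identity 2(1 − Φ(|z|)) = erfc(|z| / √2):

```python
from math import sqrt, erfc

# Worked IQ example from the notes: known population mean 100 and SD 15,
# observed sample mean 115 from N = 30 students.
mu0, sigma = 100.0, 15.0
x_bar, n = 115.0, 30

z = (x_bar - mu0) / (sigma / sqrt(n))   # one-sample z statistic
p = erfc(abs(z) / sqrt(2))              # two-tailed p under N(0, 1)

print(f"z = {z:.2f}, two-tailed p = {p:.1e}")  # z = 5.48, p well below .05
```

The exact two-tailed value works out to roughly 4 × 10^{-8}; the notes' figure of about 10^{-7} leads to the same order-of-magnitude conclusion, far below .05.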

Research questions and hypotheses should be defined prior to data collection. A research question is followed by hypotheses, typically with H0 stating no effect or no difference and H1 stating there is an effect or difference. These hypotheses are informed by prior literature and theory, often summarized in a literature review and rationale within a report. Some example question–hypothesis pairs include: Are NBA players taller on average than the general public? H0: NBA players have the same mean height as the public; H1: NBA players are taller on average. Does a new medication impact depression symptoms? H0: Depression symptoms are the same on average for medication and placebo; H1: Depressive symptoms differ between the two groups. Is sleep loss associated with more dangerous driving? H0: No association between sleep quantity and crash likelihood; H1: A significant association exists.

A three-step view of NHST is emphasized: Step 1, compute a test statistic that has a known distribution under the null; Step 2, determine how extreme the statistic would be if H0 is true; Step 3, interpret the result, where more extreme statistics yield smaller p-values and stronger evidence against H0. The null hypothesis here typically specifies a mean of 100 for IQ or a zero effect for other measures, with data types that are continuous and either normally distributed or approximately normal.

Key characteristics of these hypotheses and data include: H0 states a population mean of 100 for PSY2041 IQ; we compare a single sample to a reference value; population standard deviation is known; H1 is non-directional (two-tailed) unless a specific directional hypothesis is stated; data are continuous (interval or ratio) and assumed to be normally distributed. When the population standard deviation is unknown, a one-sample t-test is used instead of a z-test. The material also notes that the assumption of known population SD is rarely true in practice, which motivates the use of the t-test in many real-world scenarios.

Part 2. One-Sample Hypothesis Tests

The one-sample z-test is appropriate when comparing the mean of a single sample to a reference value and when the population standard deviation is known. The general formula is z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{N}}. In the IQ example, the reference value is 100 and the general public SD is 15. A first example uses data where \bar{x} = 115, N = 30, and \sigma = 15, giving z = \frac{115 - 100}{15 / \sqrt{30}} = 5.48. This yields a p-value around 1 \times 10^{-7}, which is highly significant, leading to rejection of H0 and the conclusion that PSY2041 students do not share the same mean IQ as the general public. A second example uses z = 0.36 with p = 0.37, which is not significant, so we fail to reject H0 and conclude no evidence of a difference.

The one-sample z-test rests on several assumptions: (a) data come from a single sample measured on an interval or ratio scale; (b) the research question involves comparing the mean to a reference value; (c) the data are normally distributed; (d) the population standard deviation is known. In practice condition (d) is rarely true, in which case a one-sample t-test is more appropriate.

The one-sample t-test is the appropriate test when you have a single sample with continuous interval/ratio data, the research question involves comparing the mean to a reference value, and the data are normally distributed, but the population standard deviation is unknown. The t-statistic is computed as t = \frac{\bar{x} - \mu_0}{s / \sqrt{N}}, where s is the sample standard deviation. In the example, with \bar{x} = 115, s = 13.01, N = 30, we get a t-statistic of t = \frac{115 - 100}{13.01 / \sqrt{30}} = 6.32. The degrees of freedom are df = N - 1 = 29. The p-value is reported as p < .0001. Since this is less than 0.05, we reject H0 and conclude that PSY2041 students do not have the same average IQ as the general public.
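As a sketch of the one-sample t-test from the summary statistics above (scipy is assumed available for the t-distribution's survival function, since raw scores are not given in the notes):

```python
from math import sqrt
from scipy.stats import t as t_dist

# Summary statistics from the worked example: sample mean 115,
# sample SD 13.01, N = 30, reference value mu0 = 100.
mu0 = 100.0
x_bar, s, n = 115.0, 13.01, 30

t_stat = (x_bar - mu0) / (s / sqrt(n))   # t = 6.32 in the example
df = n - 1                               # df = 29
p = 2 * t_dist.sf(abs(t_stat), df)       # two-tailed p-value

print(f"t({df}) = {t_stat:.2f}, p = {p:.2e}")  # p < .0001, reject H0
```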

Recap: The z-test vs the t-test shows that the z-test is used when the population SD is known, whereas the t-test is used when the population SD is unknown and is estimated from the sample. The t-test uses df = N - 1 for a one-sample case. The two-sample and paired-sample tests are addressed in the next section.

Part 3. Between-Groups Hypothesis Tests

The material then moves to tests that compare two sets of scores. The paired-samples t-test is used when the same individuals are measured twice (e.g., before and after treatment). The null hypothesis is H0: the mean difference is 0; the alternative is H1: the mean difference ≠ 0. The test statistic is calculated on the distribution of the difference scores, effectively performing a one-sample t-test on the differences: t = \bar{D} / (s_D / \sqrt{N}), with df = N - 1. Example data: depression severity (BDI) before and after CBT for 100 participants. Before: mean = 27.03, SD = 16.70; after: mean = 19.20, SD = 16.78; difference scores (after − before) have a mean of −7.83 and SD of 6.45. The reported result is t(99) = −12.14, p < .05, indicating a significant improvement after treatment.
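The equivalence between a paired-samples t-test and a one-sample t-test on the difference scores can be checked directly in Python. The before/after values below are simulated, since the notes give only summary statistics for the BDI example:

```python
import numpy as np
from scipy.stats import ttest_rel, ttest_1samp

# Illustrative (made-up) before/after scores for 8 participants.
rng = np.random.default_rng(0)
before = rng.normal(27, 16, size=8)
after = before - rng.normal(8, 6, size=8)  # scores tend to drop after treatment

# A paired-samples t-test is the same computation as a one-sample
# t-test on the difference scores against a reference value of 0.
paired = ttest_rel(after, before)
one_sample = ttest_1samp(after - before, popmean=0)

print(f"paired:     t = {paired.statistic:.3f}, p = {paired.pvalue:.4f}")
print(f"one-sample: t = {one_sample.statistic:.3f}, p = {one_sample.pvalue:.4f}")
```

Both calls produce identical t and p values, which is why the notes describe the paired test as "effectively performing a one-sample t-test on the difference scores."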

The independent-samples t-test compares two independent samples. The example provided contrasts depressive symptoms after 16 weeks of a novel anti-depressant vs. an established formula. The two groups are independent, so a paired design is not possible, and we compare the two group distributions rather than difference scores. In the example, the sample means are 28.75 (SD = 10.89) for the Novel group and 29.01 (SD = 11.03) for the Existing group. The reported statistic is t(98) = −0.12, p > .05, a non-significant result. The standard formula for the independent-samples t-test with equal variances is t = \frac{\bar{X}_1 - \bar{X}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}, where s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}. If equal variances cannot be assumed, Welch's version of the t-test is used instead. The null hypothesis is H0: \mu_1 = \mu_2, and the alternative is H1: \mu_1 \neq \mu_2 (two-tailed).
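The pooled-variance formula can be verified against scipy's summary-statistics version of the test. Group sizes of 50 per group are an assumption here, inferred from the reported df of 98 (n_1 + n_2 − 2 = 98):

```python
from math import sqrt
from scipy.stats import ttest_ind_from_stats

# Summary statistics from the notes; n = 50 per group is inferred from df = 98.
m1, s1, n1 = 28.75, 10.89, 50   # Novel anti-depressant group
m2, s2, n2 = 29.01, 11.03, 50   # Existing formula group

# Pooled-variance (equal variances assumed) formula, as in the notes.
sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
t_manual = (m1 - m2) / (sqrt(sp2) * sqrt(1 / n1 + 1 / n2))

# scipy's equal-variance test from summary statistics gives the same result.
res = ttest_ind_from_stats(m1, s1, n1, m2, s2, n2, equal_var=True)

print(f"t(98) = {res.statistic:.2f}, p = {res.pvalue:.2f}")  # t = -0.12, p > .05
```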

A lab report example for independent-samples t-tests is given in the material. It uses known-groups validity to compare levels of academic anxiety between students who self-report having used generative AI to assist them in preparing an academic assessment during Semester 1 and those who did not. The rationale and hypotheses for this question would be discussed in tutorials.

Part 4. Correlational Hypothesis Tests

The Pearson correlation analysis is introduced as a method to quantify the strength of association between two continuous variables X and Y. A correlation coefficient r ranges between -1 and 1. The sign indicates whether the relationship is positive or negative, and the magnitude indicates the strength of the association. For example, r = 0.4 indicates a moderate positive relationship, r = 0.8 a strong relationship, and r = 1 a perfect relationship; similarly, r = -0.4 indicates a moderate negative relationship, with larger magnitudes indicating stronger relationships. Real-world examples include anxiety and depression symptoms (positive correlation) or sleep and likelihood of traffic accidents (negative correlation).

The Pearson correlation test is appropriate when two variables are measured on an interval or ratio scale, are normally distributed, and the aim is to understand the population association between them. The test statistic is r, with the null hypothesis that the population correlation ρ = 0. Significance testing involves assessing how likely it would be to observe a sample correlation as extreme as the one observed if the population correlation were truly 0. The material notes that the calculation of the p-value for r will be discussed in Week 8, but the general approach is to transform r into a t-statistic: t = r \sqrt{\frac{n - 2}{1 - r^2}}, with df = n - 2, and compare to the appropriate t-distribution to obtain a p-value.
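The r-to-t conversion described above can be checked numerically: the p-value scipy reports for a Pearson correlation matches the one obtained by transforming r into a t-statistic with df = n − 2. The anxiety/depression data below are simulated for illustration:

```python
import numpy as np
from scipy.stats import pearsonr, t as t_dist

# Illustrative (made-up) data: two positively related continuous variables,
# loosely modelled on the anxiety/depression example in the notes.
rng = np.random.default_rng(1)
anxiety = rng.normal(50, 10, size=40)
depression = 0.6 * anxiety + rng.normal(0, 8, size=40)

r, p_scipy = pearsonr(anxiety, depression)

# Transform r into a t-statistic with df = n - 2, as described in the notes.
n = len(anxiety)
t_stat = r * np.sqrt((n - 2) / (1 - r**2))
p_manual = 2 * t_dist.sf(abs(t_stat), n - 2)

print(f"r = {r:.2f}, t({n - 2}) = {t_stat:.2f}, p = {p_manual:.2e}")
```

The two p-values agree, confirming that testing H0: ρ = 0 with Pearson's r is equivalent to a t-test on the transformed statistic.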

It is important to remember that correlation does not imply causation. Spurious correlations can arise due to chance or from a third variable that influences both measured variables. The lab report example for correlational analysis concerns test-retest reliability, where a Pearson correlation assesses the consistency of test scores across two time points (X and Y). The literature often uses intraclass correlation (ICC) for reliability, but Pearson is used here as an accessible alternative within PSY2041.

Throughout Week 5, careful attention is given to the framing of research questions and hypotheses, the choice of appropriate test statistics, and the interpretation of p-values in the context of NHST. Optional readings are suggested (e.g., Navarro, 2016, Chapter 13 on comparing two means) for deeper understanding, and several YouTube channels are recommended as supplementary resources.

Key formulas recalled in this week’s notes include the z-test, one-sample t-test, paired-samples t-test, independent-samples t-test, and correlation-based tests. The z-test statistic is z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{N}}; the one-sample t-test statistic is t = \frac{\bar{x} - \mu_0}{s / \sqrt{N}}; the paired-samples t-test statistic is t = \frac{\bar{D}}{s_D / \sqrt{N}}; the independent-samples t-test statistic (two-sample, assuming equal variances) is t = \frac{\bar{X}_1 - \bar{X}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}, with s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}; and the correlation test statistic is t = r \sqrt{\frac{n - 2}{1 - r^2}} with df = n - 2. The general decision rule is to reject H0 when the p-value is less than 0.05, indicating the observed data would be very unlikely if there were no true effect or relationship.