Z-Score and Hypothesis Testing: Study Notes (Video Transcript Summary)

Setup of the Example and Key Variables

  • Anchor (mean) of the population distribution: 75 (the teacher uses 75 as the anchor mean for happiness).
  • Observed event: a person’s score (in the example, the brother’s happiness score is 58).
  • Population standard deviation (the typical bounce around the mean): 10.
  • The basic mapping goal is to understand where an observed score falls in a normal distribution with mean 75 and standard deviation 10.
  • Observed example values:
    • Observed score: x = 58
    • Mean (anchor): μ = 75
    • Standard deviation: σ = 10
  • Deviation (raw difference): d = x - μ = 58 - 75 = -17
  • Z-score (standardized deviation): z = (x - μ) / σ = -17 / 10 = -1.7
  • Observed event vs. expectation: this is the core of how we map observed events to the z-distribution and compare to what we expect by chance.
  • Goals of the exercise: translate an observed score into a z-score, read the z-table, convert areas to probabilities and percentages, and then connect to hypothesis testing (null vs alternative) and error control (Type I/II errors) with one-tailed vs two-tailed tests.
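
The translation step above can be sketched in a few lines (a minimal illustration; the function name `z_score` is ours, and the values are the ones from the example):

```python
def z_score(x, mu, sigma):
    """Standardize a raw score: how many SDs x lies from the mean mu."""
    return (x - mu) / sigma

# Example values from the notes: observed 58, mean 75, SD 10.
z = z_score(58, 75, 10)
print(z)  # -1.7
```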

From Observed Score to Z-Score and Mirror Score

  • The observed z-score for x = 58 is z = -1.7, which means the observed score is 1.7 standard deviations below the mean.
  • Area interpretation on the z-table:
    • The area to the left of z = -1.7 is approximately P(Z ≤ -1.7) = 0.0446, i.e. 4.46%.
    • The area between the mean (0) and z = -1.7 is 0.4554 in absolute terms; that is, 45.54% of the distribution lies between the mean and a point 1.7 standard deviations away.
  • The symmetric (reflected) “doppelganger” concept: if the observed z is -1.7, the reflected z is +1.7, corresponding to a raw score of x_ref = 75 + |d| = 75 + 17 = 92.
  • How to interpret the two sides: the observed score could be on either tail; the reflection helps illustrate symmetry in the normal distribution.
  • The raw score that corresponds to +1.7 standard deviations above the mean is 92 (since 75 + 17 = 92).
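
The reflection described above is just "same distance, opposite side of the mean," which reduces to 2μ - x. A one-line helper (hypothetical naming) makes the symmetry explicit:

```python
def mirror_score(x, mu):
    """Reflect a raw score across the mean: same distance, opposite side."""
    return 2 * mu - x

print(mirror_score(58, 75))  # 92  (the +1.7 SD "doppelganger" of 58)
```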

Converting Z to Probabilities and Percentages

  • One-tailed probability (lower tail for this example):
    • P(Z ≤ -1.7) = 0.0446, i.e. about 4.46%.
  • Two-tailed probability (extreme in either direction):
    • P(|Z| ≥ 1.7) = 2 × P(Z ≤ -1.7) = 2 × 0.0446 = 0.0892, i.e. about 8.92%.
  • The teacher’s workflow emphasizes both probability (numerical likelihood) and percentage forms (what percent of the population would be at or beyond that extreme).
  • Practical note: some questions ask for the probability that someone would score below 58 (which is 4.46%), while others ask for the percentage in either tail beyond 58 (which would be 8.92% for two-tailed consideration).
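
These tail areas can be checked without a printed z-table using the standard normal CDF, which Python's standard library supports via the error function (a stdlib-only sketch; the helper name `phi` is ours):

```python
import math

def phi(z):
    """Standard normal CDF: area under the curve to the left of z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

z = -1.7
p_lower = phi(z)          # one-tailed (lower-tail) probability
p_two = 2 * phi(-abs(z))  # two-tailed: both tails beyond |z| = 1.7

print(round(p_lower, 4))  # 0.0446
print(round(p_two, 4))    # 0.0891 (doubling the table's rounded 0.0446 gives 0.0892)
```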

The Z-Score Table and Mapping the Area

  • The z-table maps z-scores to areas under the standard normal distribution to the left of z.
  • For z = 1.7, the area to the left is approximately 0.9554; the area between the mean (0) and z = 1.7 is 0.4554.
  • The table is used to read off probabilities, which are then converted to percentages as needed.
  • Important conceptual takeaway: the z-table gives the area from the left up to z; when you want the area between the mean and z, you subtract 0.5 from the left-tail area or use the symmetric property of the normal curve.
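
The subtraction rule above can be verified numerically (again via the error-function form of the normal CDF; `phi` is our own helper, not a library call):

```python
import math

def phi(z):
    """Standard normal CDF: area to the left of z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

left_area = phi(1.7)         # area to the left of z = 1.7
mean_to_z = left_area - 0.5  # area between the mean (0) and z = 1.7

print(round(left_area, 4))  # 0.9554
print(round(mean_to_z, 4))  # 0.4554
```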

Hypothesis Testing Framework (Null vs Alternative)

  • Research setup in the example: test whether the observed score belongs to the “human happiness” distribution (null) or to a different distribution (alien, alternative).
  • Null hypothesis (H0): the observed event belongs to the population distribution of humans with mean 75 and standard deviation 10, i.e. H0: μ = 75.
  • Alternative hypothesis (H1): the observed event does not belong to that distribution (two-tailed) or is different in a specific direction (one-tailed).
  • The mean anchor 75 is an expectation: the scientist says the “anchor” is what we expect by chance; the observed 58 is being compared to this anchor.
  • The goal is to determine if the observed event is significantly different from the null distribution or if it could reasonably arise by chance given the null.
  • The direction of the alternative (one-tailed vs two-tailed) depends on theoretical expectations:
    • One-tailed: you expect a difference only in one direction (e.g., the score is especially low or especially high).
    • Two-tailed: you expect a difference but not a specific direction (the score could be either much lower or much higher).

Significance Level, Alpha, and Decision Rules

  • The significance level (alpha) is the line in the sand for deciding significance.
  • The teacher uses α = 0.05 (5%). This is a common default choice, though others (e.g., 0.01 or 0.001) are discussed as more conservative options.
  • One-tailed vs two-tailed critical values (decision boundaries):
    • Two-tailed test at α = 0.05: critical z-values ±z_0.975 = ±1.96. If |z| > 1.96, reject H0.
    • One-tailed test at α = 0.05: critical z-value z_0.95 = 1.645 (or -1.645, depending on direction). If the observed z is beyond that single-tail boundary, reject H0.
  • The line in the sand is an arbitrary threshold chosen before looking at the data, reflecting how much error you’re willing to tolerate (Type I error rate).
  • Type I error (false positive) occurs when you reject H0 even though H0 is true. With α = 0.05, you expect about 5% false positives if you repeated the study many times.
  • Type II error (false negative) occurs when you fail to reject H0 even though H1 is true. Power is defined as Power = 1 - β, the probability of correctly rejecting H0 when H1 is true.
  • The choice between one-tailed and two-tailed affects power: one-tailed tests have more power to detect a difference in the specified direction because the entire alpha is allocated to one tail; two-tailed tests split alpha across two tails, reducing power for any single tail when the direction is known.
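
The decision boundaries above fit into a small helper (hypothetical; the α = 0.05 critical values are hard-coded exactly as quoted, and the one-tailed branch assumes the observed direction matches the predicted one):

```python
def reject_null(z, two_tailed=True):
    """Decision rule at alpha = 0.05 using the critical values from the notes.

    Two-tailed: reject when |z| > 1.96.
    One-tailed: reject when |z| > 1.645, assuming the observed direction
    matches the direction predicted in advance.
    """
    critical = 1.96 if two_tailed else 1.645
    return abs(z) > critical

print(reject_null(-1.7, two_tailed=True))   # False: 1.7 < 1.96
print(reject_null(-1.7, two_tailed=False))  # True: 1.7 > 1.645
```

This is the power trade-off in miniature: the same observed z = -1.7 is significant one-tailed but not two-tailed.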

How the Theoretical Framework Maps to the Given Example

  • If you test H0: μ = 75 vs H1: a different distribution (alien), you examine where the observed 58 falls in the null distribution.
  • For the one-tailed scenario (unhappy direction), reject if z ≤ -z_0.95 = -1.645 (for α = 0.05, one-tailed). Since the observed z = -1.7 lies beyond this bound, you would reject at the one-tailed 0.05 level.
  • For a two-tailed scenario (difference in either direction), you compare to ±1.96. Since |z| = 1.7 < 1.96, you would not reject the null at the 0.05 level in a two-tailed test.
  • The choice of one-tailed vs two-tailed should be theory-driven (what you expect about the direction of the effect).
  • The “peapod” analogy: the null distribution is the population of humans; the alien is outside that distribution; significance tests ask how likely it is to observe such an extreme value if we were truly sampling only from the human distribution.

Confidence Intervals and Their Relation to Hypothesis Testing

  • A confidence interval (CI) can be constructed around the observed score for interpretation, using the same standard deviation as the bounce metric.
  • Example: CI around the observed 58 with SE equal to the standard deviation (for this context) is calculated as:
    • Center: x = 58; margin: z_0.975 × SE = 1.96 × 10 = 19.6
    • 95% CI: [58 - 19.6, 58 + 19.6] = [38.4, 77.6]
  • Interpretation: If the true population mean were 75 (the null), would 75 lie in this interval? Yes, it does lie in [38.4, 77.6]. Therefore, at the two-tailed 0.05 level, you would not reject H0 based on this single-observation CI.
  • Confidence intervals and p-values answer the same question in different ways:
    • P-values tell you the probability of observing data as extreme as what you got under H0.
    • CIs tell you a range of plausible values for the true effect (or true mean) given the data.
  • For mediation or other effects, a CI that excludes zero indicates significance; if zero lies within the CI, the effect is not statistically significant at the corresponding level.
  • The 95% CI concept can be extended to other statistics (means, correlations, mediation effects, etc.). The same logic applies: whether 0 or another null value lies inside the interval informs significance.
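
The CI logic can be sketched as follows (treating the SD as the SE, exactly as the example does; the function name is ours):

```python
def confidence_interval(center, se, z_crit=1.96):
    """Symmetric 95% interval: center +/- z_crit * SE."""
    margin = z_crit * se
    return (center - margin, center + margin)

lo, hi = confidence_interval(58, 10)  # approximately (38.4, 77.6)
null_mean = 75
print(lo <= null_mean <= hi)  # True -> do not reject H0 at the two-tailed 0.05 level
```

Note the duality with the two-tailed test: 75 inside the interval corresponds to |z| = 1.7 falling inside ±1.96.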

Practical Notes, Metaphors, and Practice Tips

  • Draw the picture: to understand how scores map to the distribution, the teacher repeatedly emphasizes drawing a simple picture and building the logic from there.
  • Always frame the observed event as a z-score first, then read probabilities from the z-table, and finally translate back to raw scores if needed (e.g., to understand what a z of -1.7 means in original units).
  • Always distinguish between probabilities (as fractions) and percentages. They are the same information, just in different units.
  • Know the core numbers by heart for this example:
    • Mean: 75
    • SD: 10
    • Observed: 58
    • Deviation: -17
    • Z: -1.7
    • P(X ≤ 58) = 0.0446 (4.46%)
    • Two-tailed p-value for |Z| ≥ 1.7: 0.0892 (8.92%)
    • Two-tailed critical value at α = 0.05: |Z| > 1.96
    • One-tailed critical value at α = 0.05: Z < -1.645 (for a lower-tail test)
    • 95% CI around the observed score: [38.4, 77.6]
  • Power, Type I error, and Type II error are three interconnected concepts:
    • Type I error: reject H0 when H0 is true (α; e.g., at 0.05, 5% of the time you will incorrectly reject a true null).
    • Type II error: fail to reject H0 when H1 is true (β).
    • Power: probability of correctly rejecting H0 when H1 is true (Power = 1 - β).
  • The choice of a one-tailed vs two-tailed test affects power and the allocation of the 5% alpha: one-tailed places all alpha in one tail, increasing power for a directional effect but at the risk of missing an effect in the opposite direction.
  • The speaker cautions against arbitrarily changing the alpha (e.g., lowering it after the fact) and highlights the importance of pre-specifying the line in the sand before looking at the data to avoid biased conclusions.
  • Real-world reporting: journals often report both p-values and confidence intervals to provide a fuller picture of the evidence and its precision. Confidence intervals give a sense of the range of plausible effects and how much the estimate might vary in repeated samples.
  • Final takeaway: statistics is about mapping observed events to distributions, understanding how much a result deviates from expectation, and making principled inferences about population-level effects while acknowledging uncertainty and error.

Quick Reference: Step-by-Step Walkthrough (Applied to the Example)

  • Step 1: Define the null distribution parameters: μ0 = 75, σ = 10.
  • Step 2: Compute z for the observed score: z = (x - μ0) / σ = (58 - 75) / 10 = -1.7.
  • Step 3: Read probabilities from the z-table: P(Z ≤ -1.7) = 0.0446 (4.46%); two-tailed p-value: P(|Z| ≥ 1.7) = 2 × 0.0446 = 0.0892 (8.92%).
  • Step 4: Interpret in the context of alpha = 0.05:
    • One-tailed decision: if the hypothesis is directional (e.g., unhappy direction only), check Z ≤ -1.645; since -1.7 < -1.645, you would reject H0 for a one-tailed test in that direction.
    • Two-tailed decision: since |Z| = 1.7 < 1.96, you would not reject H0 at the 0.05 level for a two-tailed test.
  • Step 5: Optional cross-check with the confidence interval around the observed score: CI = [58 - 1.96 × 10, 58 + 1.96 × 10] = [38.4, 77.6]; since 75 lies inside, a two-tailed test would not reject H0 at 0.05.
  • Step 6: Reflect on power and error types: one-tailed tests tend to have more power to detect a directional effect, but require a priori justification; always consider Type I error control (α) and the possibility of Type II errors (power) for your study design.
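
The whole walkthrough, Steps 1 through 5, fits in a few stdlib-only lines (a sketch mirroring the steps above; the variable names are ours):

```python
import math

# Step 1: null distribution parameters and the observed score.
mu0, sigma, x = 75, 10, 58

# Step 2: standardize the observed score.
z = (x - mu0) / sigma  # -1.7

# Step 3: tail probabilities via the standard normal CDF.
def phi(v):
    """Area under the standard normal curve to the left of v."""
    return 0.5 * (1.0 + math.erf(v / math.sqrt(2.0)))

p_one_tailed = phi(-abs(z))      # about 0.0446
p_two_tailed = 2 * p_one_tailed  # about 0.0891 (table-rounded: 0.0892)

# Step 4: decisions at alpha = 0.05.
reject_one_tailed = z <= -1.645    # True: -1.7 is beyond -1.645
reject_two_tailed = abs(z) > 1.96  # False: 1.7 is inside +/-1.96

# Step 5: CI cross-check around the observed score.
ci = (x - 1.96 * sigma, x + 1.96 * sigma)  # about (38.4, 77.6)
null_inside_ci = ci[0] <= mu0 <= ci[1]     # True, matching the two-tailed decision
```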

Summary Takeaway (In One Sentence)

  • You map an observed score to a z-score, read off probabilities from the z-table, choose one-tailed or two-tailed tests based on theory, set a significance level (commonly 0.05), and interpret results with consideration of confidence intervals and the inherent uncertainty and error rates in statistical inference.