Statistics Notes: Z-scores, Distributions, Hypothesis Testing, and the Toolbox

Z-scores and the standard normal distribution

  • Purpose: put different distributions on a common scale so we can compare a score across distributions and talk about proportions, probabilities, and tail areas.

  • Key idea: convert a raw score X to a standard score z using the formula
    \(z = \frac{X - \mu}{\sigma}\)
    where \(\mu\) is the mean and \(\sigma\) is the standard deviation of the distribution the score comes from.

  • Why this helps: with z-scores we can place Warren’s score in each distribution’s tail and compare how extreme it is relative to that distribution.

  • A larger z-score means the score lies farther into the tail and thus has a smaller tail probability (scores become rarer the farther out you go).

The Warren example: two distributions for two job-contexts

  • Distributions Warren could be evaluated against:

    • Academic distribution: mean \(\mu_1 = 27\), standard deviation \(\sigma_1 = 6.3\)

    • Consultant distribution: mean \(\mu_2 = 25\), standard deviation \(\sigma_2 = 8.7\)

  • Warren’s raw score: \(X = 34\) in both contexts.

  • Compute z-scores:

    • Academic: \(z_1 = \frac{34 - 27}{6.3} \approx 1.11\)

    • Consultant: \(z_2 = \frac{34 - 25}{8.7} \approx 1.03\)

  • Interpreting the z-scores: the larger the z, the farther into the tail; Warren is more extreme relative to the academic distribution (1.11 > 1.03).

  • Tail areas (probability of being at or beyond that z):

    • For \(z = 1.11\): \(P(Z > 1.11) \approx 0.1335\) → about 13.35% of people would score higher.

    • For \(z = 1.03\): \(P(Z > 1.03) \approx 0.1515\) → about 15.15% of people would score higher.

  • Conclusion from this approach: Warren sits slightly farther into the tail of the academic distribution than of the consultant distribution, so in terms of “superstar” potential (standing far out in the tail), the academic context shows the more extreme position.

  • Important caution from the instructor: do not average the distributions or mix them. Compare Warren within each distribution separately, then interpret comparatively.

  • Practical takeaway: the z-score lets you map a raw score to a standardized position, enabling direct comparison of probabilities across different scales.
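The Warren comparison above can be checked in a few lines of Python (standard library only; `math.erfc` stands in for the z-table, since \(P(Z > z) = \tfrac{1}{2}\,\mathrm{erfc}(z/\sqrt{2})\)):

```python
# Standardize Warren's score against each distribution and compare tail areas.
from math import erfc, sqrt

def z_score(x, mu, sigma):
    """Standard score: how many SDs the raw score lies from the mean."""
    return (x - mu) / sigma

def upper_tail(z):
    """P(Z > z) for the standard normal distribution."""
    return 0.5 * erfc(z / sqrt(2))

x = 34
z_academic = z_score(x, mu=27, sigma=6.3)    # ≈ 1.11
z_consultant = z_score(x, mu=25, sigma=8.7)  # ≈ 1.03

p_academic = upper_tail(z_academic)      # ≈ 0.133 (z-table at 1.11: 0.1335)
p_consultant = upper_tail(z_consultant)  # ≈ 0.150 (z-table at 1.03: 0.1515)
```

The small differences from the table values (0.1335, 0.1515) come from rounding z to two decimals before the table lookup; the ordering, and hence the conclusion, is the same.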

Steps and best practices demonstrated in the extraction process

  • First extract the key numbers:

    • X = 34, academic mean = 27, academic \(\sigma = 6.3\), consultant mean = 25, consultant \(\sigma = 8.7\).

  • Draw a picture (the instructor’s “draw a stupid picture” rule): there are two distributions to consider, so you visualize both relative to Warren’s score.

  • Identify the right question: given Warren’s score, what percentage of the population in each distribution would exceed that score? This is a tail-area question, not just a raw difference.

  • Avoid misreads: mixing means or misplacing the reference distribution leads to incorrect conclusions.

  • Use the z-table to convert z to a tail probability, then express as a percentage.

General process: from raw score to a percentile/percentage in a distribution

  • Step 1: identify the distribution (mean and SD) Warren’s score belongs to.

  • Step 2: compute the z-score: \(z = \frac{X - \mu}{\sigma}\).

  • Step 3: consult the z-table (standard normal) to find the area to the right of z (or to the left, depending on the question).

  • Step 4: convert the area to a probability and, if desired, to a percentage by multiplying by 100.

  • Step 5: compare across the relevant distributions; the one with the larger z (more extreme) corresponds to the smaller tail probability, i.e., the “superstar” position in that context.
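The five steps above can be sketched as a single helper (standard library only; the function name and `direction` parameter are my own, not from the lecture):

```python
# Steps 1-4 in one function: standardize, look up the normal tail area
# (erfc replaces the z-table), and convert to a percentage.
from math import erfc, sqrt

def tail_percentage(x, mu, sigma, direction="above"):
    """Map a raw score to the percentage of the distribution beyond it."""
    z = (x - mu) / sigma                 # Step 2: standardize
    upper = 0.5 * erfc(z / sqrt(2))      # Step 3: area to the right of z
    area = upper if direction == "above" else 1 - upper
    return z, 100 * area                 # Step 4: express as a percentage

# Step 5: repeat for each distribution, then compare.
z1, pct1 = tail_percentage(34, 27, 6.3)  # academic: larger z, smaller tail
z2, pct2 = tail_percentage(34, 25, 8.7)  # consultant
```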

Hypothesis testing: a separate example with Doctor Geek

  • Setup: Doctor Geek hypothesizes that the first customer’s score of \(x_1 = 10\) may come from a bot rather than a human.

  • Null hypothesis \(H_0\): the observed value comes from the human population (i.e., it is a typical human score).

  • Alternative hypothesis \(H_A\): the observed value does not come from a human (i.e., it is a bot). In practice, this is framed as “not a typical human value.”

  • Important nuance: hypothesis testing does not prove something is a bot; it tests whether the observed value is unlikely under the assumption that it comes from humans.

  • One-tailed vs two-tailed: because the theory is that bots would produce unusually low (or high) scores, you may use a one-tailed test in the left tail (extremely low values) if that is the direction of interest.

  • Significance level: typically \(\alpha = 0.05\).

  • Example outcome in the transcript: for the academic distribution, the p-value was reported as approximately \(0.0035\) (less than \(0.05\)); thus the null hypothesis is rejected in favor of the alternative that the value is not from humans (i.e., likely a bot).

  • Key interpretation: statistical tests support a theory but do not on their own prove it; theory and context matter (the cat vs. alien vs. bot example is used to illustrate how hypotheses are framed and tested).

  • Takeaway about hypothesis testing: focus on the null distribution, the directionality of the test, and the p-value relative to the chosen significance level.
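A sketch of the Doctor Geek test, assuming the human reference distribution is the academic one from earlier (\(\mu = 27\), \(\sigma = 6.3\)) and a left-tailed test, which reproduces the transcript’s p-value of about 0.0035:

```python
# Left-tailed z-test: is x = 10 unusually low under the human distribution?
from math import erfc, sqrt

def lower_tail_p(x, mu, sigma):
    """p-value for a left-tailed z-test: P(X <= x) under the null."""
    z = (x - mu) / sigma
    # P(Z <= z) = 0.5 * erfc(-z / sqrt(2))
    return 0.5 * erfc(-z / sqrt(2))

alpha = 0.05
p = lower_tail_p(10, mu=27, sigma=6.3)  # z ≈ -2.70, p ≈ 0.0035
reject_null = p < alpha                 # True: unlikely to be a typical
                                        # human value, so H0 is rejected
```

Note that rejecting \(H_0\) says the score is improbable for a human, not that the bot theory is proven; the theory supplies the direction of the test and the interpretation.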

The statistical toolbox: organizing tests by the question asked

  • The instructor emphasizes focusing on the question, not the test label:

    • What is the effect being tested? (the quantity or relationship of interest)

    • Is the question about means (differences between groups or from a known mean) or about relationships/predictors (correlations, regressions)?

  • Three broad families of tests:

    • Mean differences tests (two groups): tests differences between means when you have two groups.

    • More-than-two-groups tests: ANOVA (analysis of variance) for comparing more than two group means.

    • Relationships: correlation and regression for relationships between variables.

  • Other data types and tests:

    • Chi-square: analysis of frequency (count) data, typically in contingency tables; less central in the course’s focus on variance and distributional assumptions.

    • The emphasis is on variance as the core concept (not just counts): variance drives the interpretation of most tests, including regression and correlation.

  • Prototypical setup examples to visualize tests:

    • Independent two-group means: left vs right side of the room; shoe size as a continuous outcome; two independent samples.

    • Dependent means (paired): the same people measured before/after training; time 1 vs time 2; paired observations.

    • Regression and correlations: predicting a continuous outcome from one or more predictors; correlation is the bivariate case; regression extends to multivariate cases.

  • Summary mapping: if the question asks about means with groups, think t-tests or ANOVA; if it asks about relationships, think correlations and regression.

Two means tests: independent, dependent, and the single-sample variant

  • Single-sample t-test (the one-sample member of the mean-difference family): compare a sample mean to a known population mean \(\mu_0\).

    • Effect being tested: the difference between sample mean and a known population mean.

    • Formula (conceptual): \(t = \frac{\bar{X} - \mu_0}{s / \sqrt{n}}\)

  • Independent two-sample t-test (two independent groups): compare means from two independent samples.

    • Effect being tested: difference between two group means.

    • General idea: use a pooled or separate variance estimate to compute the t-statistic.

  • Dependent (paired) t-test: compare means from the same subjects measured in two conditions or times.

    • Effect being tested: mean difference within pairs (time 1 vs time 2, or before vs after).

    • Formula (conceptual): \(t = \frac{\bar{D}}{s_D / \sqrt{n}}\), where \(D_i\) are the differences within pairs and \(s_D\) is the standard deviation of the differences.

  • Prototypical language to use:

    • When you hear independent groups, think independent two-sample t-test.

    • When you hear the same people measured twice, think dependent (paired) t-test.

    • When you hear comparison to a known population mean, think single-sample t-test.
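The three variants can be sketched as small functions (standard library only; the sample data below are invented for illustration, not from the lecture):

```python
# Conceptual sketches of the three t statistics.
from math import sqrt
from statistics import mean, stdev

def t_single(sample, mu0):
    """Single-sample t: compare a sample mean to a known population mean."""
    n = len(sample)
    return (mean(sample) - mu0) / (stdev(sample) / sqrt(n))

def t_independent(a, b):
    """Independent two-sample t with a pooled variance estimate."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(sp2 * (1 / na + 1 / nb))

def t_paired(before, after):
    """Dependent (paired) t: a single-sample t on the within-pair differences."""
    diffs = [post - pre for pre, post in zip(before, after)]
    return t_single(diffs, 0)

# e.g. left vs right side of the room (two independent groups):
left = [8, 9, 10, 11, 12]
right = [10, 11, 12, 13, 14]
t_ind = t_independent(left, right)  # negative: left mean below right mean
```

Note how the paired test reduces to a single-sample test on the differences; that is the sense in which all three belong to one family.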

ANOVA, regression, and the big toolbox organization

  • ANOVA (analysis of variance): used when there are more than two groups to compare means.

    • It uses an F statistic and tests whether there is any difference among the group means.

  • Regression: a general and flexible framework that extends correlation to multiple predictors; can handle complex relationships and multivariate designs.

    • Conceptual view: regression models how the mean of a dependent variable changes with predictors; correlation is the bivariate special case.

  • Chi-square (for counts): used for frequency data in contingency tables; less central to the course’s focus on variance-based reasoning, but important for certain kinds of data (counts, categories).

  • Takeaway about choosing tests:

    • If your data are about means and there are more than two groups, think ANOVA rather than multiple t-tests.

    • If your question is about relationships between variables, start with correlations; for prediction and multivariate relationships, move to regression.
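To make the ANOVA idea concrete, here is a minimal F statistic computed directly from between-group and within-group variability (the three groups of scores are invented):

```python
# One-way ANOVA: F = between-group variance / within-group variance.
from statistics import mean

def anova_f(groups):
    """F statistic for comparing the means of several groups."""
    k = len(groups)                      # number of groups
    n = sum(len(g) for g in groups)      # total observations
    grand = mean([x for g in groups for x in g])
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    ms_between = ss_between / (k - 1)    # variance between group means
    ms_within = ss_within / (n - k)      # variance within groups
    return ms_between / ms_within

groups = [[1, 2, 3], [2, 3, 4], [6, 7, 8]]
F = anova_f(groups)  # large F: group means differ far more than noise predicts
```

This is the variance-as-core-concept point in miniature: the test is literally a ratio of two variance estimates.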

Correlation vs. regression: how they relate to variance

  • Correlation: measures the strength and direction of a relationship between two continuous variables.

    • Prototype: two continuous variables with a linear relationship; the correlation coefficient r captures the degree of linear association.

    • Limitation: limited to two variables in the basic form; does not imply causation.

  • Regression: extends correlation to predict one variable from one or more predictors; multivariate regression handles multiple predictors.

    • Key interpretation: regression examines how the dependent variable changes as the predictors vary, effectively modeling variance explained by predictors.

    • In the course context, regression is framed as a multivariate expansion of the correlation idea; it is the main tool going forward.

  • Core philosophy: statistics is about understanding and modeling variance, not just computing numbers. The null distribution, area under the curve, and effect sizes all rest on how variability is structured in the data.
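The covariance-over-SDs definition of \(r\), and the bivariate regression slope it implies (\(b = r \, s_Y / s_X\)), can be computed by hand (invented data; standard library only):

```python
# Pearson r from covariance and standard deviations, plus the regression slope.
from statistics import mean, stdev

def pearson_r(xs, ys):
    """r = cov(X, Y) / (s_X * s_Y), using sample statistics."""
    mx, my = mean(xs), mean(ys)
    n = len(xs)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    return cov / (stdev(xs) * stdev(ys))

def slope(xs, ys):
    """Bivariate regression slope: b = r * s_Y / s_X."""
    return pearson_r(xs, ys) * stdev(ys) / stdev(xs)

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = pearson_r(xs, ys)  # moderate positive linear association
b = slope(xs, ys)      # expected change in y per unit change in x
```

The slope is just the correlation rescaled by the two standard deviations, which is one concrete sense in which regression is the multivariate expansion of the correlation idea.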

Chi-square and other note-worthy tools

  • Chi-square test (brief): analyzes frequency data (counts) in categorical data, e.g., how many people prefer fruit A vs B by gender. It tests whether observed frequencies differ from expected frequencies under a null model.

  • The instructor’s stance: in many topics within this course and in the field, chi-square is not as central as variance-based methods (but still a valid tool when data are counts).
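A sketch of the chi-square computation on a 2×2 table echoing the fruit-preference-by-gender example (the counts are invented; the closed-form p-value uses the fact that a chi-square with one degree of freedom is a squared standard normal, so no chi-square table is needed):

```python
# Chi-square on a 2x2 contingency table, with expected counts from the margins.
from math import erfc, sqrt

def chi_square_2x2(table):
    """chi^2 = sum (O - E)^2 / E; p-value valid for df = 1 (2x2 tables) only."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    p = erfc(sqrt(chi2 / 2))  # df = 1: chi-square is a squared standard normal
    return chi2, p

table = [[30, 10],   # e.g. women: prefer fruit A vs fruit B
         [20, 20]]   # e.g. men:   prefer fruit A vs fruit B
chi2, p = chi_square_2x2(table)  # small p: preference depends on gender
```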

Study strategy and process-oriented tips from the lecture

  • Process over product: focus on the steps and the reasoning, not just getting the right numeric answer.

  • Always extract the key numbers clearly: write down X, the means, and standard deviations; keep the numbers in view to avoid transcription mistakes.

  • Draw a picture: a visual representation (distribution plots) helps prevent misinterpretation and keeps you honest about what is being measured.

  • Use the right tool for the right question: know your prototypes (two-group independent means, paired means, single-sample mean, regression, correlation) so you can pick the correct test quickly.

  • Don’t average across distinct distributions: when you have two different reference distributions, treat them separately rather than blending them into one.

  • Step-by-step workflow that the instructor emphasized:

    • From raw score to z-score: identify mean and SD of the target distribution, compute z, consult the z-table for tail areas.

    • Interpret what the tail area represents (the proportion of the population with scores more extreme in that direction).

    • Translate the tail area to a percentage when answering questions about “what percentage” or “how many.”

    • If there are multiple distributions, repeat for each and then compare the relevant tail areas.

  • The importance of showing your work: the instructor values the process (extract, compute, interpret); arithmetic mistakes cost a few points, but a sound, visible process preserves credit for understanding.

  • The role of theory in statistics: data analysis supports an argument about human behavior or phenomena (e.g., bots vs. humans); the statistics themselves do not prove a theory, they provide a probabilistic basis for judging likelihoods.

Course scaffolding and upcoming work mentioned

  • Barcodes course materials: chapters 2 and 3 (correlations) in the correlations folder; focus on what a correlation coefficient is, its boundary conditions, and what it is good for.

  • Emphasis on the standard normal framework: you’ll use z-scores and standardization to connect effect sizes with probabilities and p-values.

  • The plan to build a shared toolbox: organize tests by the question (means vs. relationships) rather than by test name alone, and practice moving fluidly between tests as needed.

  • The instructor’s emphasis on “variance” as the core focus of study: variability between groups, and how different tests quantify and interpret that variability.

Quick reference formulas (used or referenced in the transcript)

  • Z-score: \(z = \frac{X - \mu}{\sigma}\)

  • Single-sample t-test (conceptual): \(t = \frac{\bar{X} - \mu_0}{s / \sqrt{n}}\)

  • Independent two-sample t-test (conceptual): \(t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{s_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}\), where \(s_p^2\) is the pooled variance estimate.

  • Paired (dependent) t-test (conceptual): \(t = \frac{\bar{D}}{s_D / \sqrt{n}}\), where \(\bar{D}\) is the mean of the paired differences and \(s_D\) is the standard deviation of those differences.

  • Correlation coefficient: \(r = \frac{\operatorname{cov}(X,Y)}{\sigma_X \sigma_Y}\)

  • Chi-square test statistic (counts): \(\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}\)

Final reminders for exam preparation

  • Be comfortable translating a word problem into a concrete test: identify the distribution(s), the effect of interest, and the right test type.

  • Practice extracting the key numbers, showing each calculation, and interpreting what the results mean in context.

  • Ground your approach in a clear understanding of what the z-score represents and how tail areas translate into percentages or probabilities.

  • Remember the distinction between variance-focused analyses (means, ANOVA, regressions) and count-based analyses (chi-square): they answer different kinds of questions.

  • Keep a steady, process-oriented workflow: extract -> compute -> interpret -> decide which tool to use next.