Advanced Note-Taking: Hypothesis Testing, Error Types, Effect Sizes, and Statistical Power

Evaluation of P-value Interpretations

Conceptual Exercise: Assessing Theoretical Statements

* Statement 1: "The probability that the null hypothesis is true, given data as extreme or more extreme."

Verdict: Incorrect.

Explanation: This statement reverses the conditional. A p-value is the probability of the data (or more extreme data) assuming the null hypothesis is true; it is not the probability that the hypothesis itself is true.

* Statement 2: "The probability of the data as extreme or more extreme given the null."

Verdict: Correct.

Explanation: A p-value is a conditional probability. It measures the likelihood of observing data at a certain value or further from the mean (more extreme in either positive or negative directions), conditional on the assumption that the null hypothesis is true.

* Statement 3: "The probability the result is due to chance."

Verdict: Incorrect.

Explanation: This interpretation is commonly taught but lacks specificity. It does not define "chance" relative to a baseline (e.g., compared to the null being true vs. false). Random sampling variability is always at play, meaning error is always present; this statement fails to contextualize that error under the null hypothesis.

* Statement 4: "P-value is inverse to how important the effect size is; smaller p-values indicate more important effects."

Verdict: Incorrect.

Explanation: On its own, a p-value tells the researcher nothing about the practical importance or magnitude of an effect. It only indicates how rare an event is if the null hypothesis is true. It cannot describe a world where an effect actually exists.
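The reading in Statement 2 and the point made about Statement 4 can be illustrated numerically. The sketch below is not from the lecture: it assumes a z-test with a known population standard deviation, and the function name and the numbers (a 1-point mean difference, sigma of 10) are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_tailed_p(z: float) -> float:
    """P(|Z| >= |z|) for a standard normal Z, i.e. assuming the null is true."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical numbers: the same raw effect (1 point, sigma = 10)
# observed at two different sample sizes.
effect, sigma = 1.0, 10.0
for n in (25, 2500):
    z = effect / (sigma / sqrt(n))  # the standard error shrinks as n grows
    print(f"n={n:5d}  z={z:.2f}  p={two_tailed_p(z):.6f}")
```

The effect is identical in both rows, yet the p-value differs greatly, which is why a small p-value on its own says nothing about how large or important an effect is.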

The Logic and Procedure of Hypothesis Testing

Decision-Making Role of Statistics: Beyond describing data and reducing uncertainty, statistics assists in making binary decisions (e.g., "Is there a relationship or not?").
The Five-Step Model Recap:

1. Adopt a Null Hypothesis ( $H_0$ ): Assume nothing unusual is happening and no relationship exists between variables.

   * For Correlations: Assume there is no covariance.

   * For Group Differences: Assume the independent variable has no effect on the dependent variable.

2. Test Probability Statistically: Calculate the likelihood of the observed data assuming $H_0$ is true.

3. Compare to Alpha ( $\alpha$ ): If the probability (p-value) is lower than the threshold (alpha), the finding is "statistically significant."

4. Statistical Significance: This shorthand term specifically means the result is "rare" or "surprising" under the null assumption.

5. Failure to Reject: If the p-value is higher than alpha, we "fail to find evidence of a relationship." This does not mean the effect is "statistically insignificant"; it simply means the evidence was not surprising enough to reject the null. An effect might still exist, but the study failed to detect it.
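The five steps above can be sketched as a single function. This is a minimal illustration rather than the lecturer's procedure: it assumes a two-tailed one-sample z-test with a known population standard deviation, and the function name and all input values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def z_test_decision(sample_mean, null_mean, sigma, n, alpha=0.05):
    """Two-tailed one-sample z-test; returns (z, p, decision)."""
    # Step 1: adopt H0 (the population mean equals null_mean).
    # Step 2: probability of data this extreme, assuming H0 is true.
    z = (sample_mean - null_mean) / (sigma / sqrt(n))
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    # Steps 3-5: compare p with alpha, which was fixed in advance.
    decision = "reject H0" if p < alpha else "fail to reject H0"
    return z, p, decision


# Hypothetical values, chosen only for illustration.
print(z_test_decision(sample_mean=103.0, null_mean=100.0, sigma=15.0, n=36))
print(z_test_decision(sample_mean=106.0, null_mean=100.0, sigma=15.0, n=36))
```

Note that the function returns "fail to reject H0" rather than "accept H0", mirroring step 5: a non-significant result is an absence of surprising evidence, not proof of no effect.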

Inferential Errors: Type I and Type II

The Two Worlds of Reality: Nature hides the truth from us in an unobservable world where either the null hypothesis is true (no relationship) or false (a relationship exists). As researchers, we make one of two observable decisions: retain or reject the null.
Type I Error ( $\alpha$ ):

* Definition: Rejecting the null hypothesis when it is in fact true.

* Mechanism: Occurs when sampling error produces extreme data by chance, leading the researcher to claim a relationship exists when it does not (a "false alarm").

* Probability: When the null hypothesis is true, the probability of a Type I error equals the alpha level ( $\alpha$ ) set by the researcher.

* Control: To reduce Type I error, one should decrease alpha (e.g., moving from $0.05$ to $0.01$ ). This pushes the threshold for rejection further out from the mean (e.g., from $2$ standard errors to approximately $2.6$ standard errors).

* Probability of Correct Retention: In a world where the null is true, the probability of correctly retaining it is $1 - \alpha$ (e.g., $95\%$ if $\alpha = 0.05$ ).
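The link between alpha, the rejection threshold, and the false-alarm rate can be checked with a small simulation. This sketch assumes a two-tailed z-test in which each simulated study's test statistic is a standard normal draw under a true null; the function names, trial count, and seed are hypothetical choices.

```python
import random
from statistics import NormalDist


def critical_z(alpha: float) -> float:
    """Two-tailed rejection threshold, in standard-error units."""
    return NormalDist().inv_cdf(1 - alpha / 2)


def simulated_type1_rate(alpha: float, trials: int = 20000, seed: int = 1) -> float:
    """Share of simulated studies that reject H0 when H0 is actually true."""
    rng = random.Random(seed)
    cutoff = critical_z(alpha)
    # Under a true null, each study's z statistic is a draw from N(0, 1).
    rejections = sum(abs(rng.gauss(0, 1)) > cutoff for _ in range(trials))
    return rejections / trials


for alpha in (0.05, 0.01):
    print(alpha, round(critical_z(alpha), 2), simulated_type1_rate(alpha))
```

The thresholds come out near 1.96 and 2.58 standard errors, matching the rounded figures of 2 and 2.6 above, and the simulated false-alarm rates should land close to the chosen alpha.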

Type II Error ( $\beta$ ):

* Definition: Failing to reject the null hypothesis when it is in fact false.

* Mechanism: Occurs when there is a real effect in the population, but the observed sample data falls within the "regular" or non-extreme territory of the null distribution, often due to overlapping distributions.

* Probability: Represented by the Greek letter beta ( $\beta$ ). A conventional permissible level for beta is often set at $0.2$ ( $20\%$ ).

* Trade-off: Decreasing alpha (to be conservative against Type I errors) mathematically increases the likelihood of a Type II error, as it becomes harder to reject the null even when an effect is present.
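The trade-off can be made concrete with a worked equation. This is a sketch under assumptions not stated in the lecture: a two-tailed z-test, with $\delta$ denoting the distance (in standard errors) between the null value and the true population value, $\Phi$ the standard normal cumulative distribution function, and $z_{1-\alpha/2}$ the rejection threshold.

$$\beta = \Phi\left(z_{1-\alpha/2} - \delta\right) - \Phi\left(-z_{1-\alpha/2} - \delta\right)$$

For an illustrative $\delta = 2.5$ the second term is negligible, so:

$$\alpha = 0.05:\quad \beta \approx \Phi(1.96 - 2.5) = \Phi(-0.54) \approx 0.29$$

$$\alpha = 0.01:\quad \beta \approx \Phi(2.58 - 2.5) = \Phi(0.08) \approx 0.53$$

With the effect held fixed, tightening alpha from $0.05$ to $0.01$ raises the Type II error rate from roughly $29\%$ to roughly $53\%$ in this example.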

Statistical Power

Definition: The ability to correctly reject the null hypothesis when it is false (detecting a real effect).
Formula: $\text{Power} = 1 - \beta$
Maximizing Power: Power is the probability of finding an effect that is actually there. It can be thought of as using a larger "magnifying glass" in research.
Factors Influencing Power:

1. Alpha Level ( $\alpha$ ): Increasing alpha (e.g., from $0.01$ to $0.05$ ) increases power because the region of rejection is larger.

2. Sample Size ( $N$ ): Larger sample sizes narrow the sampling distributions by decreasing the standard error. Narrower distributions overlap less, making it easier to distinguish between different populations.

3. Effect Size: Larger underlying effects in nature are easier to detect. While researchers usually cannot control the effect size of the phenomenon they study, choosing questions where large effects are expected increases power.

4. Measurement Quality: Better measurement reduces noise and improves detection.
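How alpha, sample size, and effect size move power can be sketched numerically. The code below is an illustration under stated assumptions, not a substitute for a full power analysis: it assumes a two-tailed one-sample z-test with the effect expressed as a standardized difference `d`, and both function names are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power_one_sample_z(d: float, n: int, alpha: float = 0.05) -> float:
    """Power of a two-tailed one-sample z-test for a true standardized effect d."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = d * sqrt(n)  # where the z statistic is centred when the effect is real
    beta = STD_NORMAL.cdf(z_crit - shift) - STD_NORMAL.cdf(-z_crit - shift)
    return 1 - beta


def n_for_power(d: float, target: float = 0.80, alpha: float = 0.05) -> int:
    """Approximate n needed for the target power (ignores the far rejection tail)."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = STD_NORMAL.inv_cdf(target)
    return ceil(((z_crit + z_power) / d) ** 2)


for d in (0.2, 0.5, 0.8):
    row = [round(power_one_sample_z(d, n), 2) for n in (10, 30, 100)]
    print(f"d={d}: power at n=10, 30, 100 -> {row}; n for 80% power: {n_for_power(d)}")
```

Power rises along each row as `n` grows and down the rows as `d` grows; rerunning with `alpha=0.01` lowers every value, which is the alpha factor listed first above.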

Effect Sizes: Measuring Magnitude

Definition of Effect Size: A quantified description of the strength of the relationship between variables. It answers "how much" or "how big," whereas significance testing only answers "is it rare?"
Raw Effect Sizes: These are expressed in the original units of measurement.

* Example: In a knowledge test out of $20$ points, an improvement of $3.29$ points is a raw effect size. This can also be expressed as a percentage (e.g., about $16\%$ of the $20$ available points).

Standardized Effect Sizes: These are unitless metrics that allow comparison across studies using different scales.

* Correlation Coefficient ( $r$ ): Measures the magnitude of the relationship between two quantitative variables.

  * Rules of Thumb: Small ( $0.1$ ), Medium ( $0.3$ ), Large ( $0.5$ ).

* Cohen's d ( $d$ ): Measures the difference between two group means, expressed in standard deviation units.

  * Formula Context: It is the number of standard deviations one mean is from another.

  * Pooling Standard Deviations: If groups have different standard deviations, researchers may use the smaller one or use a "pooled standard deviation" (an average weighted by sample size and sum of squares).

  * Rules of Thumb: Small ( $0.2$ ), Medium ( $0.5$ ), Large ( $0.8$ ).

* Visual Threshold: A medium effect ( $r = 0.3$ or $d = 0.5$ ) is usually the threshold where a relationship becomes visually identifiable in a scatter plot without the aid of best-fit lines.
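Both standardized effect sizes can be computed by hand, as sketched below. The pooled standard deviation here uses the common sums-of-squares form, which is an assumption about what the lecture intends; the data and function names are hypothetical, and the rule-of-thumb cut-offs are passed in by the caller.

```python
from math import sqrt
from statistics import mean, stdev


def cohens_d(group_a, group_b):
    """Difference in means divided by the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    # Sums of squares recovered from each group's sample variance.
    ss_a = (n_a - 1) * stdev(group_a) ** 2
    ss_b = (n_b - 1) * stdev(group_b) ** 2
    pooled_sd = sqrt((ss_a + ss_b) / (n_a + n_b - 2))
    return (mean(group_a) - mean(group_b)) / pooled_sd


def pearson_r(x, y):
    """Covariance of x and y scaled by both standard deviations."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (stdev(x) * stdev(y))


def size_label(value, small, medium, large):
    """Rule-of-thumb label; the cut-offs are supplied by the caller."""
    magnitude = abs(value)
    if magnitude >= large:
        return "large"
    if magnitude >= medium:
        return "medium"
    if magnitude >= small:
        return "small"
    return "below small"


# Hypothetical scores for two small groups.
treated = [14, 16, 15, 17, 18]
control = [13, 14, 15, 13, 15]
d = cohens_d(treated, control)
print(round(d, 2), size_label(d, 0.2, 0.5, 0.8))
```

The labels are conventions only; whether a given `d` or `r` matters in practice still depends on context.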

Practical vs. Statistical Significance

Statistical Significance: Only indicates that the data is rare (unlikely) in the context of the null hypothesis.
Practical Significance: Involves judgment about the real-world value of an effect.

* Contextual Example (Sunscreen): Sunscreen use may have a "tiny" effect size on individual risk, but at a population health level, it saves thousands of lives, making it practically very significant.

* Contextual Example (Industry): The binary logic of hypothesis testing originated in industry (e.g., breweries/distilleries) where a black-and-white decision was required: "Is this batch of beer good or should we throw it out?"

Confidence Intervals (CIs)

Definition: A range of plausible values for a population parameter based on a sample.
CIs vs. Hypothesis Testing: CIs are considered more nuanced and powerful because they provide both the effect size (the point estimate) and the precision of that estimate (the width of the interval).
Using CIs for Hypothesis Testing:

* Retain Null: If the $95\%$ CI includes the null value ( $0$ for both mean differences and correlations), the result is not statistically significant at $\alpha = 0.05$.

* Reject Null: If the $95\%$ CI excludes the null value, the result is statistically significant at $\alpha = 0.05$.
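The retain/reject rule can be sketched in a few lines. This example assumes a normal-based interval (estimate plus or minus a critical z times the standard error); small samples would usually call for a t-based interval instead. The function names and the input numbers are hypothetical.

```python
from statistics import NormalDist


def normal_ci(estimate: float, std_error: float, level: float = 0.95):
    """Estimate plus or minus the critical z times the standard error."""
    z_crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return estimate - z_crit * std_error, estimate + z_crit * std_error


def excludes_null(ci, null_value: float = 0.0) -> bool:
    """True when the null value lies outside the interval (reject H0)."""
    low, high = ci
    return not (low <= null_value <= high)


# Hypothetical mean differences with two different standard errors.
for diff, se in ((2.0, 0.8), (2.0, 1.5)):
    low, high = normal_ci(diff, se)
    verdict = "reject H0" if excludes_null((low, high)) else "retain H0"
    print(f"diff={diff}, SE={se}: 95% CI [{low:.2f}, {high:.2f}] -> {verdict}")
```

The point estimate is the same in both cases; only the width of the interval changes, which is the extra information a CI carries beyond a bare significance decision.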

Questions & Discussion

Participant Question regarding adjusting alpha: "Is it sort of cheating to adjust alpha once we have our data?"

* Response: Yes. Alpha should be set before collecting or analyzing data. This is part of a "power analysis" where parameters are set ahead of time to maintain the integrity of error control.

Participant Question regarding group variances: "If you've got two groups, don't they have different standard deviations?"

* Response: Yes, they often do. Researchers handle this by pooling standard deviations or, in simpler cases, averaging them if sample sizes are similar. Standard software like Jamovi handles the complex math of pooled sums of squares to arrive at a single standard deviation for the Cohen's d calculation.

Discussion of Dr. Singh's History Test Example:

* Tutorial group ( $n = 8$ ) Mean score: $16.7$

* Population Mean score: $13.4$

* Difference: $3.29$ points higher (Raw Effect Size).

* z-score: $4.23$

* Effect size reported: Cohen's $d = 1.5$ (Standardized Effect Size). This was characterized as a "substantially improved" score, not just because the p-value was tiny, but because the magnitude ( $d = 1.5$ ) is exceptionally large in a research context.
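The reported figures in this example can be cross-checked against each other. The sketch below assumes the z-score came from a one-sample z-test, so that z equals the raw effect divided by sigma over the square root of n; under that assumption, d equals z divided by the square root of n. The population standard deviation is not given in the notes, so the value computed here is back-calculated rather than reported.

```python
from math import sqrt
from statistics import NormalDist

n = 8
raw_effect = 3.29  # points, as reported
z = 4.23           # as reported

# Assumption: z = raw_effect / (sigma / sqrt(n)), so d = raw_effect / sigma = z / sqrt(n).
d = z / sqrt(n)
implied_sigma = raw_effect / d  # back-calculated, not stated in the notes
p_two_tailed = 2 * (1 - NormalDist().cdf(z))

print(f"d = {d:.2f}")
print(f"implied sigma = {implied_sigma:.2f} points")
print(f"two-tailed p = {p_two_tailed:.6f}")
```

Under these assumptions the recovered d rounds to the reported $1.5$, so the z-score, sample size, and effect size are mutually consistent.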