Exhaustive Guide to Inferential Statistics, Experimental Designs, and External Validity

Inferential Statistics

Definition and Big Picture: Inferential statistics are used to make conclusions about a population based on data collected from a sample.
Conceptual Framework: It represents a "Sample → Population leap," where researchers infer that characteristics found in the sample apply to the broader group.
Test Example: A researcher finds that students who sleep more score higher on exams ( $p = .03$ ). A relationship this strong would be unlikely if sleep and exam scores were unrelated in the population; therefore, we infer that the finding likely reflects the population.
Exam Traps to Avoid: * Proving Truth: Statistics do not "prove" a hypothesis is true; they only support it. * Descriptive vs. Inferential: Descriptive statistics apply only to the sample; inferential statistics attempt to bridge the gap to the population. * Causality: Do not confuse correlation with causation based solely on inferential statistics.
Memory Tip: Think of it as the "Sample → Population leap."

Null vs. Research Hypotheses

The Null Hypothesis ( $H_0$ ): Postulates that there is no effect, no difference, or no relationship between variables. It is the "boring" explanation.
The Research (Alternative) Hypothesis ( $H_1$ or $H_a$ ): Postulates that there is an effect, a difference, or a relationship between variables.
Test Example (Caffeine and Memory): * $H_0$ : Caffeine has no effect on memory. * $H_1$ : Caffeine improves memory.
Exam Traps to Avoid: * Accepting the Research Hypothesis: You never "accept" $H_1$ . You either reject the null ( $H_0$ ) or fail to reject the null ( $H_0$ ). * Mixing Hypotheses: Remember that the null hypothesis is always the one stating no difference.
Memory Tip: Null = Nothing.

Probability and Statistical Significance

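p-value: The probability of obtaining results at least as extreme as those observed, assuming the null hypothesis ( $H_0$ ) is true.
Common Alpha Cutoff: The conventional alpha level is $\alpha = .05$ ; a result is called statistically significant when $p < .05$ .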
Test Example ( $p = .02$ ): Because $p = .02$ is less than the cutoff of $.05$ , the researcher should reject the null hypothesis ( $H_0$ ), concluding the result is statistically significant.
Exam Traps to Avoid: * Misinterpreting the p-value: A $p = .02$ does NOT mean there is a $2\%$ chance the hypothesis is wrong. It refers to the probability of the data occurring if the null is true. * Significance vs. Importance: Statistical significance does not always equate to practical importance; a tiny, unimportant effect can still be significant.
Memory Tip: $p = \text{probability of data this extreme if } H_0 \text{ is true}$ .
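One way to see what "probability of the data if the null is true" means is a permutation test. The sketch below is an illustration, not a method this guide prescribes: the exam scores are hypothetical, and `permutation_p_value` is a helper name invented here. It shuffles the group labels many times (which is what the world looks like if $H_0$ is true) and counts how often chance alone produces a difference at least as large as the observed one.

```python
import random
from statistics import mean

def permutation_p_value(group_a, group_b, n_shuffles=10_000, seed=0):
    """Two-tailed permutation p-value for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_shuffles):
        # Under H0 the group labels are arbitrary, so shuffle them.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_shuffles

# Hypothetical exam scores (made up for illustration).
more_sleep = [82, 88, 91, 79, 85, 90, 87, 84]
less_sleep = [75, 80, 78, 72, 83, 77, 74, 79]

p = permutation_p_value(more_sleep, less_sleep)
print(f"p = {p:.4f}")
print("reject H0" if p < 0.05 else "fail to reject H0")
```

Note that the function never computes "the chance the hypothesis is wrong"; it only asks how often shuffled (null) data look as extreme as the real data.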

The t-Test (Difference Between Two Groups)

Usage: Used specifically when comparing the means of exactly two groups.
Test Example: Comparing the test scores of students who studied versus those who did not.
Types of t-tests: * Independent t-test: Used for different groups of participants. * Paired t-test: Used for the same participants measured twice.
Directional Tailing: * One-tailed test: Predicts a specific direction (e.g., "Group A will score higher than Group B"). * Two-tailed test: Predicts a difference exists but does not specify the direction (e.g., "Groups will differ").
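Exam Traps to Avoid: * Wait-and-See Approach: You cannot choose a one-tailed test after seeing the results. * Ease of Significance: A one-tailed test makes it easier to reach statistical significance in the predicted direction, but it is a stricter commitment: an effect in the opposite direction cannot be declared significant, no matter how large.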
Memory Tip: One tail = one direction; Two tail = difference either way.
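The two types of t-test differ in how the standard error is built. The following is a minimal sketch using only Python's standard library; the function names and all scores are hypothetical, and the independent version assumes equal variances (the classic pooled-variance Student's t).

```python
from math import sqrt
from statistics import mean, stdev, variance

def independent_t(a, b):
    """Pooled-variance t for two different groups. Returns (t, df)."""
    n_a, n_b = len(a), len(b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / df
    se = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(a) - mean(b)) / se, df

def paired_t(first, second):
    """t for the same participants measured twice. Returns (t, df)."""
    diffs = [y - x for x, y in zip(first, second)]
    se = stdev(diffs) / sqrt(len(diffs))
    return mean(diffs) / se, len(diffs) - 1

# Hypothetical test scores: students who studied vs. those who did not.
studied = [78, 85, 90, 72, 88]
did_not = [70, 75, 80, 65, 78]
t_ind, df_ind = independent_t(studied, did_not)
print(f"independent: t({df_ind}) = {t_ind:.2f}")

# Hypothetical stress ratings for the same five people, before and after.
before = [20, 25, 30, 22, 28]
after = [18, 22, 27, 21, 24]
t_pair, df_pair = paired_t(before, after)
print(f"paired: t({df_pair}) = {t_pair:.2f}")
```

With these made-up scores the independent t is about 2.09 on 8 degrees of freedom. The one-tailed critical value at $.05$ for 8 degrees of freedom is 1.860 and the two-tailed value is 2.306, so this result would be significant only under a one-tailed prediction, which is exactly why the direction must be chosen before looking at the data.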

The F-Test (ANOVA – 3+ Groups or Multiple Variables)

Usage: Used for comparing three or more groups or handling multiple independent variables.
Key Concept: ANOVA compares variance rather than just means. The ratio is defined as: * $F = \frac{\text{Systematic Variance}}{\text{Error Variance}}$ * Systematic Variance: Represents the real effect (e.g., the independent variable [IV] is working). * Error Variance: Represents random noise or fluctuation.
Test Example: Comparing the effectiveness of three different teaching methods.
Exam Traps to Avoid: * Specific Group Differences: The F-test itself does NOT tell you which specific groups differ; it only indicates that at least one difference exists. * Post Hoc Tests: If the F-test is significant, you must perform follow-up (post hoc) tests to identify which groups are different.
Memory Tip: $F = \text{"Factor effect vs. Fluctuation (noise)"}$ .
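The ratio above can be computed by hand for a one-way design. This is a sketch under simple assumptions (one independent variable, independent groups); `one_way_f` is a name invented here and the scores are hypothetical. "Systematic variance" is estimated by the between-groups mean square and "error variance" by the within-groups mean square.

```python
from statistics import mean

def one_way_f(*groups):
    """F ratio for a one-way ANOVA: between-groups MS / within-groups MS."""
    all_scores = [x for g in groups for x in g]
    grand_mean = mean(all_scores)
    group_means = [mean(g) for g in groups]
    k = len(groups)            # number of groups
    n_total = len(all_scores)  # total number of scores

    # Systematic variance: how far each group mean sits from the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Error variance: how far scores sit from their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within

# Hypothetical scores under three teaching methods.
method_a = [1, 2, 3]
method_b = [2, 3, 4]
method_c = [3, 4, 5]
f_ratio = one_way_f(method_a, method_b, method_c)
print(f"F = {f_ratio:.2f}")  # F = 3.00
```

The single number returned says nothing about which of the three methods differ, which is why post hoc tests follow a significant F.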

Confidence Intervals (CI)

Definition: A range of values within which the true population value likely falls.
Test Example: If a mean is $75$ and the $95\%$ confidence interval is $[70, 80]$ , we are $95\%$ confident that the real population mean resides between $70$ and $80$ .
Exam Traps to Avoid: * Data Distribution Myth: The interval does NOT mean $95\%$ of all data points fall within that range. * Single Interval Probability: It is technically misleading to say there is a $95\%$ chance the true mean is in "this specific" interval. The $95\%$ describes the procedure: across many repeated samples, about $95\%$ of intervals built this way would contain the true mean.
Memory Tip: CI = "Where truth probably lives."
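The "long-run" meaning of a confidence level can be checked by simulation. The sketch below makes simplifying assumptions that are not part of this guide: scores are normally distributed, the population standard deviation is known (so a z-based interval is used rather than the t-based interval typical in practice), and all parameter values are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, mean

def z_interval(sample, sigma, confidence=0.95):
    """Confidence interval for a mean when the population SD is known."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    margin = z * sigma / sqrt(len(sample))
    center = mean(sample)
    return center - margin, center + margin

def coverage(true_mean=75, sigma=10, n=25, n_studies=2000, seed=1):
    """Fraction of simulated studies whose interval contains the true mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        low, high = z_interval(sample, sigma)
        if low <= true_mean <= high:
            hits += 1
    return hits / n_studies

rate = coverage()
print(f"Intervals containing the true mean: {rate:.1%}")
```

Each simulated study gets a different interval, and roughly 95% of them capture the true mean. Any single interval either contains it or does not.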

Common Combo Scenarios and Advanced Questions

Scenario 1 ( $p = .04$ ): Since it is less than $.05$ , you must reject the null hypothesis ( $H_0$ ).
Scenario 2 (Predicting "A > B"): This requires a one-tailed t-test.
Scenario 3 (Comparing 4 groups): This requires an F-test (ANOVA).
Scenario 4 (CI does not include 0): If the $95\%$ confidence interval for the difference between groups does not include $0$ , the difference is statistically significant at the $.05$ level (two-tailed).

Single-Case Experimental Designs

Definition: A study in which a single person (or a very small number of people) is measured repeatedly over time, typically across phases (e.g., baseline vs. treatment).
Test Example: A therapist measures a patient’s anxiety levels for $2$ weeks (baseline), then introduces a specific therapy and continues measurement.
Phase Structure: Often referred to as an A-B design (Baseline - Treatment).
Why Use It?: * Studying rare conditions. * Managing ethical or practical limitations. * Capturing detailed individual patterns.
Exam Traps to Avoid: * Frequency of Measurement: It is NOT a single measurement; it is repeated measures. * The Baseline as Control: The baseline phase acts as the comparison (control). * Generalizability: These designs have low external validity (hard to generalize to the whole population).
Memory Tip: Single-case = "one person, many measurements."

One-Group Experimental Designs

One-Group Posttest-Only Design: * One group → treatment → measure outcome (no pre-treatment data). * Example: Students use a new educational app, then take a test without recording prior scores. * Weakness: You cannot measure change or rule out alternative explanations.
One-Group Pretest–Posttest Design: * Measure (pretest) → treatment → measure again (posttest). * Example: Measure stress → give meditation training → measure stress again. * Issue: Highly susceptible to internal validity threats.

Threats to Internal Validity: "Hi MaTRI"

Professors often build exam traps around these threats:

History: An external event happens between measurements (e.g., a massive tutoring program starts while you are testing a reading app).
Maturation: Participants naturally change or grow over time (e.g., children's reading naturally improves as they age).
Testing: The act of taking the first test changes performance on subsequent tests due to practice effects.
Regression Toward the Mean: If participants are selected for extreme scores (very high or very low), their scores tend to move closer to the average on retest, even without any treatment.
Instrument Decay: The measurement tool or observer becomes less reliable over time (e.g., a tired observer records behavior less accurately).

Memory Tip: "Hi MaTRI" (History, Maturation, Testing, Regression, Instrument decay).
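Regression toward the mean is the least intuitive of these threats, so a small simulation may help. This is an illustrative sketch with made-up parameters: each observed score is modeled as a stable true ability plus random measurement noise, and nobody receives any treatment between the two tests.

```python
import random
from statistics import mean

def simulate_regression(n=5000, seed=42):
    """Select top scorers on test 1, then see how they score on test 2."""
    rng = random.Random(seed)
    true_ability = [rng.gauss(100, 10) for _ in range(n)]
    # Observed score = true ability + independent noise on each occasion.
    test1 = [a + rng.gauss(0, 10) for a in true_ability]
    test2 = [a + rng.gauss(0, 10) for a in true_ability]

    # Pick the top 10% on test 1 (an "extreme scores" group).
    cutoff = sorted(test1)[int(0.9 * n)]
    top = [i for i, score in enumerate(test1) if score >= cutoff]

    top_mean_test1 = mean([test1[i] for i in top])
    top_mean_test2 = mean([test2[i] for i in top])
    overall_mean = mean(test2)
    return top_mean_test1, top_mean_test2, overall_mean

first, second, overall = simulate_regression()
print(f"Top group on test 1: {first:.1f}")
print(f"Same people on test 2: {second:.1f}")
print(f"Overall average: {overall:.1f}")
```

The selected group still scores above average on the second test, but noticeably closer to it. In a one-group pretest-posttest study that chose participants for extreme pretest scores, this drift could be mistaken for a treatment effect.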

Nonequivalent Control Group Designs

Nonequivalent Control Group Design: * Two groups are compared, but they are NOT randomly assigned. * Example: One school uses a new method while another school uses the old one. * Key Issue: Groups may differ significantly before the study even begins (selection bias).
Nonequivalent Control Group Pretest–Posttest: * Both groups are measured before and after the treatment. * Benefit: Stronger because you can compare the amount of change between groups. * Weakness: Still lacks random assignment, so individual selection differences remain.
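The "compare the amount of change" logic is simple arithmetic. The scores below are hypothetical and `mean_change` is a helper name invented for this sketch; it shows the comparison only and does not test significance.

```python
from statistics import mean

def mean_change(pretest, posttest):
    """Average posttest score minus average pretest score."""
    return mean(posttest) - mean(pretest)

# Hypothetical reading scores: one school gets the new method, one does not.
treatment_pre, treatment_post = [60, 65, 70], [70, 75, 80]
control_pre, control_post = [50, 55, 60], [53, 58, 63]

treatment_gain = mean_change(treatment_pre, treatment_post)  # 10
control_gain = mean_change(control_pre, control_post)        # 3
extra_gain = treatment_gain - control_gain                   # 7

print(f"Treatment gain: {treatment_gain}")
print(f"Control gain: {control_gain}")
print(f"Gain beyond control: {extra_gain}")
```

The pretest means (65 vs. 55) also expose that the groups started at different levels, which is the selection problem that remains without random assignment.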

Factorial Designs: The Big Idea

Definition: A study involving two or more independent variables (IVs), where each variable has multiple levels.
Notation: A $2 \times 3$ design means: * $2$ levels of IV1. * $3$ levels of IV2. * Total conditions = $2 \times 3 = 6$ .
Test Example: Studying sleep ( $4$ hrs vs. $8$ hrs) and caffeine (yes vs. no). This is a $2 \times 2$ design with $4$ conditions.
Purpose: To study multiple variables at once and identify interaction effects.
Memory Tip: Factorial = "factors (IVs) combined."
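The notation rule (multiply the levels to get the number of conditions) can be checked by listing every combination. In this sketch the three caffeine doses are hypothetical levels added only to illustrate a $2 \times 3$ design.

```python
from itertools import product

def conditions(*factors):
    """Every combination of levels across the independent variables."""
    return list(product(*factors))

sleep = ["4 hrs", "8 hrs"]
caffeine_yes_no = ["no caffeine", "caffeine"]
caffeine_dose = ["none", "low", "high"]  # hypothetical three-level version

two_by_two = conditions(sleep, caffeine_yes_no)
two_by_three = conditions(sleep, caffeine_dose)

print(len(two_by_two))    # 4
print(len(two_by_three))  # 6
for condition in two_by_three:
    print(condition)
```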

Factorial Effects: Main vs. Interaction

Main Effects: The overall effect of a single IV while ignoring (averaging across) all other variables. * Example: Does caffeine improve performance overall, regardless of sleep duration?
Interaction Effects: Occur when the effect of one IV depends on the specific level of another IV. * Example: Caffeine helps memory ONLY when sleep is low; it has no effect when sleep is high. * Language Cues: Watch for phrases like "depends on," "only when," or "different pattern."
Simple Main Effects: A follow-up analysis performed only when an interaction is found. It "zooms in" on one IV at a specific level of another (e.g., checking the effect of caffeine at exactly $4$ hours of sleep).
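Main effects, simple main effects, and the interaction can all be read off a table of cell means. The memory scores below are hypothetical, chosen to match the "caffeine helps only when sleep is low" pattern. This sketch is descriptive arithmetic only; deciding whether the effects are statistically significant would still require a factorial ANOVA.

```python
# Hypothetical mean memory scores for each cell of a 2 x 2 design.
cell_means = {
    ("4 hrs", "no caffeine"): 60,
    ("4 hrs", "caffeine"): 75,
    ("8 hrs", "no caffeine"): 80,
    ("8 hrs", "caffeine"): 80,
}

def average(values):
    return sum(values) / len(values)

# Main effect of caffeine: average across both sleep levels.
with_caffeine = average([cell_means[("4 hrs", "caffeine")],
                         cell_means[("8 hrs", "caffeine")]])
without_caffeine = average([cell_means[("4 hrs", "no caffeine")],
                            cell_means[("8 hrs", "no caffeine")]])
main_effect_caffeine = with_caffeine - without_caffeine  # 7.5

# Main effect of sleep: average across both caffeine levels.
long_sleep = average([cell_means[("8 hrs", "no caffeine")],
                      cell_means[("8 hrs", "caffeine")]])
short_sleep = average([cell_means[("4 hrs", "no caffeine")],
                       cell_means[("4 hrs", "caffeine")]])
main_effect_sleep = long_sleep - short_sleep  # 12.5

# Simple main effects: the caffeine effect at each sleep level separately.
caffeine_at_4 = cell_means[("4 hrs", "caffeine")] - cell_means[("4 hrs", "no caffeine")]  # 15
caffeine_at_8 = cell_means[("8 hrs", "caffeine")] - cell_means[("8 hrs", "no caffeine")]  # 0

# Interaction: does the caffeine effect change across sleep levels?
interaction = caffeine_at_4 - caffeine_at_8  # 15

print(main_effect_caffeine, main_effect_sleep)
print(caffeine_at_4, caffeine_at_8, interaction)
```

A nonzero difference between the two simple effects is the numeric signature of "it depends." Here the overall caffeine main effect (7.5) is misleading on its own, because the benefit comes entirely from the 4-hour group.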

Types of Factorial Assignment Designs

Independent Groups Design: Different participants are assigned to each condition (e.g., Group 1 gets caffeine, Group 2 does not).
Repeated Measures Design: The same participants experience all conditions (e.g., everyone is tested with and without caffeine). * Risks: Order effects, fatigue, and carryover effects.
Mixed Factorial Design: A combination where one IV is between-subjects (independent groups) and one IV is within-subjects (repeated measures). * Example: All participants are tested with/without caffeine (repeated), but participants are divided into sleep groups ( $4$ hrs vs. $8$ hrs) using different individuals (independent).

External Validity and Generalization

Definition: How well results generalize outside the specific study to other people, settings, and times.
Generalizing Across People: * Sex and Gender: Results from men may not apply to women or across gender identities. * Race and Ethnicity: Cultural or social differences can affect responses. * Culture: Findings may differ between individualistic and collectivist societies.
Sampling Issues: * College Students: Often young, educated, and not representative of the general public. * Volunteers: May be more motivated or have different traits than non-volunteers. * Online Samples: Broader reach but suffer from self-selection bias.
WEIRD Samples: The acronym for samples that are Western, Educated, Industrialized, Rich, and Democratic.
Nonhuman Animal Research: Findings in rats (e.g., learning a maze) can provide clues, but biological differences limit how far they generalize to humans.

Research Settings and Replication

Laboratory vs. Real World: Lab settings can be artificial and may not reflect behavior in real life.
Pretest Effects: Taking a pretest can make participants more self-aware or reactive, changing their responses.
Researcher Characteristics: The experimenter’s personal style (friendly vs. strict) can influence participant behavior.
Replication: * Exact Replication: Repeating the study using the exact same procedures as the original. * Conceptual Replication: Testing the same underlying idea or hypothesis using a different method or task.

Literature Reviews vs. Meta-Analysis

Literature Review: A qualitative summary of existing research describing trends (uses words).
Meta-Analysis: A quantitative statistical combination of results across many studies to determine an overall effect size (uses numbers).
Memory Tip: Meta = Math.
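To make "Meta = Math" concrete, here is a minimal sketch of one common way to combine studies: an inverse-variance weighted (fixed-effect) average, in which more precise studies count more. This is an assumption for illustration; the guide does not say which combination method is expected, and the effect sizes and variances below are made up.

```python
from math import sqrt

def fixed_effect(effects, variances):
    """Inverse-variance weighted average effect size and its standard error."""
    weights = [1 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    standard_error = sqrt(1 / sum(weights))
    return pooled, standard_error

# Hypothetical effect sizes from three studies and their sampling variances.
effects = [0.20, 0.50, 0.35]
variances = [0.01, 0.04, 0.02]

pooled_effect, pooled_se = fixed_effect(effects, variances)
print(f"Overall effect size: {pooled_effect:.3f} (SE = {pooled_se:.3f})")
```

A literature review would describe these three studies in words; the meta-analysis reduces them to a single weighted number.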

High-Yield Summary of Exam Traps

Statistically Significant is not synonymous with Important.
Rejecting the Null ( $H_0$ ) is not synonymous with Proving the Research Hypothesis ( $H_1$ ).
The p-value is the probability of data at least this extreme under the null, not the probability that the hypothesis is true.
The F-test (ANOVA) tells you something is different, not specifically which groups are different.
Large sample sizes do not guarantee a representative sample.
One-tailed predictions must be established before the data is collected.
External Validity relates to generalizability, whereas Internal Validity relates to cause-and-effect certainty.