Exhaustive Guide to Inferential Statistics, Experimental Designs, and External Validity

Inferential Statistics

  • Definition and Big Picture: Inferential statistics are used to draw conclusions about a population based on data collected from a sample.

  • Conceptual Framework: It represents a "Sample → Population leap," where researchers infer that characteristics found in the sample apply to the broader group.

  • Test Example: A researcher finds that students who sleep more score higher on exams (p = .03). A result this strong would be unlikely if there were no relationship in the population, so we infer that the relationship likely holds in the population.

  • Exam Traps to Avoid:
     * Proving Truth: Statistics do not "prove" a hypothesis is true; they only support it.
     * Descriptive vs. Inferential: Descriptive statistics apply only to the sample; inferential statistics attempt to bridge the gap to the population.
     * Causality: Do not confuse correlation with causation based solely on inferential statistics.

  • Memory Tip: Think of it as the "Sample → Population leap."

Null vs. Research Hypotheses

  • The Null Hypothesis (H0): Postulates that there is no effect, no difference, or no relationship between variables. It is the "boring" explanation.

  • The Research (Alternative) Hypothesis (H1 or Ha): Postulates that there is an effect, a difference, or a relationship between variables.

  • Test Example (Caffeine and Memory):
     * H0: Caffeine has no effect on memory.
     * H1: Caffeine improves memory.

  • Exam Traps to Avoid:
     * Accepting the Research Hypothesis: You never "accept" H1. You either reject the null (H0) or fail to reject the null (H0).
     * Mixing Hypotheses: Remember that the null hypothesis is always the one stating no difference.

  • Memory Tip: Null = Nothing.

Probability and Statistical Significance

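  • p-value: The probability of obtaining results at least as extreme as those observed if the null hypothesis (H0) were true. It is not the probability that the null hypothesis is true.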

  • Common Alpha Cutoff: The standard threshold (alpha) is .05; a result with p < .05 is called statistically significant.

  • Test Example (p = .02): Because p = .02 is less than the cutoff of .05, the researcher should reject the null hypothesis (H0), concluding the result is statistically significant.

  • Exam Traps to Avoid:
     * Misinterpreting the p-value: p = .02 does NOT mean there is a 2% chance the hypothesis is wrong. It is the probability of obtaining data at least this extreme if the null is true.
     * Significance vs. Importance: Statistical significance does not always equate to practical importance; with a large enough sample, a tiny, unimportant effect can still be significant.

  • Memory Tip: p = probability of data this extreme if the null is true.
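
One way to see what a p-value measures is a permutation test: shuffle the group labels many times (which is what the null hypothesis says is harmless) and count how often the shuffled difference is at least as large as the observed one. The sketch below is illustrative only; the scores, the function name `permutation_p_value`, and the choice of 10,000 shuffles are all assumptions, not part of the guide.

```python
import random

def permutation_p_value(group_a, group_b, n_shuffles=10_000, seed=0):
    """Two-sided p-value for a difference in means, estimated by label shuffling."""
    rng = random.Random(seed)
    n_a, n_b = len(group_a), len(group_b)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)
    pooled = list(group_a) + list(group_b)
    extreme = 0
    for _ in range(n_shuffles):
        # Under H0 the labels are interchangeable, so reshuffle them.
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed:
            extreme += 1
    return extreme / n_shuffles

# Hypothetical exam scores (made up for illustration).
more_sleep = [82, 88, 75, 91, 85, 79, 90, 86]
less_sleep = [74, 80, 69, 83, 77, 72, 81, 70]

p = permutation_p_value(more_sleep, less_sleep)
print(f"p = {p:.3f}; reject H0 at alpha = .05: {p < 0.05}")
```

The returned value is the share of shuffles that look at least as extreme as the real data, which matches the definition of p as a statement about the data under the null rather than about the hypothesis itself.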

The t-Test (Difference Between Two Groups)

  • Usage: Used specifically when comparing the means of exactly two groups.

  • Test Example: Comparing the test scores of students who studied versus those who did not.

  • Types of t-tests:
     * Independent t-test: Used for different groups of participants.
     * Paired t-test: Used for the same participants measured twice.

  • Directional Tailing:
     * One-tailed test: Predicts a specific direction (e.g., "Group A will score higher than Group B").
     * Two-tailed test: Predicts a difference exists but does not specify the direction (e.g., "Groups will differ").

  • Exam Traps to Avoid:
     * Wait-and-See Approach: You cannot choose a one-tailed test after seeing the results.
     * Ease of Significance: A one-tailed test makes it easier to reach statistical significance in the predicted direction, but it cannot detect an effect in the opposite direction, so the direction must be justified in advance.

  • Memory Tip: One tail = one direction; Two tail = difference either way.
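
The difference between the independent and paired t-tests shows up in how the t statistic is computed. The following is a minimal sketch using only the Python standard library; the function names and all scores are hypothetical, and the independent version assumes equal variances (the pooled-variance form).

```python
from math import sqrt
from statistics import mean, stdev, variance

def independent_t(group_a, group_b):
    """Pooled-variance t statistic and df for two different groups of participants."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / df
    standard_error = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(group_a) - mean(group_b)) / standard_error, df

def paired_t(before, after):
    """t statistic and df for the same participants measured twice."""
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / sqrt(n)), n - 1

# Hypothetical data: two separate groups of students.
studied = [85, 90, 78, 92, 88]
did_not_study = [75, 80, 70, 85, 72]
t_ind, df_ind = independent_t(studied, did_not_study)

# Hypothetical data: the same five students tested twice.
first_test = [20, 22, 19, 24, 21]
second_test = [23, 25, 20, 27, 25]
t_pair, df_pair = paired_t(first_test, second_test)

print(f"independent: t({df_ind}) = {t_ind:.2f}")
print(f"paired:      t({df_pair}) = {t_pair:.2f}")
```

Turning t into a p-value requires the t distribution. If SciPy is available, `scipy.stats.ttest_ind` and `scipy.stats.ttest_rel` report p-values, and their `alternative` argument selects a one-tailed or two-tailed test.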

The F-Test (ANOVA – 3+ Groups or Multiple Variables)

  • Usage: Used for comparing three or more groups or handling multiple independent variables.

  • Key Concept: ANOVA compares variance rather than just means. The ratio is defined as:
     * F = Systematic Variance / Error Variance
     * Systematic Variance: Represents the real effect (e.g., the independent variable [IV] is working).
     * Error Variance: Represents random noise or fluctuation.

  • Test Example: Comparing the effectiveness of three different teaching methods.

  • Exam Traps to Avoid:
     * Specific Group Differences: The F-test itself does NOT tell you which specific groups differ; it only indicates that at least one difference exists.
     * Post Hoc Tests: If the F-test is significant, you must perform follow-up (post hoc) tests to identify which groups are different.

  • Memory Tip: F = "Factor effect vs. Fluctuation (noise)."
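
The F ratio can be computed by hand for a one-way design: between-group variance (systematic) divided by within-group variance (error). This is a sketch with made-up scores for three teaching methods; the function name `one_way_f` is an assumption for illustration.

```python
from statistics import mean

def one_way_f(*groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean([x for g in groups for x in g])
    # Systematic variance: how far each group mean sits from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Error variance: how far scores sit from their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, n_total - k
    return (ss_between / df_between) / (ss_within / df_within), df_between, df_within

# Hypothetical test scores under three teaching methods.
method_a = [70, 72, 68, 74]
method_b = [78, 80, 76, 82]
method_c = [85, 88, 84, 87]

f_ratio, df1, df2 = one_way_f(method_a, method_b, method_c)
print(f"F({df1}, {df2}) = {f_ratio:.2f}")
```

A large F only says the group means are not all equal; it does not identify which pair differs, which is why post hoc tests follow.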

Confidence Intervals (CI)

  • Definition: A range of values within which the true population value likely falls.

  • Test Example: If a sample mean is 75 and the 95% confidence interval is [70, 80], we are 95% confident that the real population mean lies between 70 and 80.

  • Exam Traps to Avoid:
     * Data Distribution Myth: The interval does NOT mean 95% of all data points fall within that range.
     * Single Interval Probability: It is technically misleading to say there is a 95% chance the true mean is in "this specific" interval. The 95% refers to the long-run confidence level of the procedure.

  • Memory Tip: CI = "Where truth probably lives."
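
The "long-run procedure" reading of a confidence interval can be checked by simulation: draw many samples from a population whose mean is known, build an interval from each, and count how often the interval captures the true mean. The sketch below makes several assumptions not in the guide: a normal population with mean 75 and SD 10, samples of 50, and a normal-approximation interval (small samples would call for t critical values instead).

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for a sample mean."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    margin = z * stdev(sample) / sqrt(len(sample))
    center = mean(sample)
    return center - margin, center + margin

def coverage_rate(true_mean=75.0, true_sd=10.0, n=50, reps=2000, seed=1):
    """Fraction of repeated-sample intervals that contain the true mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(true_mean, true_sd) for _ in range(n)]
        low, high = mean_confidence_interval(sample)
        if low <= true_mean <= high:
            hits += 1
    return hits / reps

rate = coverage_rate()
print(f"Share of intervals containing the true mean: {rate:.3f}")
```

The coverage should land near 0.95. Any single interval either contains the true mean or it does not; the 95% describes how often the method succeeds across repetitions.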

Common Combo Scenarios and Advanced Questions

  • Scenario 1 (p = .04): Since it is less than .05, you must reject the null hypothesis (H0).

  • Scenario 2 (Predicting "A > B"): This requires a one-tailed t-test.

  • Scenario 3 (Comparing 4 groups): This requires an F-test (ANOVA).

  • Scenario 4 (CI does not include 0): If a 95% confidence interval for the difference between groups does not include 0, the difference is statistically significant at the .05 level (two-tailed).

Single-Case Experimental Designs

  • Definition: A study that measures a single person (or a very small number of people) repeatedly over time, typically across phases (e.g., baseline vs. treatment).

  • Test Example: A therapist measures a patient's anxiety levels for 2 weeks (baseline), then introduces a specific therapy and continues measurement.

  • Phase Structure: Often referred to as an A-B design (Baseline - Treatment).

  • Why Use It?:
     * Studying rare conditions.
     * Managing ethical or practical limitations.
     * Capturing detailed individual patterns.

  • Exam Traps to Avoid:
     * Frequency of Measurement: It is NOT a single measurement; it is repeated measures.
     * The Baseline as Control: The baseline phase acts as the comparison (control).
     * Generalizability: These designs have low external validity (hard to generalize to the whole population).

  • Memory Tip: Single-case = "one person, many measurements."
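
In an A-B design the baseline phase plays the role of the control condition, so the basic summary is a comparison of phase averages. The sketch below uses made-up daily anxiety ratings for one patient; the function name `phase_summary` and the 0 to 10 rating scale are assumptions for illustration.

```python
from statistics import mean

def phase_summary(baseline, treatment):
    """Compare the average level in phase A (baseline) and phase B (treatment)."""
    baseline_mean = mean(baseline)
    treatment_mean = mean(treatment)
    return {
        "baseline_mean": baseline_mean,
        "treatment_mean": treatment_mean,
        "change": treatment_mean - baseline_mean,
    }

# Hypothetical anxiety ratings (0 to 10) for a single patient.
baseline_ratings = [8, 7, 9, 8, 8, 9, 7]    # phase A: before therapy
treatment_ratings = [7, 6, 5, 5, 4, 4, 3]   # phase B: during therapy

summary = phase_summary(baseline_ratings, treatment_ratings)
print(summary)
```

A drop in the phase B average is suggestive only; with one person and no random assignment, other explanations for the change remain possible.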

One-Group Experimental Designs

  • One-Group Posttest-Only Design:
     * One group → treatment → measure outcome (no pre-treatment data).
     * Example: Students use a new educational app, then take a test without recording prior scores.
     * Weakness: You cannot measure change or rule out alternative explanations.

  • One-Group Pretest–Posttest Design:
     * Measure (pretest) → treatment → measure again (posttest).
     * Example: Measure stress → give meditation training → measure stress again.
     * Issue: Highly susceptible to internal validity threats.

Threats to Internal Validity: "Hi MaTRI"

These threats are common professor favorites for exam traps:

  1. History: An external event happens between measurements (e.g., a massive tutoring program starts while you are testing a reading app).

  2. Maturation: Participants naturally change or grow over time (e.g., children's reading naturally improves as they age).

  3. Testing: The act of taking the first test changes performance on subsequent tests due to practice effects.

  4. Regression Toward the Mean: If participants are selected for extreme scores (very high or very low), their scores tend to move closer to the average on retest, even without any treatment.

  5. Instrument Decay: The measurement tool or observer becomes less reliable over time (e.g., a tired observer records behavior less accurately).

  • Memory Tip: "Hi MaTRI" (History, Maturation, Testing, Regression, Instrument decay).
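
Regression toward the mean is easy to demonstrate by simulation, because it arises from measurement noise alone. The sketch below assumes a simple model that is not in the guide: each observed score is a stable true score plus independent random error, with no treatment applied between the two tests. All names and parameter values are hypothetical.

```python
import random
from statistics import mean

def simulate_regression(n=5000, top_fraction=0.10, seed=42):
    """Select the top scorers on test 1 and compare their averages on both tests."""
    rng = random.Random(seed)
    true_scores = [rng.gauss(100, 10) for _ in range(n)]
    # Observed score = true score + random measurement error.
    test1 = [t + rng.gauss(0, 10) for t in true_scores]
    test2 = [t + rng.gauss(0, 10) for t in true_scores]
    cutoff = sorted(test1)[int(n * (1 - top_fraction))]
    selected = [i for i in range(n) if test1[i] >= cutoff]
    first = mean([test1[i] for i in selected])
    second = mean([test2[i] for i in selected])
    return first, second, mean(test1)

first_mean, second_mean, overall_mean = simulate_regression()
print(f"Overall mean on test 1:        {overall_mean:.1f}")
print(f"Extreme group, test 1 average: {first_mean:.1f}")
print(f"Extreme group, test 2 average: {second_mean:.1f}")
```

The extreme group's second average falls back toward the overall mean even though nothing was done to them, which is why a one-group pretest-posttest study of extreme scorers can show "improvement" with no real effect.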

Nonequivalent Control Group Designs

  • Nonequivalent Control Group Design:
     * Two groups are compared, but they are NOT randomly assigned.
     * Example: One school uses a new method while another school uses the old one.
     * Key Issue: Groups may differ significantly before the study even begins (selection bias).

  • Nonequivalent Control Group Pretest–Posttest:
     * Both groups are measured before and after the treatment.
     * Benefit: Stronger because you can compare the amount of change between groups.
     * Weakness: Still lacks random assignment, so preexisting differences between the groups (selection) remain a threat.
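
Comparing "the amount of change between groups" amounts to subtracting the control group's gain from the treatment group's gain. The following sketch uses invented scores for two schools; the function name `change_difference` is an assumption for illustration.

```python
from statistics import mean

def change_difference(treat_pre, treat_post, control_pre, control_post):
    """Return each group's pre-to-post change and the difference between changes."""
    treat_change = mean(treat_post) - mean(treat_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treat_change, control_change, treat_change - control_change

# Hypothetical scores: School A uses the new method, School B the old one.
school_a_pre, school_a_post = [60, 65, 70, 55], [72, 75, 80, 69]
school_b_pre, school_b_post = [68, 72, 75, 65], [71, 76, 78, 67]

treat_change, control_change, extra_gain = change_difference(
    school_a_pre, school_a_post, school_b_pre, school_b_post
)
print(f"New-method change: {treat_change}")
print(f"Old-method change: {control_change}")
print(f"Extra gain with the new method: {extra_gain}")
```

Note that School B starts higher at pretest in this made-up data. The pretest makes that gap visible, but without random assignment it cannot rule out that the schools differ in other ways.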

Factorial Designs: The Big Idea

  • Definition: A study involving two or more independent variables (IVs), where each variable has multiple levels.

  • Notation: A 2 × 3 design means:
     * 2 levels of IV1.
     * 3 levels of IV2.
     * Total conditions = 2 × 3 = 6.

  • Test Example: Studying sleep (4 hrs vs. 8 hrs) and caffeine (yes vs. no).

  • Purpose: To study multiple variables at once and identify interaction effects.

  • Memory Tip: Factorial = "factors (IVs) combined."
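
The number of conditions in a factorial design is the product of the levels, and the conditions themselves are every combination of one level from each IV. The sketch below enumerates them for a hypothetical 2 × 3 design; the three caffeine doses are invented here to give the second IV three levels, and `list_conditions` is an illustrative name.

```python
from itertools import product

def list_conditions(**factors):
    """Return every combination of one level per independent variable."""
    names = list(factors)
    return [dict(zip(names, combo)) for combo in product(*factors.values())]

# Hypothetical 2 x 3 design: 2 sleep levels crossed with 3 caffeine doses.
conditions = list_conditions(
    sleep=["4 hrs", "8 hrs"],
    caffeine=["none", "low dose", "high dose"],
)

for number, condition in enumerate(conditions, start=1):
    print(number, condition)
print(f"Total conditions: {len(conditions)}")
```

Adding a third IV multiplies the count again, so a 2 × 3 × 2 design would have 12 conditions.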

Factorial Effects: Main vs. Interaction

  • Main Effects: The overall effect of a single IV while ignoring (averaging across) all other variables.
     * Example: Does caffeine improve performance overall, regardless of sleep duration?

  • Interaction Effects: Occur when the effect of one IV depends on the specific level of another IV.
     * Example: Caffeine helps memory ONLY when sleep is low; it has no effect when sleep is high.
     * Language Cues: Watch for phrases like "depends on," "only when," or "different pattern."

  • Simple Main Effects: A follow-up analysis performed only when an interaction is found. It "zooms in" on one IV at a specific level of another (e.g., checking the effect of caffeine at exactly 4 hours of sleep).
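
Main effects, simple main effects, and the interaction can all be read off a table of cell means. The sketch below works through a 2 × 2 sleep-by-caffeine example with made-up memory scores and assumes equal numbers of participants per cell; the helper names are hypothetical.

```python
from statistics import mean

# Hypothetical mean memory scores for each (sleep, caffeine) cell.
cell_means = {
    ("4 hrs", "caffeine"): 70,
    ("4 hrs", "no caffeine"): 60,
    ("8 hrs", "caffeine"): 80,
    ("8 hrs", "no caffeine"): 80,
}

def marginal_mean(cells, factor_index, level):
    """Average the cells at one level of an IV, ignoring the other IV."""
    return mean([value for key, value in cells.items() if key[factor_index] == level])

def simple_effect_of_caffeine(cells, sleep_level):
    """Effect of caffeine at one specific level of sleep."""
    return cells[(sleep_level, "caffeine")] - cells[(sleep_level, "no caffeine")]

# Main effects: compare marginal means for each IV.
caffeine_main = marginal_mean(cell_means, 1, "caffeine") - marginal_mean(cell_means, 1, "no caffeine")
sleep_main = marginal_mean(cell_means, 0, "8 hrs") - marginal_mean(cell_means, 0, "4 hrs")

# Simple main effects: zoom in on caffeine at each sleep level.
effect_at_4 = simple_effect_of_caffeine(cell_means, "4 hrs")
effect_at_8 = simple_effect_of_caffeine(cell_means, "8 hrs")

# Interaction: does the caffeine effect differ across sleep levels?
interaction = effect_at_4 - effect_at_8

print(f"Main effect of caffeine: {caffeine_main}")
print(f"Main effect of sleep:    {sleep_main}")
print(f"Caffeine effect at 4 hrs: {effect_at_4}, at 8 hrs: {effect_at_8}")
print(f"Interaction contrast:    {interaction}")
```

Here caffeine helps by 10 points at 4 hours and by 0 points at 8 hours, so the averaged main effect of 5 hides the real pattern. Whether these differences are statistically significant would still require a factorial ANOVA on the raw scores.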

Types of Factorial Assignment Designs

  • Independent Groups Design: Different participants are assigned to each condition (e.g., Group 1 gets caffeine, Group 2 does not).

  • Repeated Measures Design: The same participants experience all conditions (e.g., everyone is tested with and without caffeine).
     * Risks: Order effects, fatigue, and carryover effects.

  • Mixed Factorial Design: A combination where one IV is between-subjects (independent groups) and one IV is within-subjects (repeated measures).
     * Example: All participants are tested with/without caffeine (repeated), but participants are divided into sleep groups (4 hrs vs. 8 hrs) using different individuals (independent).

External Validity and Generalization

  • Definition: How well results generalize outside the specific study to other people, settings, and times.

  • Generalizing Across People:
     * Sex and Gender: Results from men may not apply to women or across gender identities.
     * Race and Ethnicity: Cultural or social differences can affect responses.
     * Culture: Findings may differ between individualistic and collectivist societies.

  • Sampling Issues:
     * College Students: Often young, educated, and not representative of the general public.
     * Volunteers: May be more motivated or have different traits than non-volunteers.
     * Online Samples: Broader reach but suffer from self-selection bias.

  • WEIRD Samples: The acronym for samples that are Western, Educated, Industrialized, Rich, and Democratic.

  • Nonhuman Animal Research: Findings in rats (e.g., learning a maze) can provide clues, but biological differences limit generalization to humans.

Research Settings and Replication

  • Laboratory vs. Real World: Lab settings can be artificial and may not reflect behavior in everyday life.

  • Pretest Effects: Taking a pretest can make participants more self-aware or reactive, changing their responses.

  • Researcher Characteristics: The experimenter’s personal style (friendly vs. strict) can influence participant behavior.

  • Replication:
     * Exact Replication: Repeating the study using the exact same procedures as the original.
     * Conceptual Replication: Testing the same underlying idea or hypothesis using a different method or task.

Literature Reviews vs. Meta-Analysis

  • Literature Review: A qualitative summary of existing research describing trends (uses words).

  • Meta-Analysis: A quantitative statistical combination of results across many studies to determine an overall effect size (uses numbers).

  • Memory Tip: Meta = Math.
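
The "math" in a meta-analysis is often a weighted average of effect sizes, where more precise studies count for more. The sketch below uses one common approach, fixed-effect inverse-variance weighting, chosen here for illustration rather than taken from the guide; the three effect sizes and standard errors are invented.

```python
from math import sqrt

def fixed_effect_summary(effects, standard_errors):
    """Inverse-variance weighted mean effect size and its standard error."""
    weights = [1 / se ** 2 for se in standard_errors]
    pooled = sum(w * d for w, d in zip(weights, effects)) / sum(weights)
    pooled_se = sqrt(1 / sum(weights))
    return pooled, pooled_se

# Hypothetical effect sizes (e.g., Cohen's d) from three studies.
study_effects = [0.30, 0.50, 0.20]
study_standard_errors = [0.10, 0.20, 0.10]

pooled_effect, pooled_se = fixed_effect_summary(study_effects, study_standard_errors)
print(f"Overall effect size: {pooled_effect:.3f} (SE = {pooled_se:.3f})")
```

The second study has the largest effect but also the largest standard error, so it pulls the overall estimate less than the two more precise studies do.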

High-Yield Summary of Exam Traps

  • Statistically Significant is not synonymous with Important.

  • Rejecting the Null (H0) is not synonymous with Proving the Research Hypothesis (H1).

  • The p-value is the probability of data at least this extreme under the null, not the probability that the hypothesis is true.

  • The F-test (ANOVA) tells you something is different, not specifically which groups are different.

  • Large sample sizes do not guarantee a representative sample.

  • One-tailed predictions must be established before the data is collected.

  • External Validity relates to generalizability, whereas Internal Validity relates to cause-and-effect certainty.