Statistical Thinking: Key Concepts and Inference in Statistical Investigation

Learning Objectives

Define basic elements of a statistical investigation.
Describe the role of p-values and confidence intervals in statistical inference.
Describe the role of random sampling in generalizing conclusions from a sample to a population.
Describe the role of random assignment in drawing cause-and-effect conclusions.
Critique statistical studies.

Introduction to Statistical Thinking (Four Studies Emphasized)

Society increasingly relies on evidence-based decision making; statistics helps draw valid inferences from data.
The module uses four recent research studies to highlight key elements of a statistical investigation.
Emphasis on planning, data examination, inference, and drawing conclusions beyond the observed data.
Example discussed: coffee consumption and life expectancy from Freedman et al. (2012).
Takeaway: Do not rely on anecdote or intuition; use systematic statistical thinking to gain insight from data.
Real-world relevance: data are ubiquitous; statistics guides interpretation for decisions and policies.

The Three-Step Method (context for learning)

Step 1: Plan the study (develop a testable question and data-collection plan).
Step 2: Examine the data (select appropriate graphs, descriptive statistics, patterns, and variability).
Step 3: Infer from the data (assess whether observed patterns could be due to random variation; generalize beyond the sample and consider potential causal interpretations when applicable).
The method helps organize thinking about how to answer research questions and assess the strength of conclusions.

Elements of a Statistical Investigation

Planning the study
- Formulate a testable research question.
- Decide how data will be collected (sampling method, measurements, variables collected).
- Consider study design details: how long the study lasts, recruitment methods, participant demographics (age, smoking, etc.), and any changes imposed (e.g., coffee habit changes).
Examining the data
- Choose appropriate graphs and descriptive statistics to summarize relevant aspects.
- Look for patterns, variability, reliability, and validity.
- Compare distributions (e.g., smoker vs. non-smoker groups) rather than relying solely on centers (means/medians).
Inferring from the data
- Apply valid statistical methods to draw inferences beyond the observed data.
- Assess whether observed effects (e.g., a 10%–15% reduction in risk) could occur by chance alone.
Drawing conclusions
- Determine to whom conclusions apply (external validity: who are the people in the study? ages, health status, location).
- Consider whether the study supports a cause-and-effect conclusion about treatments or exposures.
- Recognize that numerical analysis is only one part of the investigation; interpretation and context are crucial.

Distributional Thinking

Data vary, and the pattern of variation is crucial to understanding phenomena.
Presenting data carefully (distributions) can answer questions and reveal further questions without resorting to overly simplistic summaries.
Comparing only centers (e.g., medians) can be misleading; the full distribution provides more insight.
Example: cancer pamphlets vs. patient reading levels
- 63 patients assessed for reading ability; 30 pamphlets assessed for readability (variables: patient reading level, pamphlet readability).
- Distributions reveal misalignment: many patients (17 of 63, about 27%) have reading levels below that of the most readable pamphlet.
- Figure comparing distributions shows that medians alone miss important variation.
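
The point about medians hiding misalignment can be illustrated with a small sketch. The grade levels below are invented for illustration only (they are not the study's data), and `share_below_easiest` is a hypothetical helper name; only the final 17/63 check uses numbers from these notes.

```python
from statistics import median

# Hypothetical grade-level data, invented for illustration only
# (NOT the values from the cancer-pamphlet study).
patient_levels = [3, 4, 5, 5, 6, 7, 8, 9, 9, 10, 11, 12, 12]
pamphlet_levels = [6, 7, 8, 8, 9, 9, 9, 10, 11, 12, 13]


def share_below_easiest(readers, materials):
    """Fraction of readers whose level is below the easiest material."""
    easiest = min(materials)
    return sum(level < easiest for level in readers) / len(readers)


# The two medians are close, which on its own suggests a reasonable match.
print("median patient level :", median(patient_levels))
print("median pamphlet level:", median(pamphlet_levels))

# The full distributions show who is left out entirely.
print("share of patients below easiest pamphlet:",
      round(share_below_easiest(patient_levels, pamphlet_levels), 2))

# Arithmetic check on the count reported in the notes (17 of 63 patients).
print("reported share:", round(17 / 63, 2))
```

In this made-up data the medians differ by only one grade level, yet a sizeable share of readers falls below every pamphlet, which is the kind of pattern a comparison of centers alone would miss.
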
Measurements can have uncertainty due to measurement error, snapshot sampling, or small sample size.
Assessment of evidence requires looking at distributional patterns and variability, not just central tendency.

Statistical Significance: Assessing Random Variation

Example: Hamlin, Wynn, & Bloom (2007) infant study on helping vs. hindering agents
- Of 16 infants, 14 chose the helper toy after exposure to helper/hinderer scenarios.
- Consider alternative explanations (toy color, shapes, handedness, position) and how they were controlled (rotation of conditions to balance potential effects).
- Random variation remains a possible explanation: a 14-of-16 split could, in principle, result from chance alone.
Probability model for the observed result under the null hypothesis of no preference
- If each infant is equally likely to choose either toy, each choice is a Bernoulli trial with probability p = 0.5 of choosing the helper.
- What is the probability of observing 14 or more helpers in 16 trials?
- Computed p-value: $P(X \ge 14) = 0.0021$ under the null model.
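
The tail probability above can be checked directly from the binomial formula. This is a minimal sketch using only the standard library; the function name `binomial_upper_tail` is an illustrative choice, not something from the source.

```python
from math import comb


def binomial_upper_tail(n, k, p=0.5):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# 16 infants, 14 or more choosing the helper, no preference (p = 0.5).
p_value = binomial_upper_tail(16, 14)

# Equivalent count: (C(16,14) + C(16,15) + C(16,16)) / 2**16 = 137 / 65536
print(f"P(X >= 14) = {p_value:.4f}")
```

Only 137 of the 65,536 equally likely choice patterns have 14 or more helper choices, which is where the 0.0021 comes from.
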
P-value concept
- The p-value tells how often a random process would yield a result as extreme or more extreme than what was observed, assuming random chance is the only factor.
- If the p-value is smaller than the chosen significance level, typically $\alpha = 0.05$, we reject the null hypothesis of random chance.
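
The "how often would a random process do this" reading of the p-value can also be approximated by simulation. The sketch below flips a fair coin for each of 16 hypothetical infants, repeats that many times, and counts how often 14 or more "helper" choices appear; the function name, repetition count, and seed are arbitrary assumptions.

```python
import random


def simulate_null(n_infants=16, observed=14, reps=100_000, seed=1):
    """Estimate P(helpers >= observed) when each choice is a fair coin flip."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    extreme = 0
    for _ in range(reps):
        helpers = sum(rng.random() < 0.5 for _ in range(n_infants))
        if helpers >= observed:
            extreme += 1
    return extreme / reps


estimate = simulate_null()
print(f"simulated p-value: {estimate:.4f}")
```

The estimate varies a little with the seed and the number of repetitions, but it should land close to the exact binomial value of about 0.0021.
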
Decision rule example
- With p-value = 0.0021 < 0.05, we have strong evidence of a genuine preference for the helper toy.
Generalizability (external validity) begins here: larger or more representative samples improve generalizability.

Generalizability and Sampling

Generalizability: results from widely representative samples are more likely to generalize to the population.
Limitation: conclusions from a study apply to the specific sample (e.g., the 16 infants) unless sampling is representative.
Random sampling is a key method to generalize findings to a larger population.
How to sample
- Simple form: number each member of the population and randomly select a subset.
- Many real polls use probability-based sampling methods to obtain nationally representative panels.
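
The simple form of random sampling described above (number everyone, then draw a subset at random) might look like this in code. The population size, sample size, and seed are placeholders, and `simple_random_sample` is a hypothetical helper; real surveys typically use more elaborate probability designs.

```python
import random


def simple_random_sample(population_size, sample_size, seed=None):
    """Draw sample_size distinct ID numbers from 1..population_size."""
    rng = random.Random(seed)
    # rng.sample draws without replacement, so no ID is chosen twice
    # and every subset of the requested size is equally likely.
    return rng.sample(range(1, population_size + 1), sample_size)


chosen = simple_random_sample(population_size=10_000, sample_size=25, seed=42)
print(sorted(chosen))
```
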
Example: General Social Survey (GSS)
- Based on a sample of about 2,000 adult Americans.
- Used to infer population proportions on issues like self-identification as liberal, happiness, and feeling rushed.
Margin of error and confidence
- A probability sample yields a margin of error; for a sample proportion at 95% confidence it is typically approximated by $\text{ME} \approx \frac{1}{\sqrt{n}}$ (for large populations and simple random samples).
- Example: in the 2004 GSS, 83.6% of respondents reported feeling rushed (817/977).
- We can be 95% confident that the true population value lies within ± ME of the sample percentage; here, ME ≈ 3 percentage points (since $\frac{1}{\sqrt{977}} \approx 0.032 \approx 3\%$).
Non-random samples can introduce bias by systematically over- or under-representing segments of the population.
Other sources of error (e.g., dishonest responses) are not captured by the margin of error.
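
The GSS arithmetic above can be reproduced in a few lines. The sketch also computes the standard textbook margin $1.96\sqrt{\hat{p}(1-\hat{p})/n}$ for comparison; that formula is not part of these notes and is included only to show that $1/\sqrt{n}$ is the conservative (worst-case, $\hat{p} = 0.5$) version of it. Function names are illustrative.

```python
from math import sqrt


def conservative_margin(n):
    """Rule-of-thumb 95% margin of error for a proportion: 1 / sqrt(n)."""
    return 1 / sqrt(n)


def standard_margin(p_hat, n, z=1.96):
    """Textbook 95% margin using the observed proportion (added for comparison)."""
    return z * sqrt(p_hat * (1 - p_hat) / n)


n = 977
p_hat = 817 / n                      # sample proportion feeling rushed
me = conservative_margin(n)          # rule of thumb from the notes
low, high = p_hat - me, p_hat + me   # approximate 95% interval

print(f"p_hat = {p_hat:.3f}")
print(f"ME (1/sqrt(n)) = {me:.3f}")
print(f"interval = ({low:.3f}, {high:.3f})")
print(f"ME (1.96 * SE) = {standard_margin(p_hat, n):.3f}")
```

Because the observed proportion is far from 0.5, the comparison margin comes out smaller than the rule-of-thumb value, so the ±3 points quoted in the notes errs on the cautious side.
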

Cause and Effect and Random Assignment

A key task is distinguishing group differences caused by the treatment from differences caused by how the groups were formed.
Random assignment helps balance both known and unknown variables across groups, making causal conclusions more plausible.
Example 4: intrinsic vs. extrinsic motivation and creativity (Ramsey & Schafer, 2002; Amabile, 1985)
- 47 experienced creative writers were assigned to intrinsic or extrinsic motivation groups via random assignment.
- Observed means: intrinsic = 19.88, extrinsic = 15.74; suggests higher creativity under intrinsic motivation.
- However, variability within groups matters; distributions overlap substantially (Figure 2).
- Standard deviations: extrinsic SD = 5.25; intrinsic SD = 4.40.
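- Because the means differ only modestly relative to this spread, random assignment is crucial: it lets us ask whether chance in forming the groups could explain the gap, and so isolate the treatment effect.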
What random assignment accomplishes
- Tends to balance all variables (known and unknown) across groups, making differences more attributable to the treatment.
- A potential unlucky draw could still exist; we quantify this with a p-value under the null that the treatment has no effect.
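
As a rough sketch of what random assignment means operationally: shuffle the participant list and deal participants into groups. The ID numbers, the alternating deal, and the resulting group sizes are assumptions made for illustration; the notes do not say how the 47 writers were split.

```python
import random


def randomly_assign(participant_ids, group_names=("intrinsic", "extrinsic"), seed=None):
    """Shuffle participants, then deal them into groups in turn."""
    rng = random.Random(seed)
    shuffled = list(participant_ids)
    rng.shuffle(shuffled)  # chance alone decides the order
    groups = {name: [] for name in group_names}
    for position, pid in enumerate(shuffled):
        groups[group_names[position % len(group_names)]].append(pid)
    return groups


# 47 hypothetical participant IDs, numbered 1..47.
groups = randomly_assign(range(1, 48), seed=7)
for name, members in groups.items():
    print(name, len(members), sorted(members))
```

Because the shuffle ignores everything about the participants, any characteristic (measured or not) has the same chance of ending up in either group.
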
How to test the assignment effect without assuming different populations
- Suppose each writer's score would have been the same regardless of group (the null hypothesis of no treatment effect), and simulate random reassignment of the observed scores many times.
- Example: 1,000 hypothetical random assignments; observed difference = 4.14 points (19.88 − 15.74).
- Only 2 of 1,000 simulated random assignments produced a difference as large as or larger than the observed 4.14 (the source text prints 4.41 here, which does not match 19.88 − 15.74 = 4.14), giving an approximate p-value of $\frac{2}{1000} = 0.002$.
- Result: the observed difference is very unlikely to have arisen from the random assignment alone; this supports a causal interpretation that intrinsic motivation increases creativity scores in this sample.
Caution on generalization from randomized experiments
- Generalize cautiously to individuals similar to those in the study (extensive creative writing experience).
- We need more information about the sampling process to generalize to broader populations.

The Importance of Diversity in Psychological Science

Diversity considerations go beyond sex/gender dichotomies to include race, age, geography, socioeconomic status, and more.
The field has historically used binary gender categories, which may fail to capture the diversity of identities.
Diversity and inclusion are central themes that influence interpretation and generalizability of research findings.
The course notes that gender, sex, and related topics will be addressed in later units, highlighting the need to examine these topics carefully.
Emphasis on asking how representative the sample is and how findings may generalize across diverse populations.

The Scientific Method and the Role of Randomness in Inference

The scientific method in psychology involves: hypothesize → design a study → conduct the study → analyze the data → report results.
Statistical thinking requires careful study design, pattern analysis, and conclusions that go beyond the observed data.
Random sampling is essential for generalizing results to a population; random assignment is essential for causal conclusions.
Probability models help quantify how much random variation to expect and determine whether observed results could occur by chance.
Margin of error and confidence levels provide a framework to express uncertainty in estimates.

The Coffee Study Case (Long-Run Evidence and Cautions)

The discussed coffee study (Freedman et al., 2012) is a large, 14-year observational study published in a major journal (New England Journal of Medicine).
Study design and scope
- More than 402,000 people aged 50–71 from six states and two metropolitan areas.
- Excluded individuals with cancer, heart disease, or stroke at baseline.
- Coffee consumption assessed once at baseline.
Key findings
- About 52,000 deaths occurred during follow-up.
- Higher coffee consumption was associated with lower death risk; reductions were more pronounced for those drinking six or more cups daily.
- No clear difference between the effects of caffeinated and decaffeinated coffee.
Important interpretation cautions
- This was an observational study, so it cannot support the causal conclusion that coffee increases longevity.
- Possible confounding factors: people with chronic diseases might avoid coffee, among other potential confounders.
- Results should be reviewed in the context of similar studies and across study designs to assess consistency and plausibility.
- Statistical adjustment can address some confounders, but not all; residual confounding remains a concern.
Implications for policy and decision making
- Observational findings can inform hypotheses and guide future focused studies, including randomized experiments where feasible.

Summary and Practical Takeaways

A statistical investigation comprises planning, data examination, inference, and drawing cautious conclusions about populations and causal relationships.
Distributional thinking emphasizes examining full data distributions, not just centers, to avoid misleading conclusions.
P-values quantify how unlikely observed results are under a null hypothesis; a p-value below the chosen significance level (commonly $\alpha = 0.05$) suggests rejecting random chance as an explanation.
Random sampling supports generalizability to a population; margin of error quantifies the expected range of variation due to sampling randomness, with approximate formula $\text{ME} \approx \frac{1}{\sqrt{n}}$ for proportions in large samples at 95% confidence.
Random assignment supports causal interpretations by balancing confounding variables across groups; observed differences in outcomes under randomization require examination of how often such differences would occur by chance (p-value from permutation or simulation tests).
Diversity and inclusivity are essential for the external validity of psychological science; findings may not generalize across all populations if samples lack representativeness.
In interpreting studies, distinguish between evidence of association (observational) and evidence of causation (randomized experiments), while considering the broader literature and methodological limitations.

Key Definitions and Concepts (glossary)

- Population: the entire group of interest from which a sample is drawn.
- Sample: a subset of the population selected for study.
- Random sampling: a sampling method where every member of the population has an equal chance of being chosen; facilitates generalizability and guards against sampling bias.
- Margin of error (ME): the largest distance expected, at the stated confidence level, between the sample statistic and the population parameter in repeated sampling; for proportions at 95% confidence, approximated by $\text{ME} \approx \frac{1}{\sqrt{n}}$ (in proportion terms).
- Confidence level: the long-run proportion of samples for which the interval (estimate ± ME) contains the population parameter (e.g., 95%).
- P-value: the probability, under the null hypothesis, of obtaining a result as extreme or more extreme than the observed one.
- Level of significance (alpha): the threshold for deciding whether to reject the null hypothesis (commonly $\alpha = 0.05$).
- Random assignment: allocating participants to groups by chance to ensure equivalence of groups on average.
- Observational study: a study where the researcher observes variables without manipulating the study environment; can show associations but not causation.
- Causation vs. association: causation implies the exposure directly changes the outcome; association indicates a relationship but not necessarily a causal link.
- Bias: systematic error that leads to incorrect conclusions due to the sampling method, measurement, or other processes.
- Variability/distribution: how data points spread around a central tendency; understanding distribution is essential for interpreting patterns.

Notable Numerical References and Equations (LaTeX)

- Probability of observing 14 or more helper choices in 16 Bernoulli trials with p = 0.5 under the null: $P(X \ge 14) = 0.0021$
- Difference in means in Example 4: $\Delta = \bar{x}_{\text{intrinsic}} - \bar{x}_{\text{extrinsic}} = 19.88 - 15.74 = 4.14$
- Reported standard deviations in creativity scores: extrinsic $\sigma_{E} = 5.25$, intrinsic $\sigma_{I} = 4.40$
- Large-scale coffee study details: sample size over 402,000; age range 50–71; follow-up duration 14 years; number of deaths about 52,000; years and groups not broken down beyond coffee intake categories.
- Margin of error example for GSS (2004): around $\pm 3\%$ with 95% confidence, given sample size around 977; margin approximates $\text{ME} \approx \frac{1}{\sqrt{977}} \approx 0.032 \approx 3\%$.
- Causal inference via random assignment: probability model and simulations show that observed differences in means under random assignment are unlikely to occur by chance alone (example p-value ≈ $0.002$).
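
The random-reassignment simulation summarized in these notes can be sketched as a one-sided permutation test. The individual creativity scores are not given here, so the two score lists below are invented placeholders (ten per group, not the study's 47 writers), and the function name, repetition count, and seed are likewise assumptions.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, reps=1000, seed=0):
    """Share of random re-splits whose mean difference is >= the observed one."""
    rng = random.Random(seed)
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    count = 0
    for _ in range(reps):
        rng.shuffle(pooled)  # scores stay fixed; only group labels change
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            count += 1
    return observed, count / reps


# Hypothetical scores for illustration only (NOT the study's data).
intrinsic = [12.0, 17.2, 18.5, 19.1, 20.6, 21.3, 22.1, 23.4, 24.0, 26.7]
extrinsic = [5.0, 10.9, 12.0, 14.8, 16.8, 17.2, 18.7, 19.5, 21.2, 22.1]

observed, p = permutation_p_value(intrinsic, extrinsic)
print(f"observed difference = {observed:.2f}, approximate p-value = {p:.3f}")
```

The test is one-sided ("as large or larger"), matching the wording in the notes; a two-sided version would compare absolute differences instead.
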