P-values, Generalizability, and Experimental Control

Understanding P-values and Generalizability in Research

P-value: What it is and How to Interpret It

  • Definition: The p-value indicates how often a random process would yield a result at least as extreme as what was found in an actual study, assuming only random chance was at play.
  • Example: Infant Preference Study
    • Scenario: Infants were given a choice between two toys: a helper toy and another toy.
    • Assumption (Null Hypothesis): Infants had no preference, meaning each infant had a 50% chance of picking the helper toy, similar to tossing a coin.
    • Observed Outcome: 14 out of 16 infants picked the helper toy.
    • Probability Calculation: How likely is it to get 14 (or more) heads in 16 coin tosses? This is about as likely as getting 9 heads in a row.
    • Calculated P-value: The probability that 14 or more out of 16 infants would choose the helper toy, assuming no preference, is 0.0021.
    • Interpretation of P-value (0.0021): This means such an outcome would occur only about 2.1 times in 1,000 iterations of a purely random process (e.g., 16 coin flips).
  • Logical Possibilities with a Low P-value:
    1. The infants have a genuine preference for the helper toy.
    2. The infants have no preference (50/50), and a very rare chance event (occurring about 2 out of 1,000 times) happened in this study.
  • Conclusion: Because the p-value of 0.0021 is very small, researchers conclude there is very strong evidence that these infants have a genuine preference for the helper toy.
  • Level of Significance (Alpha, α):
    • The p-value is often compared to a pre-determined cut-off value, typically α = 0.05.
    • Decision Rule: If the p-value is smaller than the level of significance (p < α), then the hypothesis that only random chance was at play is rejected.
    • In the Infant Study: Since 0.0021 < 0.05, researchers would reject the idea that infants had no preference. Significantly more than half of the infants in the study chose the helper toy, which is strong evidence of a genuine preference for the toy with the helping behavior.
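
The p-value and the decision rule above can be reproduced with a short calculation. This is a minimal sketch, assuming the coin-toss (binomial) model described in the example; the names `binomial_tail` and `ALPHA` are illustrative and do not come from the study.

```python
from math import comb

def binomial_tail(k: int, n: int, p: float = 0.5) -> float:
    """Probability of k or more successes in n independent trials."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

ALPHA = 0.05  # conventional level of significance

# 14 or more of 16 infants choosing the helper toy under "no preference"
p_value = binomial_tail(14, 16)
reject_null = p_value < ALPHA

print(f"p-value = {p_value:.4f}")                # p-value = 0.0021
print(f"nine heads in a row = {0.5**9:.4f}")     # nine heads in a row = 0.0020
print(f"reject 'chance only' at alpha = {ALPHA}: {reject_null}")
```

The exact tail probability is 137/65536, which rounds to the 0.0021 quoted in the notes and is close to the chance of nine heads in a row.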

Generalizability in Research

  • Definition: Generalizability refers to the extent to which conclusions drawn from a study's sample can be applied to a larger group of individuals (the population).
  • Limitation of the Infant Study: The conclusion strictly applies only to the 16 infants in that study because the selection method for these infants is unknown, making it hard to generalize to a broader infant population.
  • Importance of Sampling: To generalize findings, a subset of individuals (a sample) must be selected from a much larger group (the population) in a way that allows the sample's conclusions to extend to the population.
  • Polling Analogy: Choosing a sample whose results can be extended to the population is a daily concern for pollsters.
  • Example: The General Social Survey (GSS)
    • Purpose: An annual survey on societal trends in the United States, used to make claims about the U.S. adult population (e.g., percentage identifying as "liberal," "happy," or feeling "rushed").
    • Sample Size: Typically based on about 2,000 adult Americans.
    • Key to Generalizability: How the sample is selected is crucial for making claims about the broader population of all American adults.
    • Random Sampling:
      • Goal: To obtain a sample representative of the population.
      • Method: A common way is to select a random sample, giving every member of the population an equal chance of being selected.
      • Simplest Form: Listing all population members and using a computer to randomly select a subset.
      • Real-world Polls: Often use probability-based sampling methods from nationally representative panels rather than simple random sampling.
  • GSS Data Example: Feeling Rushed
    • Finding: GSS reported that 817 of 977 respondents (83.6%) indicated they "always or sometimes" feel rushed.
    • Considering Variation: Random sampling inherently introduces variation.
    • Probability Model Application: A coin-toss model can be used when the population size is much larger than the sample size, keeping the probability the same for each individual.
    • Margin of Error:
      • Roughly 1/√(sample size).
      • The probability model predicts the sample result will be within 3 percentage points of the true population value.
      • Confidence Interval: A statistician would conclude, with 95% confidence, that between 80.6% and 86.6% of adult Americans in 2004 would have reported feeling rushed.
      • Meaning: When using a probability sampling method, the margin of error allows researchers to make claims about how often (in the long run, with repeated random sampling) the sample result would fall within a certain distance from the unknown population value due to chance (random sampling variation).
  • Bias in Non-random Samples: Non-random samples are often prone to bias, meaning the sampling method systematically over-represents some segments of the population and under-represents others.
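
The margin-of-error arithmetic for the "feeling rushed" result, and the long-run meaning of that margin, can be sketched as follows. This is an illustration under stated assumptions: it uses the rough 1/√(sample size) rule from the notes, and the population value of 0.80 passed to the simulation is hypothetical, chosen only so the simulation has a known "truth" to compare against. The function names are also illustrative.

```python
import random
from math import sqrt

def rough_margin_of_error(n: int) -> float:
    """Rule-of-thumb margin of error for a sample proportion: 1/sqrt(n)."""
    return 1 / sqrt(n)

def coverage(true_p: float, n: int, reps: int, seed: int = 0) -> float:
    """Share of simulated random samples whose result lands within the
    rough margin of error of the (here, known) population value."""
    rng = random.Random(seed)
    margin = rough_margin_of_error(n)
    hits = 0
    for _ in range(reps):
        # Coin-toss model: each respondent says "yes" with probability true_p
        yes = sum(rng.random() < true_p for _ in range(n))
        if abs(yes / n - true_p) <= margin:
            hits += 1
    return hits / reps

n, rushed = 977, 817
p_hat = rushed / n
margin = rough_margin_of_error(n)
low, high = p_hat - margin, p_hat + margin

print(f"sample proportion = {p_hat:.3f}")   # sample proportion = 0.836
print(f"margin of error   = {margin:.3f}")  # margin of error   = 0.032
print(f"interval = {low:.3f} to {high:.3f}")
# Hypothetical population value of 0.80, used only for illustration.
print(f"coverage = {coverage(0.80, n, reps=2000):.3f}")
```

Rounding the margin to 3 percentage points gives the 80.6% to 86.6% interval quoted above. Because 1/√(sample size) is a conservative rule of thumb, the simulated coverage should come out at or above roughly 95%.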

Causation vs. Association

  • Distinction: Association (or correlation) does not equal causation.
  • Example: Teething Babies: When babies get their first teeth, saliva production increases, but increased saliva does not cause them to get teeth.
  • Cause and Effect Studies: The primary question often concerns differences between groups.
  • Group Formation:
    • Observational Studies: Researchers observe pre-existing groups (e.g., coffee drinkers vs. non-coffee drinkers).
    • Experimental Studies: Researchers actively form the groups themselves.
    • Challenge: Could observed differences be an artifact of the group-formation process, or is the difference large enough to discount chance? Is there a "fluke" in the group formation process?

Controlling for Variables in Experiments

  • Importance of Control: In experiments, it is crucial to control as many of the variables that might affect the outcome as possible, in order to isolate the effect of the variable of interest.
  • Revisiting the Infant Study: Control Measures
    • Toy Color and Shape: Prior to data collection, researchers ensured each color and shape (e.g., red square, blue circle) was seen by an equal number of infants.
    • Handedness: Prior to data collection, researchers arranged for half the infants to see the helper toy on the right and half on the left, to account for potential right-handed tendencies.
    • Wooden Character Shapes: Researchers controlled for this by rotating which shape (square, triangle, circle) represented the helper, hinderer, and climber roles.
  • Inherent Randomness: Even with controls, there's always some inherent randomness. If the same 16 infants were re-tested, they might not make the same choices. A probability model can investigate long-term patterns if chance were the only factor.
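
One way to investigate the long-run pattern when chance is the only factor is to simulate it. The sketch below assumes the fair-coin model for each infant's choice; the function name `simulate_null`, the number of repetitions, and the seed are illustrative choices rather than anything specified by the study.

```python
import random

def simulate_null(n_infants: int = 16, threshold: int = 14,
                  reps: int = 100_000, seed: int = 1) -> float:
    """Estimate how often `threshold` or more of `n_infants` pick the helper
    toy when every choice is a fair coin flip (chance is the only factor)."""
    rng = random.Random(seed)
    extreme = 0
    for _ in range(reps):
        helper_picks = sum(rng.random() < 0.5 for _ in range(n_infants))
        if helper_picks >= threshold:
            extreme += 1
    return extreme / reps

estimate = simulate_null()
print(f"simulated long-run proportion: {estimate:.4f}")
```

The estimate varies slightly with the seed and the number of repetitions, but it should land close to the p-value of 0.0021 reported for the infant study.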

Example: Motivation and Creativity Study (Amabile, 1985; Ramsey & Schafer, 2002)

  • Research Question: Does the type of motivation (intrinsic vs. extrinsic) affect creativity scores?
  • Subjects: 47 people with extensive creative writing experience.
  • Procedure:
    1. Subjects answered survey questions about either intrinsic motivations (e.g., pleasure of self-expression) or extrinsic motivations (e.g., public recognition).
    2. All subjects then wrote a haiku.
    3. A panel of judges evaluated the haikus for creativity (higher scores indicate more creativity).
  • Researchers' Conjecture: Subjects thinking about intrinsic motivations would display more creativity.
  • Results (Figure 2 Visual Representation): Both groups showed considerable variability, and scores had considerable overlap, meaning it's not always true that one group has higher creativity, but there might be a statistical tendency.
    • Psychologist Keith Stanovich (2013) refers to difficulties in thinking about probabilistic tendencies as "the Achilles heel of human cognition."
  • Mean Creativity Scores:
    • Intrinsic Group: 19.88
    • Extrinsic Group: 15.74 (supports the conjecture)
  • Considering Variability: Comparing only means is insufficient; variability must be considered.
  • Standard Deviation: Measures variability.
    • Extrinsic Group: 5.25
    • Intrinsic Group: 4.40
    • Interpretation: Most creativity scores are within about 5 points of the mean in each group.
    • The mean score for the intrinsic group (19.88) falls within one standard deviation of the mean score for the extrinsic group (meaning it is within the range of 15.74 ± 5.25, which is 10.49 to 20.99). Therefore, while there's a tendency for intrinsic scores to be higher, the difference is not