P-values, Generalizability, and Experimental Control

Understanding P-values and Generalizability in Research

P-value: What it is and How to Interpret It

Definition: The p-value indicates how often a random process would yield a result at least as extreme as what was found in an actual study, assuming only random chance was at play.
Example: Infant Preference Study
- Scenario: Infants were given a choice between two toys: a helper toy and another toy.
- Assumption (Null Hypothesis): Infants had no preference, meaning each infant had a $50\%$ chance of picking the helper toy, similar to tossing a coin.
- Observed Outcome: $14$ out of $16$ infants picked the helper toy.
- Probability Calculation: How likely is it to get $14$ (or more) heads in $16$ coin tosses? This is about as likely as getting $9$ heads in a row.
- Calculated P-value: The probability that $14$ or more out of $16$ infants would choose the helper toy, assuming no preference, is $0.0021$ .
- Interpretation of P-value ( $0.0021$ ): This means such an outcome would occur only about $2.1$ times in $1,000$ iterations of a purely random process (e.g., $16$ coin flips).
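The tail probability quoted above can be checked directly. The following is a minimal sketch, assuming the coin-toss (binomial) model with a $50\%$ chance per infant; the function name `binomial_upper_tail` is illustrative and does not come from the study.

```python
from math import comb

def binomial_upper_tail(k: int, n: int, p: float = 0.5) -> float:
    """Probability of k or more successes in n independent trials."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# 14 or more helper-toy choices among 16 infants, assuming no preference
p_value = binomial_upper_tail(14, 16)

# Probability of 9 heads in a row, for the comparison made in the notes
nine_heads = 0.5 ** 9

print(round(p_value, 4))     # 0.0021
print(round(nine_heads, 4))  # 0.002
```

The exact value is $137/65536$, since only $120 + 16 + 1 = 137$ of the $2^{16}$ equally likely outcomes have $14$, $15$, or $16$ heads.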
Logical Possibilities with a Low P-value:
1. The infants have a genuine preference for the helper toy.
2. The infants have no preference ( $50/50$ ), and a very rare chance event (occurring $2$ out of $1,000$ times) happened in this study.
Conclusion: Because the p-value of $0.0021$ is very small, researchers conclude there is very strong evidence that these infants have a genuine preference for the helper toy.
Level of Significance (Alpha, $\alpha$ ):
- The p-value is often compared to a pre-determined cut-off value, typically $\alpha = 0.05$ .
- Decision Rule: If the p-value is smaller than the level of significance ($p < \alpha$), then the hypothesis that only random chance was at play is rejected.
- In the Infant Study: Since $0.0021 < 0.05$, researchers would reject the idea that the infants had no preference and conclude that significantly more than half of them chose the helper toy, which is strong evidence of a genuine preference for the toy with the helping behavior.
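The "purely random process" behind the p-value can also be imitated by simulation. This is a rough sketch, assuming each infant behaves like an independent fair coin; the function name, the number of trials, and the seed are illustrative choices, not part of the study.

```python
import random

def simulate_p_value(n_infants=16, observed=14, trials=100_000, seed=0):
    """Estimate how often chance alone gives `observed` or more helper choices."""
    rng = random.Random(seed)
    extreme = 0
    for _ in range(trials):
        # Each infant picks the helper toy with probability 0.5 (no preference)
        helper_choices = sum(rng.random() < 0.5 for _ in range(n_infants))
        if helper_choices >= observed:
            extreme += 1
    return extreme / trials

ALPHA = 0.05  # level of significance from the notes
estimated_p = simulate_p_value()
reject_chance_only = estimated_p < ALPHA  # decision rule: p < alpha
print(estimated_p, reject_chance_only)
```

The estimate varies slightly with the seed and the number of trials, but it should land near $0.002$, well below $\alpha = 0.05$.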

Generalizability in Research

Definition: Generalizability refers to the extent to which conclusions drawn from a study's sample can be applied to a larger group of individuals (the population).
Limitation of the Infant Study: The conclusion strictly applies only to the $16$ infants in that study because the selection method for these infants is unknown, making it hard to generalize to a broader infant population.
Importance of Sampling: To generalize findings, a subset of individuals (a sample) must be selected from a much larger group (the population) in a way that allows the sample's conclusions to extend to the population.
Polling Analogy: Choosing a sample whose results can be extended to the population is a daily concern for pollsters.
Example: The General Social Survey (GSS)
- Purpose: A survey on societal trends in the United States, conducted every other year, used to make claims about the U.S. adult population (e.g., percentage identifying as "liberal," "happy," or feeling "rushed").
- Sample Size: Typically based on about $2,000$ adult Americans.
- Key to Generalizability: How the sample is selected is crucial for making claims about the broader population of all American adults.
- Random Sampling:
  - Goal: To obtain a sample representative of the population.
  - Method: A common way is to select a random sample, giving every member of the population an equal chance of being selected.
  - Simplest Form: Listing all population members and using a computer to randomly select a subset.
  - Real-world Polls: Often use probability-based sampling methods from nationally representative panels rather than simple random sampling.
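The simplest form of random sampling described above can be sketched in a few lines. This assumes a complete list of population members is available; the ID numbers and the population size of $100{,}000$ are hypothetical stand-ins, and real surveys such as the GSS use more elaborate probability-based designs.

```python
import random

# Hypothetical sampling frame: ID numbers standing in for population members
population = list(range(1, 100_001))

rng = random.Random(42)  # fixed seed only so the sketch is repeatable
# random.sample draws without replacement; every member is equally likely
sample = rng.sample(population, k=2000)

print(len(sample), len(set(sample)))  # 2000 2000
```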
GSS Data Example: Feeling Rushed
- Finding: GSS reported that $817$ of $977$ respondents ( $83.6\%$ ) indicated they "always or sometimes" feel rushed.
- Considering Variation: Random sampling inherently introduces variation.
- Probability Model Application: A coin-toss model still applies as long as the population is much larger than the sample, because the probability of a given response then stays essentially the same for each individual selected.
- Margin of Error:
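  - Roughly $1/\sqrt{n}$, where $n$ is the sample size.
  - With $977$ respondents, $1/\sqrt{977} \approx 0.03$, so the probability model predicts that the sample result will be within about $3$ percentage points of the true population value in roughly $95\%$ of random samples.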
  - Confidence Interval: A statistician would conclude, with $95\%$ confidence, that between $80.6\%$ and $86.6\%$ of adult Americans in $2004$ would have reported feeling rushed.
  - Meaning: When using a probability sampling method, the margin of error allows researchers to make claims about how often (in the long run, with repeated random sampling) the sample result would fall within a certain distance from the unknown population value due to chance (random sampling variation).
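The arithmetic behind the margin of error and the interval can be reproduced as follows. This is a minimal sketch using only the $1/\sqrt{\text{sample size}}$ rule of thumb from the notes; the variable names are illustrative.

```python
from math import sqrt

respondents = 977
felt_rushed = 817

sample_proportion = felt_rushed / respondents   # about 0.836
margin_of_error = 1 / sqrt(respondents)         # about 0.032, rule of thumb

lower = sample_proportion - margin_of_error
upper = sample_proportion + margin_of_error
print(f"{sample_proportion:.1%} +/- {margin_of_error:.1%}")
print(f"interval: {lower:.1%} to {upper:.1%}")
```

Without rounding, the margin is closer to $3.2$ percentage points, so the computed interval is slightly wider than the $80.6\%$ to $86.6\%$ quoted above, which uses the margin rounded to $3$ points.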
Bias in Non-random Samples: Non-random samples are often prone to bias, meaning the sampling method systematically over-represents some segments of the population and under-represents others.

Causation vs. Association

Distinction: Association (or correlation) does not equal causation.
Example: Teething Babies: When babies get their first teeth, saliva production increases, but increased saliva does not cause them to get teeth.
Cause and Effect Studies: The primary question often concerns differences between groups.
Group Formation:
- Observational Studies: Researchers observe pre-existing groups (e.g., coffee drinkers vs. non-coffee drinkers).
- Experimental Studies: Researchers actively form the groups themselves.
- Challenge: Could observed differences be an artifact of the group-formation process, or is the difference large enough to discount chance? Is there a "fluke" in the group formation process?

Controlling for Variables in Experiments

Importance of Control: In experiments, it is crucial to control as many of the variables that might affect the outcome as possible, so that the effect of the variable of interest can be isolated.
Revisiting the Infant Study (Control Measures):
- Toy Color and Shape: Prior to data collection, researchers ensured each color and shape (e.g., red square, blue circle) was seen by an equal number of infants.
- Handedness: Prior to data collection, researchers arranged for half the infants to see the helper toy on the right and half on the left, to account for potential right-handed tendencies.
- Wooden Character Shapes: Researchers controlled for this by rotating which shape (square, triangle, circle) represented the helper, hinderer, and climber roles.
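A balanced assignment like the left/right arrangement above could be set up as in the following sketch. The shuffle, the seed, and the infant ID numbers are assumptions made for illustration; the notes only state that half the infants saw the helper toy on each side, not how the assignment was made.

```python
import random

N_INFANTS = 16  # sample size from the study

# Half the infants see the helper toy on the left, half on the right
helper_sides = ["left"] * (N_INFANTS // 2) + ["right"] * (N_INFANTS // 2)

rng = random.Random(1)  # fixed seed only so the sketch is repeatable
rng.shuffle(helper_sides)  # assumption: which infant gets which side is random

# Hypothetical infant IDs 1..16 paired with their assigned side
assignment = dict(enumerate(helper_sides, start=1))
print(assignment)
```

Because the counts are fixed before shuffling, the two sides stay exactly balanced no matter how the order comes out.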
Inherent Randomness: Even with controls, there's always some inherent randomness. If the same $16$ infants were re-tested, they might not make the same choices. A probability model can investigate long-term patterns if chance were the only factor.

Example: Motivation and Creativity Study (Amabile, 1985; Ramsey & Schafer, 2002)

Research Question: Does the type of motivation (intrinsic vs. extrinsic) affect creativity scores?
Subjects: $47$ people with extensive creative writing experience.
Procedure:
1. Subjects answered survey questions about either intrinsic motivations (e.g., pleasure of self-expression) or extrinsic motivations (e.g., public recognition).
2. All subjects then wrote a haiku.
3. A panel of judges evaluated the haikus for creativity (higher scores indicate more creativity).
Researchers' Conjecture: Subjects thinking about intrinsic motivations would display more creativity.
Results (Figure 2 Visual Representation): Both groups showed considerable variability and their scores overlapped substantially, so it is not true that every subject in one group scored higher than every subject in the other, but there may still be a statistical tendency.
- Psychologist Keith Stanovich (2013) refers to difficulties in thinking about probabilistic tendencies as "the Achilles heel of human cognition."
Mean Creativity Scores:
- Intrinsic Group: $19.88$
- Extrinsic Group: $15.74$
- The higher mean in the intrinsic group is consistent with the researchers' conjecture.
Considering Variability: Comparing only means is insufficient; variability must be considered.
Standard Deviation: Measures variability.
- Extrinsic Group: $5.25$
- Intrinsic Group: $4.40$
- Interpretation: Most creativity scores are within about $5$ points of the mean in each group.
- The mean score for the intrinsic group ( $19.88$ ) falls within one standard deviation of the mean score for the extrinsic group (meaning it is within the range of $15.74 \pm 5.25$ which is $10.49$ to $20.99$ ). Therefore, while there's a tendency for intrinsic scores to be higher, the difference is not