P-values, Generalizability, and Experimental Control
Understanding P-values and Generalizability in Research
P-value: What it is and How to Interpret It
Definition: The p-value indicates how often a random process would yield a result at least as extreme as what was found in an actual study, assuming only random chance was at play.
Example: Infant Preference Study
Scenario: Infants were given a choice between two toys: a helper toy and another toy.
Assumption (Null Hypothesis): Infants had no preference, meaning each infant had a 50% chance of picking the helper toy, similar to tossing a coin.
Observed Outcome: 14 out of 16 infants picked the helper toy.
Probability Calculation: How likely is it to get 14 (or more) heads in 16 coin tosses? This is about as likely as getting 9 heads in a row.
Calculated P-value: The probability that 14 or more out of 16 infants would choose the helper toy, assuming no preference, is 0.0021.
Interpretation of P-value (0.0021): This means such an outcome would occur only about 2.1 times in 1,000 iterations of a purely random process (e.g., 16 coin flips).
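The 0.0021 figure can be checked by hand. The sketch below assumes the coin-toss model stated above (16 independent choices, each with probability 0.5) and adds up the binomial probabilities for 14, 15, and 16 helper-toy picks. The function name upper_tail_probability is illustrative and does not come from the study.

```python
from math import comb

def upper_tail_probability(n: int, k: int, p: float = 0.5) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# 14 or more helper-toy picks among 16 infants, assuming no preference.
p_value = upper_tail_probability(16, 14)

# The comparison made in the notes: 9 heads in a row.
nine_heads_in_a_row = 0.5 ** 9

print(f"P(14 or more of 16) = {p_value:.4f}")              # 0.0021
print(f"P(9 heads in a row) = {nine_heads_in_a_row:.4f}")  # 0.0020
```

The two probabilities agree to about three decimal places, which is why the 9-heads-in-a-row comparison works as a rough intuition.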
Logical Possibilities with a Low P-value:
The infants have a genuine preference for the helper toy.
The infants have no preference (50/50), and a very rare chance event (occurring 2 out of 1,000 times) happened in this study.
Conclusion: Because the p-value of 0.0021 is very small, researchers conclude there is very strong evidence that these infants have a genuine preference for the helper toy.
Level of Significance (Alpha, α):
The p-value is often compared to a pre-determined cut-off value, typically α=0.05.
Decision Rule: If the p-value is smaller than the level of significance (p < α), then the hypothesis that only random chance was at play is rejected.
In the Infant Study: Since 0.0021 < 0.05, researchers would reject the idea that the infants had no preference and conclude that significantly more than half of them chose the helper toy, which is strong evidence of a genuine preference for the toy that showed helping behavior.
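The "random process" reading of the p-value and the decision rule can also be illustrated by simulation. This is a minimal sketch, not the researchers' procedure: it repeats a 16-coin-flip "study" many times, counts how often 14 or more heads turn up, and compares that share to α = 0.05. The repetition count and the seed are arbitrary assumptions.

```python
import random

ALPHA = 0.05            # level of significance from the notes
N_INFANTS = 16
OBSERVED = 14
REPETITIONS = 100_000   # assumption: number of simulated studies

rng = random.Random(2024)  # fixed seed so the run is repeatable

def simulate_study(rng: random.Random, n: int) -> int:
    """Count helper-toy picks when each of n infants chooses by coin flip."""
    return sum(rng.random() < 0.5 for _ in range(n))

# Share of simulated studies at least as extreme as the observed one.
extreme = sum(
    simulate_study(rng, N_INFANTS) >= OBSERVED for _ in range(REPETITIONS)
)
estimated_p = extreme / REPETITIONS

# Decision rule: reject "no preference" when the p-value is below alpha.
reject_null = estimated_p < ALPHA

print(f"estimated p-value: {estimated_p:.4f}")
print("reject 'no preference':", reject_null)
```

The estimate varies a little from seed to seed but should land close to 0.002, far below the 0.05 cut-off.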
Generalizability in Research
Definition: Generalizability refers to the extent to which conclusions drawn from a study's sample can be applied to a larger group of individuals (the population).
Limitation of the Infant Study: The conclusion strictly applies only to the 16 infants in that study because the selection method for these infants is unknown, making it hard to generalize to a broader infant population.
Importance of Sampling: To generalize findings, a subset of individuals (a sample) must be selected from a much larger group (the population) in a way that allows the sample's conclusions to extend to the population.
Polling Analogy: This is a daily concern for pollsters.
Example: The General Social Survey (GSS)
Purpose: A survey on societal trends in the United States, conducted every other year and used to make claims about the U.S. adult population (e.g., the percentage identifying as "liberal," "happy," or feeling "rushed").
Sample Size: Typically based on about 2,000 adult Americans.
Key to Generalizability: How the sample is selected is crucial for making claims about the broader population of all American adults.
Random Sampling:
Goal: To obtain a sample representative of the population.
Method: A common way is to select a random sample, giving every member of the population an equal chance of being selected.
Simplest Form: Listing all population members and using a computer to randomly select a subset.
Real-world Polls: Often use probability-based sampling methods from nationally representative panels rather than simple random sampling.
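The "simplest form" of random sampling described above can be sketched in a few lines. The population here is a hypothetical list of ID numbers standing in for a complete list of population members; the population size and the seed are assumptions made only for illustration.

```python
import random

# Hypothetical sampling frame: ID numbers standing in for a full population list.
population_ids = list(range(1, 100_001))
sample_size = 2_000  # roughly the GSS sample size mentioned above

rng = random.Random(7)  # fixed seed so the run is repeatable

# Draw without replacement; every ID has the same chance of being selected.
sample_ids = rng.sample(population_ids, sample_size)

print("first five selected IDs:", sample_ids[:5])
print("chance of selection per member:", sample_size / len(population_ids))
```

As the notes point out, real polls rarely have a complete list to draw from, so this shows the idea of simple random sampling rather than how a survey such as the GSS is actually fielded.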
GSS Data Example: Feeling Rushed
Finding: In 2004, the GSS reported that 817 of 977 respondents (83.6%) said they "always or sometimes" feel rushed.
Considering Variation: Random sampling inherently introduces variation.
Probability Model Application: A coin-toss model can be used when the population is much larger than the sample, because the probability of a given response then stays essentially the same for each individual selected.
Margin of Error:
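Roughly 1/√n, where n is the sample size. For the 977 respondents, 1/√977 ≈ 0.032, or about 3 percentage points.
The probability model predicts that the sample result will be within about 3 percentage points of the true population value.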
Confidence Interval: A statistician would conclude, with 95% confidence, that between 80.6% and 86.6% of adult Americans in 2004 would have reported feeling rushed.
Meaning: When using a probability sampling method, the margin of error allows researchers to make claims about how often (in the long run, with repeated random sampling) the sample result would fall within a certain distance from the unknown population value due to chance (random sampling variation).
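The margin-of-error arithmetic and its long-run meaning can be checked with a short sketch. It takes the margin of error as 1 over the square root of the sample size, which matches the 3-point figure above. The long-run check assumes, purely for illustration, that the true population proportion is 0.836; that value is unknown in practice. The repetition count and seed are also assumptions.

```python
import random
from math import sqrt

successes, n = 817, 977       # GSS figures quoted above
p_hat = successes / n
margin = 1 / sqrt(n)          # rough margin of error

print(f"sample proportion:     {p_hat:.3f}")
print(f"rough margin of error: {margin:.3f}")
print(f"interval:              {p_hat - margin:.3f} to {p_hat + margin:.3f}")

# Long-run check under an ASSUMED true proportion (illustration only).
TRUE_P = 0.836
REPETITIONS = 2_000
rng = random.Random(11)

def sample_proportion(rng: random.Random, n: int, p: float) -> float:
    """Proportion of 'yes' answers in one random sample of size n."""
    return sum(rng.random() < p for _ in range(n)) / n

hits = sum(
    abs(sample_proportion(rng, n, TRUE_P) - TRUE_P) <= margin
    for _ in range(REPETITIONS)
)
coverage = hits / REPETITIONS
print(f"share of samples within the margin: {coverage:.3f}")
```

The unrounded interval (about 0.804 to 0.868) is slightly wider than the 80.6% to 86.6% quoted above, because the notes round the margin to 3 points. The 1/√n rule is a conservative approximation, so the simulated share of samples within the margin should come out above 95%.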
Bias in Non-random Samples: Non-random samples are often prone to bias, meaning the sampling method systematically over-represents some segments of the population and under-represents others.
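A small simulation can make the effect of over-representation concrete. Everything in this sketch is invented for illustration: a hypothetical population with two equal-sized segments that answer "yes" at different rates, and a non-random sample that draws 90% of its respondents from one segment. None of these numbers come from the GSS.

```python
import random

rng = random.Random(3)  # fixed seed so the run is repeatable

# Hypothetical population: two equal-sized segments with different "yes" rates.
segment_a = [rng.random() < 0.90 for _ in range(50_000)]
segment_b = [rng.random() < 0.70 for _ in range(50_000)]
population = segment_a + segment_b

def mean(values):
    """Share of True values in a list of booleans."""
    return sum(values) / len(values)

true_rate = mean(population)

# Random sample: every member is equally likely to be chosen.
random_estimate = mean(rng.sample(population, 1_000))

# Non-random sample: segment A supplies 90% of respondents instead of 50%.
biased_sample = rng.sample(segment_a, 900) + rng.sample(segment_b, 100)
biased_estimate = mean(biased_sample)

print(f"true rate:       {true_rate:.3f}")
print(f"random sample:   {random_estimate:.3f}")
print(f"biased sample:   {biased_estimate:.3f}")
```

The random sample misses the true rate only by chance variation, while the biased sample overshoots it systematically. Taking a larger biased sample would not fix that.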
Causation vs. Association
Distinction: Association (or correlation) does not equal causation.
Example: Teething Babies: When babies get their first teeth, saliva production increases, but increased saliva does not cause them to get teeth.
Cause and Effect Studies: The primary question often concerns differences between groups.
Group Formation:
Observational Studies: Researchers observe pre-existing groups (e.g., coffee drinkers vs. non-coffee drinkers).
Experimental Studies: Researchers actively form the groups themselves.
Challenge: Could observed differences be an artifact of the group-formation process, or is the difference large enough to discount chance? Is there a "fluke" in the group formation process?
Controlling for Variables in Experiments
Importance of Control: In experiments, it is crucial to control as many variables that might affect the outcome as possible, so that the effect of the variable of interest can be isolated.
Revisiting the Infant Study (Control Measures):
Toy Color and Shape: Prior to data collection, researchers ensured each color and shape (e.g., red square, blue circle) was seen by an equal number of infants.
Handedness: Prior to data collection, researchers arranged for half the infants to see the helper toy on the right and half on the left, to account for potential right-handed tendencies.
Wooden Character Shapes: Researchers controlled for this by rotating which shape (square, triangle, circle) represented the helper, hinderer, and climber roles.
Inherent Randomness: Even with controls, there is always some inherent randomness. If the same 16 infants were re-tested, they might not make the same choices. A probability model lets researchers examine what the long-run pattern of results would look like if chance were the only factor.
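The left/right balancing described above can be sketched as a simple assignment routine. This is a hypothetical illustration, not the researchers' actual procedure: the infant ID numbers, the seed, and the use of shuffling to decide which infants get which side are all assumptions.

```python
import random
from collections import Counter

rng = random.Random(5)  # fixed seed so the run is repeatable

infant_ids = list(range(1, 17))  # 16 infants, numbered for illustration

# Half the infants see the helper toy on the left, half on the right.
sides = ["left"] * 8 + ["right"] * 8
rng.shuffle(sides)  # which infant gets which side is left to chance

helper_side = dict(zip(infant_ids, sides))
side_counts = Counter(helper_side.values())

for infant, side in helper_side.items():
    print(f"infant {infant:2d}: helper toy on the {side}")
print(dict(side_counts))
```

Fixing the counts at 8 and 8 before shuffling guarantees the balance. Flipping a coin per infant would only balance the sides on average.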
Example: Motivation and Creativity Study (Amabile, 1985; Ramsey & Schafer, 2002)
Research Question: Does the type of motivation (intrinsic vs. extrinsic) affect creativity scores?
Subjects: 47 people with extensive creative writing experience.
Procedure:
Subjects answered survey questions about either intrinsic motivations (e.g., pleasure of self-expression) or extrinsic motivations (e.g., public recognition).
All subjects then wrote a haiku.
A panel of judges evaluated the haikus for creativity (higher scores indicate more creativity).
Researchers' Conjecture: Subjects thinking about intrinsic motivations would display more creativity.
Results (Figure 2 Visual Representation): Both groups showed considerable variability, and the scores overlapped considerably. It is therefore not the case that every subject in one group scored higher than every subject in the other, but there may be a statistical tendency.
Psychologist Keith Stanovich (2013) refers to difficulties in thinking about probabilistic tendencies as "the Achilles heel of human cognition."
Mean Creativity Scores:
Intrinsic Group: 19.88
Extrinsic Group: 15.74 (supports the conjecture)
Considering Variability: Comparing only means is insufficient; variability must be considered.
Standard Deviation: Measures variability.
Extrinsic Group: 5.25
Intrinsic Group: 4.40
Interpretation: Most creativity scores are within about 5 points of the mean in each group.
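Using the means and standard deviations listed above, the comparison between the groups comes down to a few lines of arithmetic. The sketch below works only from those four summary numbers, since the individual scores are not reproduced in these notes. The variable names are illustrative.

```python
# Summary statistics quoted above.
means = {"intrinsic": 19.88, "extrinsic": 15.74}
sds = {"intrinsic": 4.40, "extrinsic": 5.25}

# How far apart are the two group means?
difference = means["intrinsic"] - means["extrinsic"]

# One standard deviation either side of the extrinsic mean.
low = means["extrinsic"] - sds["extrinsic"]
high = means["extrinsic"] + sds["extrinsic"]

# Does the intrinsic mean fall inside that range?
intrinsic_within_range = low <= means["intrinsic"] <= high

print(f"difference in means:     {difference:.2f}")            # 4.14
print(f"extrinsic mean +/- 1 SD: {low:.2f} to {high:.2f}")      # 10.49 to 20.99
print("intrinsic mean inside that range:", intrinsic_within_range)  # True
```

The difference of 4.14 points is smaller than either group's standard deviation, which is why the means have to be read alongside the spread of the scores.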
The mean score for the intrinsic group (19.88) falls within one standard deviation of the mean score for the extrinsic group (meaning it is within the range of 15.74±5.25 which is 10.49 to 20.99). Therefore, while there's a tendency for intrinsic scores to be higher, the difference is not