Causal Inference, Confounding, and Randomized Experiments
- Answerable social science questions must be falsifiable.
- Must imply a relationship (often causal).
- Must imply an explanatory variable and an outcome variable.
- Must imply a comparison.
- Example of a bad question: Should Joe Biden run for re-election?
- A better question: Does the Democratic Party have a better chance of winning reelection if Joe Biden does not run?
- An even better question: Is Joe Biden or Kamala Harris more likely to win the 2024 presidential election?
- Further breaking down the question into components:
- Joe Biden versus Kamala Harris is a counterfactual.
- Contains questions about how candidates win elections:
- What is the incumbent effect for presidential candidates?
- Do voters reward younger candidates?
- Is there a gender penalty for political candidates?
- Does candidate race influence voters?
- Each of these questions speaks to a broader phenomenon.
- But each is also falsifiable, implies a causal relationship, and implies a comparison.
- Example: Does flossing reduce cavity risk?
- Factual: Prof. Brown flosses every night.
- Counterfactual: Prof. Brown does not floss.
- In what situation is Prof. Brown more likely to get a cavity?
- Example: Does social media polarize voters?
- Is the comparison about whether using social media more or less in the current equilibrium would change polarization?
- Or whether a society without social media would be less polarized (historical comparison)?
- What is the unit of analysis (individuals, political societies, etc.)?
- Guiding questions:
- What is the appropriate counterfactual?
- What is the ideal comparison to consider?
- These questions guide:
- Data gathering
- Research design
- Statistical estimation
- Assumptions
- Considering bias and uncertainty
- Statistics and data show associations, not causality.
Thinking in Counterfactuals and Causality
- Thinking in counterfactuals is the motto.
- Counterfactuals cannot be observed directly.
- Only differences across units or within-units across time can be observed.
- Answers to causal questions come from combining statistical estimation with design-based assumptions.
- Associations can be misleading and must be decomposed into a causal effect and bias from selection and confounding.
Confounding and Selection Bias
- Confounding and selection bias explain why correlation is not causation and why determining causality is difficult but important.
- Association:
- Let Y_i and A_i be two variables.
- These variables have a joint distribution Pr[Y, A].
- Two variables are independent if one does not predict the other.
- In terms of conditional probabilities: Pr[Y = 1|A = 1] = Pr[Y = 1|A = 0].
- Independence is written as Y \perp A.
- If not independent, they are dependent or associated:
- Pr[Y = 1|A = 1] \neq Pr[Y = 1|A = 0].
- Associations are not necessarily due to causation, and even when a causal effect exists, the association may distort its size or direction.
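The independence condition above can be checked directly on data. The following is a minimal sketch, assuming binary Y and A; the helper names `cond_prob` and `is_associated` and both data sets are hypothetical and invented for illustration.

```python
from fractions import Fraction

def cond_prob(pairs, a):
    """Return Pr[Y = 1 | A = a] computed from a list of (a_i, y_i) pairs."""
    ys = [y for a_i, y in pairs if a_i == a]
    return Fraction(sum(ys), len(ys))

def is_associated(pairs):
    """True when Pr[Y = 1 | A = 1] differs from Pr[Y = 1 | A = 0]."""
    return cond_prob(pairs, 1) != cond_prob(pairs, 0)

# Hypothetical data sets of (A_i, Y_i) pairs, invented for illustration.
independent = [(1, 1), (1, 0), (0, 1), (0, 0)] * 25
associated = [(1, 1)] * 40 + [(1, 0)] * 10 + [(0, 1)] * 10 + [(0, 0)] * 40

print(cond_prob(independent, 1), cond_prob(independent, 0))
print(cond_prob(associated, 1), cond_prob(associated, 0))
print(is_associated(independent), is_associated(associated))
```

In a real sample the two conditional proportions will rarely be exactly equal even under independence, so in practice the comparison would need a statistical test rather than strict equality.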
Examples of Associations and Causation
- Example: High correlation between your grade and the grade of the people you work with on homework:
- Is that because working together improves both your grades?
- Or because you both chose to work with a smart person?
- Probably a bit of both.
- Place effects or residential selection:
- Population density highly correlated with Democratic voting.
- Some places (e.g., Massachusetts cities) are very liberal.
- Others (e.g., Texas, rural areas) are very conservative.
- Why?
- Causal: Living in cities makes people more liberal.
- Selection: Liberal people choose to live in cities.
- Association of density and Democratic vote is a function of both selection and causal effects.
- Selection can be really strong.
- The causal effect is often smaller or even in the opposite direction.
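The density example can be made concrete with an exact calculation. This is a sketch under invented assumptions: two voter types, a small uniform causal effect of city living (`CITY_EFFECT`), and strong sorting of liberals into cities. None of these numbers come from the notes or from data; they only show how a large association can coexist with a small causal effect.

```python
from fractions import Fraction as F

# Hypothetical population, invented for illustration:
# type -> (population share, Pr[lives in city], Pr[votes Democratic] if rural)
TYPES = {
    "liberal": (F(1, 2), F(4, 5), F(7, 10)),
    "conservative": (F(1, 2), F(1, 5), F(3, 10)),
}
CITY_EFFECT = F(1, 20)  # assumed causal effect of city living, same for everyone

def mean_vote(group_in_city, outcome_in_city):
    """Average Democratic vote among city (or rural) residents,
    evaluated as if they lived in a city (or a rural area)."""
    num = den = F(0)
    for share, p_city, base in TYPES.values():
        weight = share * (p_city if group_in_city else 1 - p_city)
        vote = base + (CITY_EFFECT if outcome_in_city else 0)
        num += weight * vote
        den += weight
    return num / den

# Observed association: city residents vs. rural residents.
naive = mean_vote(True, True) - mean_vote(False, False)
# Selection: how city residents would differ from rural residents
# even if they lived in rural areas (a counterfactual quantity).
selection = mean_vote(True, False) - mean_vote(False, False)

print("naive difference:", naive)
print("selection bias:  ", selection)
print("causal effect:   ", naive - selection)
```

Under these assumptions most of the observed gap is selection, and the remainder equals the assumed causal effect. The counterfactual term is computable here only because the sketch specifies everyone's outcomes in both conditions, which real data never provide.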
Selection Bias
- Selection is when units are more likely to be in the treated or control category in a manner that is associated with the outcome variable.
- Can be direct (selection based on outcome).
- Or indirect (i.e., selection based on another variable).
- Many kinds of selection biases:
- Selection can occur naturally:
- Confounding variables.
- Reverse causality.
- Or can be induced in the data-generating process:
- Endogenous sample.
- Conditioning on a collider.
Reading: Civic Duty and Voting
- Civic duty drives voting (Gerber, Green, Larimer).
- Tests the hypothesis that a sense of civic duty drives voting.
- Naive design: sample people and ask them if they feel a sense of civic duty to vote.
- Compare the average level of voting among people with high vs. low civic duty.
- Problem? Confounding variable - another variable could influence civic duty and voting.
- Creates a spurious correlation - appears to be causal but is not.
Confounders
- A confounding variable:
- Affects treatment status.
- Affects the outcome over and above its effect on treatment status.
- Confounders create baseline differences and thus bias.
- Anything we haven’t measured in our estimation could be a confounder.
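A small exact calculation shows how a confounder can produce a spurious correlation. This is a sketch under invented assumptions: a binary confounder Z (hypothetical, for instance some background trait) raises both civic duty and voting, while civic duty itself has no effect on voting. All probabilities are made up for illustration.

```python
from fractions import Fraction as F

# Hypothetical probabilities, invented for illustration.
P_Z = {1: F(1, 2), 0: F(1, 2)}             # distribution of the confounder Z
P_DUTY_GIVEN_Z = {1: F(3, 4), 0: F(1, 4)}  # Z affects treatment (civic duty)
P_VOTE_GIVEN_Z = {1: F(4, 5), 0: F(2, 5)}  # Z affects the outcome; duty does not

def p_vote_given_duty(duty, strata=(0, 1)):
    """Pr[vote = 1 | duty], restricted to the given values of Z."""
    num = den = F(0)
    for z in strata:
        p_duty = P_DUTY_GIVEN_Z[z] if duty == 1 else 1 - P_DUTY_GIVEN_Z[z]
        weight = P_Z[z] * p_duty
        num += weight * P_VOTE_GIVEN_Z[z]
        den += weight
    return num / den

# Naive comparison pools over Z and finds a gap.
naive_gap = p_vote_given_duty(1) - p_vote_given_duty(0)
# Comparing within each level of Z removes the baseline difference.
gap_within_z = {z: p_vote_given_duty(1, (z,)) - p_vote_given_duty(0, (z,))
                for z in (0, 1)}

print("naive gap:", naive_gap)
print("gap within strata of Z:", gap_within_z)
```

Stratifying works here only because Z is observed and is the sole confounder by construction; an unmeasured confounder could not be handled this way.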
Reverse Causality
- Reverse Causality: outcome affects treatment status.
- Association between X and Y in part due to Y’s effect on X.
- Effect of X on Y is obscured.
- Examples:
- Does toxic political discourse on social media make us polarized?
- Or do we make political discourse on social media toxic because we are polarized?
- Confounding and reverse causality are similar.
- Does campaign spending win elections?
- Incumbents who spend more have lower vote share.
- Reverse causality: lower vote share → more spending.
- Confounding: Incumbent unpopular on issues so spends more (to try to persuade voters) and has lower vote share.
- Both categories model the same problem:
- The issue is either lower vote share itself or the confounders that lead to lower vote share.
Endogenous Sample
- Another reason to be suspicious of associations is that they are often estimated from data where being in the sample is a function of the explanatory variable.
- For example, are the police more likely to use violence against racial minorities?
- Most studies of this rely on data on what police do after making a stop.
- But the race of the person who might be stopped may influence whether the police stop them at all.
- Induces bias because sample is selected on a post-treatment or endogenous outcome.
Selecting on the Dependent Variable
- One version of an endogenous sample is selecting on the dependent variable.
- If we want to know what wins elections, should we look at just the winners, or all candidates?
- If we look at only winners and see all winners went to college, do we conclude a college degree is what wins elections?
- Did the losers also go to college?
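The college example can be checked with a toy data set. This is a minimal sketch with invented candidates; the helper `share_college` and every record are hypothetical.

```python
# Hypothetical candidates, invented for illustration: (name, went_to_college, won)
candidates = [
    ("A", True, True), ("B", True, True), ("C", True, True),
    ("D", True, False), ("E", True, False), ("F", True, False),
]

def share_college(records, won):
    """Share of candidates with a college degree among winners or losers."""
    group = [college for _, college, w in records if w == won]
    return sum(group) / len(group)

print("college share among winners:", share_college(candidates, True))
print("college share among losers: ", share_college(candidates, False))
```

Looking only at winners suggests that every winner went to college. Including the losers shows the same share in both groups, so in this toy data a degree does not distinguish winners from losers.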
Conditioning on Colliders
- Another version of an endogenous sample is when we induce an association by conditioning on a collider.
- Collider: a variable X influenced by two other variables, A and Y, that do not influence each other.
- No relationship between A and Y.
- But if we condition on X we unblock the path of association.
- Example: A is getting the flu, and Y is getting hit by a bus.
- Are A and Y related? No.
- Both might cause us to be in the hospital.
- Knowing that I have the flu doesn’t give me any information about whether or not I’ve been hit by a bus.
- But if we only look at people in the hospital, suddenly we get a negative association.
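The flu and bus example can be worked through exactly. This is a sketch under invented assumptions: flu and bus accidents are independent with made-up probabilities, and a person is in the hospital exactly when they have the flu or were hit by a bus. Real hospitalization is not this deterministic; the simplification just keeps the arithmetic transparent.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical probabilities, invented for illustration.
P_FLU = F(1, 10)   # A: getting the flu
P_BUS = F(1, 100)  # Y: getting hit by a bus, independent of A

def p_bus_given_flu(flu, hospital_only):
    """Pr[Y = 1 | A = flu], optionally restricted to people in the hospital."""
    num = den = F(0)
    for a, y in product((0, 1), repeat=2):
        if a != flu:
            continue
        in_hospital = (a == 1) or (y == 1)  # X: the collider (assumed rule)
        if hospital_only and not in_hospital:
            continue
        weight = (P_FLU if a else 1 - P_FLU) * (P_BUS if y else 1 - P_BUS)
        num += weight * y
        den += weight
    return num / den

print("whole population:",
      p_bus_given_flu(1, False), "vs", p_bus_given_flu(0, False))
print("hospital only:   ",
      p_bus_given_flu(1, True), "vs", p_bus_given_flu(0, True))
```

In the whole population the two conditional probabilities match, so A and Y are independent. Among hospital patients, those without the flu must have been hit by a bus under the assumed rule, which produces the negative association.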
Randomized Experiments in Political Science
- Randomization and Identification
- The fundamental problem of causal inference
- We cannot observe what happened (