Causal Inference, Confounding, and Randomized Experiments

Answerable social science questions must be falsifiable.
Must imply a relationship (often causal).
Must imply an explanatory variable and an outcome variable.
Must imply a comparison.
Example of a bad question: Should Joe Biden run for re-election?
A better question: Does the Democratic Party have a better chance of winning re-election if Joe Biden does not run?
An even better question: Is Joe Biden or Kamala Harris more likely to win the 2024 presidential election?
Further breaking down the question into components:
- Joe Biden versus Kamala Harris is a counterfactual.
- Contains questions about how candidates win elections:
1. What is the incumbent effect for presidential candidates?
2. Do voters reward younger candidates?
3. Is there a gender penalty for political candidates?
4. Does candidate race influence voters?
- Each of these questions speaks to a broader phenomenon.
- But each is falsifiable and implies both a causal relationship and a comparison.
Example: Does flossing reduce cavity risk?
- Factual: Prof. Brown flosses every night.
- Counterfactual: Prof. Brown does not floss.
- In what situation is Prof. Brown more likely to get a cavity?
Example: Does social media polarize voters?
- Is the comparison about whether using social media more or less in the current equilibrium would reduce polarization?
- Or whether a society without social media would be less polarized (historical comparison)?
- What is the unit of analysis (individuals, political societies, etc.)?
Guiding questions:
- What is the appropriate counterfactual?
- What is the ideal comparison to consider?
These questions guide:
- Data gathering
- Research design
- Statistical estimation
- Assumptions
- Considering bias and uncertainty
Statistics and data show associations, not causality.

Thinking in Counterfactuals and Causality

Thinking in counterfactuals is the motto.
Counterfactuals cannot be observed directly.
Only differences across units or within-units across time can be observed.
Answers to causal questions come from combining statistical estimation with design-based assumptions.
Associations can be misleading: an observed association mixes any causal effect with selection and confounding, so it must be decomposed into those parts.

Confounding and Selection Bias

Confounding and selection bias explain why correlation is not causation and why determining causality is difficult but important.
Association:
- Let Y_i and A_i be two variables measured on unit i.
- These variables have a joint distribution Pr[Y, A].
- Two variables are independent if one does not predict the other.
- In terms of conditional probabilities, independence means: Pr[Y = 1|A = 1] = Pr[Y = 1|A = 0].
- Independence is written as Y \perp A.
- If not independent, they are dependent or associated:
- Pr[Y = 1|A = 1] \neq Pr[Y = 1|A = 0].
Associations are not necessarily due to causation, and even when a causal effect exists they may distort the true causal relationship.
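The definition of association above can be checked numerically. The following is a minimal sketch on simulated data, not real data: the sample size, the probabilities 0.3, 0.5, and 0.2, and the helper name cond_prob are all assumptions made for illustration. It estimates Pr[Y = 1|A = 1] and Pr[Y = 1|A = 0] in one independent and one dependent case.

```python
import random

random.seed(0)


def cond_prob(pairs, a_value):
    """Estimate Pr[Y = 1 | A = a_value] from a list of (a, y) pairs."""
    ys = [y for a, y in pairs if a == a_value]
    return sum(ys) / len(ys)


n = 20_000  # assumed sample size

# Independent case: Y is drawn without looking at A.
independent = [(random.randint(0, 1), int(random.random() < 0.3)) for _ in range(n)]

# Dependent case: Pr[Y = 1] is 0.5 when A = 1 and 0.2 when A = 0.
dependent = []
for _ in range(n):
    a = random.randint(0, 1)
    y = int(random.random() < (0.5 if a == 1 else 0.2))
    dependent.append((a, y))

# Difference Pr[Y = 1 | A = 1] - Pr[Y = 1 | A = 0] in each case.
gap_independent = cond_prob(independent, 1) - cond_prob(independent, 0)
gap_dependent = cond_prob(dependent, 1) - cond_prob(dependent, 0)
print(round(gap_independent, 3), round(gap_dependent, 3))
```

In the independent case the gap should be close to zero, and in the dependent case it should be close to 0.3. Note that a nonzero gap only establishes association; it says nothing yet about why the two variables move together.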

Examples of Associations and Causation

Example: High correlation between your grade and the grade of the people you work with on homework:
- Is that because working together improves both your grades?
- Or because you both chose to work with a smart person?
- Probably a bit of both.
Place effects or residential selection:
- Population density is highly correlated with Democratic voting.
- Some places (e.g., Massachusetts cities) are very liberal.
- Others (e.g., Texas, rural areas) are very conservative.
Why?
- Causal: Living in cities makes people more liberal.
- Selection: Liberal people choose to live in cities.
Association of density and Democratic vote is a function of both selection and causal effects.
- Selection can be really strong.
- The causal effect is often smaller or even in the opposite direction.
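A small simulation can illustrate how selection and a causal effect combine into one observed association. This is a hedged sketch with invented numbers: the assumed causal effect of city living (0.05), the sorting probabilities (0.8 and 0.2), and the baseline vote propensities (0.7 and 0.3) are hypothetical and not estimates from any study.

```python
import random

random.seed(1)

TRUE_EFFECT = 0.05  # assumed causal effect of city living on Pr[vote Democratic]
n = 50_000          # assumed number of simulated residents

city_votes, rural_votes = [], []
for _ in range(n):
    liberal = random.random() < 0.5                        # pre-existing ideology
    in_city = random.random() < (0.8 if liberal else 0.2)  # selection: liberals sort into cities
    base = 0.7 if liberal else 0.3                         # vote propensity absent any city effect
    vote = int(random.random() < base + (TRUE_EFFECT if in_city else 0.0))
    (city_votes if in_city else rural_votes).append(vote)

# Observed association: city Democratic share minus rural Democratic share.
naive_gap = sum(city_votes) / len(city_votes) - sum(rural_votes) / len(rural_votes)

# With a constant effect, whatever is left after removing it is due to selection.
selection_part = naive_gap - TRUE_EFFECT
print(round(naive_gap, 3), round(selection_part, 3))
```

Under these assumptions the observed city/rural gap is several times larger than the causal effect built into the simulation, which is the sense in which selection "can be really strong".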

Selection Bias

Selection is when units are more likely to be in the treated or control category in a manner that is associated with the outcome variable.
- Can be direct (selection based on outcome).
- Or indirect (i.e., selection based on another variable).
Many kinds of selection biases:
- Selection can occur naturally:
- Confounding variables.
- Reverse causality.
- Or can be induced in data generating process:
- Endogenous sample.
- Conditioning on a collider.
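One common way to formalize selection bias uses potential-outcomes notation. That notation is an assumption here, since these notes have not introduced it: Y_i(1) and Y_i(0) denote unit i's outcome with and without treatment, and A_i is the observed treatment status. A sketch of the decomposition of the observed difference in means:

```latex
\begin{aligned}
E[Y_i \mid A_i = 1] - E[Y_i \mid A_i = 0]
  &= E[Y_i(1) \mid A_i = 1] - E[Y_i(0) \mid A_i = 0] \\
  &= \underbrace{E[Y_i(1) - Y_i(0) \mid A_i = 1]}_{\text{causal effect on the treated}}
   + \underbrace{E[Y_i(0) \mid A_i = 1] - E[Y_i(0) \mid A_i = 0]}_{\text{selection bias}}
\end{aligned}
```

The second line adds and subtracts E[Y_i(0) | A_i = 1]. The selection-bias term is zero only when treated and control units would have had the same average outcome in the absence of treatment.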

Reading: Civic Duty and Voting

Civic duty drives voting (Gerber, Green, Larimer).
Test the hypothesis that a sense of civic duty drives voting.
Naive design: sample people and ask them if they feel a sense of civic duty to vote.
Compare the average level of voting among people with high vs. low civic duty.
Problem? Confounding variable - another variable could influence both civic duty and voting.
Creates a spurious correlation - appears to be causal but is not.

Confounders

A confounding variable:
1. Affects treatment status.
2. Affects the outcome over and above its effect on treatment status.
Confounders create baseline differences and thus bias.
Anything we haven’t measured in our estimation could be a confounder.
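The two conditions above can be illustrated with simulated data. This is a sketch under invented assumptions: z is a hypothetical, unnamed confounder, the probabilities are arbitrary, and civic duty is given no effect on voting at all. The simulated association between duty and voting is therefore entirely spurious.

```python
import random

random.seed(2)

n = 40_000  # assumed sample size
rows = []
for _ in range(n):
    z = random.random() < 0.5                     # hypothetical confounder
    duty = random.random() < (0.7 if z else 0.3)  # 1. confounder affects treatment status
    vote = random.random() < (0.6 if z else 0.2)  # 2. confounder affects outcome; duty has NO effect
    rows.append((z, duty, int(vote)))


def turnout_gap(data):
    """Turnout among high-duty respondents minus turnout among low-duty respondents."""
    high = [v for _, d, v in data if d]
    low = [v for _, d, v in data if not d]
    return sum(high) / len(high) - sum(low) / len(low)


naive_gap = turnout_gap(rows)                               # ignores the confounder
gap_within_z1 = turnout_gap([r for r in rows if r[0]])      # compare only units with z true
gap_within_z0 = turnout_gap([r for r in rows if not r[0]])  # compare only units with z false
print(round(naive_gap, 3), round(gap_within_z1, 3), round(gap_within_z0, 3))
```

The naive comparison shows a clear turnout gap, while the gaps within each level of z are near zero. Stratifying like this only works when the confounder has been measured, which is why unmeasured variables remain a concern.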

Reverse Causality

Reverse Causality: outcome affects treatment status.
- Association between X and Y is in part due to Y’s effect on X.
- Effect of X on Y is obscured.
Examples:
- Does toxic political discourse on social media make us polarized?
- Or do we make political discourse on social media toxic because we are polarized?
Confounding and reverse causality are similar.
Does campaign spending win elections?
- Incumbents who spend more have lower vote share.
- Reverse causality: lower vote share → spending.
- Confounding: Incumbent unpopular on issues so spends more (to try to persuade voters) and has lower vote share.
- Both categories describe the same problem:
- What drives the extra spending is either the lower (expected) vote share itself or the confounders that lead to lower vote share.

Endogenous Sample

Another reason to be suspicious of associations is that they are often estimated from data where being in the sample is a function of the explanatory variable.
For example, are the police more likely to use violence against racial minorities?
- Most studies of this rely on data on what police do after making a stop.
- But the race of the person who might be stopped may influence whether the police stop them at all.
- Induces bias because sample is selected on a post-treatment or endogenous outcome.

Selecting on the Dependent Variable

One version of an endogenous sample is selecting on the dependent variable.
If we want to know what wins elections, should we look at just the winners, or all candidates?
If we look at only winners and see all winners went to college, do we conclude a college degree is what wins elections?
Did the losers also go to college?
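The problem can be sketched with simulated candidates. The numbers are invented: the simulation assumes that 95% of all candidates hold a college degree and that winning is unrelated to having one.

```python
import random

random.seed(3)

n = 10_000  # assumed number of simulated candidates
candidates = []
for _ in range(n):
    college = random.random() < 0.95  # assumed: nearly all candidates hold degrees
    won = random.random() < 0.5       # winning is unrelated to college in this sketch
    candidates.append((college, won))


def college_share(group):
    """Share of candidates in the group who hold a college degree."""
    return sum(c for c, _ in group) / len(group)


winners = [c for c in candidates if c[1]]
losers = [c for c in candidates if not c[1]]

share_winners = college_share(winners)
share_losers = college_share(losers)
print(round(share_winners, 3), round(share_losers, 3))
```

Looking only at winners, nearly all of them went to college, which might suggest that a degree wins elections. Including the losers shows roughly the same share, so in this simulated world the degree explains nothing about winning.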

Conditioning on Colliders

Another version of an endogenous sample is when we induce an association by conditioning on a collider.
- Collider: a variable X that is influenced by two other variables, A and Y, which do not influence each other.
- There is no relationship between A and Y.
- But if we condition on X, we unblock the path of association between them.
Example: A is getting the flu, and Y is getting hit by a bus.
- Are A and Y related? No.
- Both might cause us to be in the hospital (the collider X).
- Knowing that I have the flu doesn’t give me any information about whether or not I’ve been hit by a bus.
- But if we only look at people in the hospital, suddenly we get a negative association.
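The flu and bus example can be simulated directly. This is a sketch with made-up rates: the chances of flu (10%), of being hit by a bus (1%), and of hospitalization for each cause are assumptions chosen only to make the pattern visible.

```python
import random

random.seed(4)

n = 200_000  # assumed population size
people = []
for _ in range(n):
    flu = random.random() < 0.10  # A: getting the flu
    bus = random.random() < 0.01  # Y: getting hit by a bus, drawn independently of flu
    # X (collider): hospitalized if hit by a bus, sometimes because of flu, rarely otherwise.
    hospital = bus or (flu and random.random() < 0.3) or random.random() < 0.01
    people.append((flu, bus, hospital))


def bus_rate_gap(group):
    """Pr[bus | flu] minus Pr[bus | no flu] within the group."""
    with_flu = [b for f, b, _ in group if f]
    without_flu = [b for f, b, _ in group if not f]
    return sum(with_flu) / len(with_flu) - sum(without_flu) / len(without_flu)


gap_everyone = bus_rate_gap(people)                           # whole population
gap_hospital = bus_rate_gap([p for p in people if p[2]])      # conditioning on the collider
print(round(gap_everyone, 3), round(gap_hospital, 3))
```

In the full population the gap is near zero, as expected for independent events. Among hospitalized people it is strongly negative: a patient without the flu must be there for some other reason, and being hit by a bus is one of the few candidates.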

Randomized Experiments in Political Science

Randomization and Identification
The fundamental problem of causal inference
- We cannot observe what happened (