Causal Inference, Confounding, and Randomized Experiments

  • Answerable social science questions must be falsifiable.
  • Must imply a relationship (often causal).
  • Must imply an explanatory variable and an outcome variable.
  • Must imply a comparison.
  • Example of a bad question: Should Joe Biden run for re-election?
  • A better question: Does the Democratic Party have a better chance of winning re-election if Joe Biden does not run?
  • An even better question: Is Joe Biden or Kamala Harris more likely to win the 2024 presidential election?
  • Further breaking down the question into components:
    • Joe Biden versus Kamala Harris is a counterfactual.
    • Contains questions about how candidates win elections:
      1. What is the incumbent effect for presidential candidates?
      2. Do voters reward younger candidates?
      3. Is there a gender penalty for political candidates?
      4. Does candidate race influence voters?
    • Each of these questions speaks to a broader phenomenon.
    • But each is also falsifiable and implies both a causal relationship and a comparison.
  • Example: Does flossing reduce cavity risk?
    • Factual: Prof. Brown flosses every night.
    • Counterfactual: Prof. Brown does not floss.
    • In what situation is Prof. Brown more likely to get a cavity?
  • Example: Does social media polarize voters?
    • Is the comparison whether using social media more or less, within the current equilibrium, would change polarization?
    • Or whether a society without social media would be less polarized (historical comparison)?
    • What is the unit of analysis (individuals, political societies, etc.)?
  • Guiding questions:
    • What is the appropriate counterfactual?
    • What is the ideal comparison to consider?
  • These questions guide:
    • Data gathering
    • Research design
    • Statistical estimation
    • Assumptions
    • Considering bias and uncertainty
  • Statistics and data show associations, not causality.

Thinking in Counterfactuals and Causality

  • Thinking in counterfactuals is the motto.
  • Counterfactuals cannot be observed directly.
  • Only differences across units, or within units over time, can be observed.
  • Answers to causal questions come from combining statistical estimation with design-based assumptions.
  • Associations can be misleading: an observed association must be decomposed into the causal effect and the part due to selection and confounding.

Confounding and Selection Bias

  • Confounding and selection bias explain why correlation is not causation and why determining causality is difficult but important.
  • Association:
    • Let Y_i and A_i be two variables.
    • These variables have a joint distribution Pr[Y, A].
    • Two variables are independent if one does not predict the other.
    • In terms of conditional probabilities: Pr[Y = 1|A = 1] = Pr[Y = 1|A = 0].
    • Independence is written as Y \perp A.
    • If not independent, they are dependent or associated:
    • Pr[Y = 1|A = 1] \neq Pr[Y = 1|A = 0].
  • Associations are not necessarily due to causation, and they may distort the true causal relationship.
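
The independence condition above can be checked directly from a joint distribution. The following is a minimal sketch, assuming two made-up joint distributions over binary Y and A; the helper names `conditional` and `is_independent` are hypothetical and not part of the notes.

```python
from fractions import Fraction

def conditional(joint, a):
    """Return Pr[Y = 1 | A = a] from a joint distribution keyed by (y, a)."""
    pr_a = joint[(0, a)] + joint[(1, a)]
    return joint[(1, a)] / pr_a

def is_independent(joint):
    """Y and A are independent if A does not change the probability that Y = 1."""
    return conditional(joint, 1) == conditional(joint, 0)

# Hypothetical joint distributions Pr[Y = y, A = a]; the numbers are illustrative only.
independent_joint = {
    (1, 1): Fraction(1, 10), (0, 1): Fraction(3, 10),
    (1, 0): Fraction(3, 20), (0, 0): Fraction(9, 20),
}
associated_joint = {
    (1, 1): Fraction(3, 10), (0, 1): Fraction(1, 10),
    (1, 0): Fraction(3, 20), (0, 0): Fraction(9, 20),
}

print(is_independent(independent_joint))  # Pr[Y=1|A=1] equals Pr[Y=1|A=0]
print(is_independent(associated_joint))   # the two conditional probabilities differ
```

Exact fractions are used so that the equality check is not affected by floating-point rounding.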

Examples of Associations and Causation

  • Example: High correlation between your grade and the grades of the people you work with on homework:
    • Is that because working together improves both your grades?
    • Or because you both chose to work with a smart person?
    • Probably a bit of both.
  • Place effects or residential selection:
    • Population density highly correlated with Democratic voting.
    • Some places (e.g., Massachusetts cities) are very liberal.
    • Others (e.g., Texas, rural areas) are very conservative.
  • Why?
    • Causal: Living in cities makes people more liberal.
    • Selection: Liberal people choose to live in cities.
  • Association of density and Democratic vote is a function of both selection and causal effects.
    • Selection can be really strong.
    • The causal effect is often smaller or even in the opposite direction.
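
One way to see how an association mixes selection and a causal effect is to compute both pieces in a toy population. This is a sketch under stated assumptions: every number is hypothetical, people come in only two types, and city living is assumed to shift everyone's probability of voting Democratic by the same small amount.

```python
from fractions import Fraction as F

# All numbers below are hypothetical, chosen only for illustration.
SHARE_LIBERAL = F(1, 2)                                     # share with liberal predispositions
P_CITY = {"liberal": F(4, 5), "conservative": F(1, 5)}      # Pr[lives in city | type]
BASE_DEM = {"liberal": F(7, 10), "conservative": F(3, 10)}  # Pr[votes Dem | type] if rural
CITY_EFFECT = F(1, 20)                                      # assumed causal effect of city living

def type_shares(in_city):
    """Pr[type | location], by Bayes' rule."""
    weights = {}
    for kind, share in (("liberal", SHARE_LIBERAL), ("conservative", 1 - SHARE_LIBERAL)):
        p_here = P_CITY[kind] if in_city else 1 - P_CITY[kind]
        weights[kind] = share * p_here
    total = sum(weights.values())
    return {kind: w / total for kind, w in weights.items()}

def mean_baseline(in_city):
    """Average Dem vote probability the group would have if it lived in a rural area."""
    shares = type_shares(in_city)
    return sum(shares[kind] * BASE_DEM[kind] for kind in shares)

observed_city = mean_baseline(True) + CITY_EFFECT   # city residents receive the causal effect
observed_rural = mean_baseline(False)

association = observed_city - observed_rural             # what the raw data show
selection = mean_baseline(True) - mean_baseline(False)   # baseline difference between groups

print(association, CITY_EFFECT, selection)
```

In this constructed example the observed association equals the causal effect plus the selection term, and selection accounts for most of it.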

Selection Bias

  • Selection is when units are more likely to be in the treated or control category in a manner that is associated with the outcome variable.
    • Can be direct (selection based on outcome).
    • Or indirect (i.e., selection based on another variable).
  • Many kinds of selection biases:
    • Selection can occur naturally:
      • Confounding variables.
      • Reverse causality.
    • Or can be induced in the data-generating process:
      • Endogenous sample.
      • Conditioning on a collider.

Reading: Civic Duty and Voting

  • Civic duty drives voting (Gerber, Green, Larimer).
  • Test the hypothesis that a sense of civic duty drives voting.
  • Naive design: sample people and ask them whether they feel a sense of civic duty to vote.
  • Compare average turnout among people with high vs. low civic duty.
  • Problem? A confounding variable: another variable could influence both civic duty and voting.
  • This creates a spurious correlation: the association appears causal but is not.

Confounders

  • A confounding variable:
    1. Affects treatment status.
    2. Affects the outcome over and above its effect on treatment status.
  • Confounders create baseline differences and thus bias.
  • Anything we haven’t measured in our estimation could be a confounder.
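
A small simulation can illustrate how a confounder produces a spurious association. This is a sketch with made-up probabilities: the confounder `z` is a hypothetical stand-in (not a variable from the reading), and civic duty is assumed to have no effect on voting at all.

```python
import random

def simulate(n=100_000, seed=0):
    """Generate (confounder, civic duty, vote) triples under hypothetical probabilities."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z = rng.random() < 0.5                      # hypothetical confounder
        duty = rng.random() < (0.8 if z else 0.2)   # confounder raises civic duty
        vote = rng.random() < (0.7 if z else 0.3)   # confounder raises voting; duty has NO effect
        rows.append((z, duty, vote))
    return rows

def turnout_gap(rows):
    """Turnout among high-duty respondents minus turnout among low-duty respondents."""
    high = [vote for _, duty, vote in rows if duty]
    low = [vote for _, duty, vote in rows if not duty]
    return sum(high) / len(high) - sum(low) / len(low)

rows = simulate()

# Naive comparison: ignores the confounder entirely.
naive_gap = turnout_gap(rows)

# Compare like with like: compute the gap within each level of the confounder, then average.
within_gaps = [turnout_gap([r for r in rows if r[0] == level]) for level in (True, False)]
adjusted_gap = sum(within_gaps) / len(within_gaps)

print(round(naive_gap, 3), round(adjusted_gap, 3))
```

The naive gap is large even though the true effect is zero, while the within-stratum gap is close to zero. Stratifying only works here because the confounder is measured, which is exactly what cannot be guaranteed in observational data.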

Reverse Causality

  • Reverse causality: the outcome affects treatment status.
    • Association between X and Y is in part due to Y’s effect on X.
    • Effect of X on Y is obscured.
  • Examples:
    • Does toxic political discourse on social media make us polarized?
    • Or do we make political discourse on social media toxic because we are polarized?
  • Confounding and reverse causality are similar.
  • Does campaign spending win elections?
    • Incumbents who spend more have lower vote share.
    • Reverse causality: lower expected vote share → more spending.
    • Confounding: the incumbent is unpopular on the issues, so spends more (to try to persuade voters) and also has a lower vote share.
    • Both categories describe the same problem:
    • What drives spending is either the low vote share itself or the confounders that lead to the low vote share.

Endogenous Sample

  • Another reason to be suspicious of associations: they are often estimated from data where being in the sample is a function of the explanatory variable.
  • For example, are the police more likely to use violence against racial minorities?
    • Most studies of this rely on data on what police do after making a stop.
    • But a person's race may influence whether the police stop them at all.
    • This induces bias because the sample is selected on a post-treatment (endogenous) outcome.

Selecting on the Dependent Variable

  • One version of an endogenous sample is selecting on the dependent variable.
  • If we want to know what wins elections, should we look at just the winners, or all candidates?
  • If we look at only winners and see all winners went to college, do we conclude a college degree is what wins elections?
  • Did the losers also go to college?
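
The point about looking only at winners can be made concrete with a toy data set. This is a minimal sketch, assuming a hypothetical pool of 100 candidates who all went to college, 40 of whom won.

```python
# Hypothetical data: every candidate went to college, and 40 of 100 won.
candidates = [{"college": True, "won": i < 40} for i in range(100)]

def college_share(group):
    """Share of candidates in the group who went to college."""
    return sum(c["college"] for c in group) / len(group)

winners = [c for c in candidates if c["won"]]
losers = [c for c in candidates if not c["won"]]

# Looking only at winners suggests college "explains" winning...
share_winners = college_share(winners)
# ...but the losers have the same college share, so it cannot distinguish the groups.
share_losers = college_share(losers)

print(share_winners, share_losers)
```

Only the comparison with the losers reveals that college attendance does not vary with the outcome in this example.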

Conditioning on Colliders

  • Another version of an endogenous sample is when we induce an association by conditioning on a collider.
    • Collider: a variable influenced by two other variables that do not influence each other.
    • Suppose A and Y both influence X (the collider), and there is no relationship between A and Y.
    • But if we condition on X, we unblock the path of association between A and Y.
  • Example: A is getting the flu, and Y is getting hit by a bus.
    • Are A and Y related? No.
    • Both might cause us to be in the hospital (the collider, X).
    • Knowing that I have the flu doesn’t give me any information about whether or not I’ve been hit by a bus.
    • But if we only look at people in the hospital, suddenly we get a negative association: a patient without the flu must be there for some other reason, such as being hit by a bus.
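
The flu and bus example can be sketched as a simulation. All probabilities below are hypothetical and deliberately inflated so that the pattern is visible in a modest sample; flu and bus accidents are generated independently, and either one can put a person in the hospital.

```python
import random

def simulate(n=200_000, seed=1):
    """Generate (flu, bus, hospital) triples; flu and bus are independent by construction."""
    rng = random.Random(seed)
    people = []
    for _ in range(n):
        flu = rng.random() < 0.20   # hypothetical flu rate
        bus = rng.random() < 0.05   # hypothetical (inflated) bus-accident rate
        hospital = (
            (flu and rng.random() < 0.30)    # flu sometimes leads to hospitalization
            or (bus and rng.random() < 0.90) # bus accidents usually do
            or rng.random() < 0.02           # background hospitalization for other reasons
        )
        people.append((flu, bus, hospital))
    return people

def bus_rate_gap(people):
    """Pr[bus | flu] minus Pr[bus | no flu] within the given group."""
    with_flu = [bus for flu, bus, _ in people if flu]
    without_flu = [bus for flu, bus, _ in people if not flu]
    return sum(with_flu) / len(with_flu) - sum(without_flu) / len(without_flu)

people = simulate()

gap_everyone = bus_rate_gap(people)                        # whole population
gap_hospital = bus_rate_gap([p for p in people if p[2]])   # conditioning on the collider

print(round(gap_everyone, 3), round(gap_hospital, 3))
```

In the full population the gap is close to zero, as expected for independent events. Restricting to hospitalized people produces a strongly negative gap, even though nothing in the data-generating process links flu to bus accidents.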

Randomized Experiments in Political Science

  • Randomization and Identification
  • The fundamental problem of causal inference
    • We cannot observe what happened (