9/26: SOCI 252 - Linear Regression: From Prediction to Explanation and Causal Inference

Engineered Materials and Structural Design

Some engineered structures use materials that stay extremely strong under normal conditions but are designed to collapse under one specific kind of impact.
- Example: Street signs are designed to crumple around a car upon impact to mitigate damage to the vehicle and its occupants.

Course Update and Introduction to Explanatory Analysis

Midterm Follow-up: A brief 'baseball test' was offered instead of a quiz, serving as a focus opportunity.
Previous Topics: On Monday and Wednesday, the class discussed linear regression as a tool for prediction.
Current Focus (Chapter 5): The course is now shifting to linear regression as a tool for explanation, focusing on understanding possible effects within observational data rather than experimental data.
- This means using the same tool (linear regression) for a slightly different purpose: explaining the effect of one variable on another.

Causation and Counterfactuals

Counterfactuals: It is impossible to observe the counterfactual for an individual (what would have happened to a treated person had they not received the treatment, or to an untreated person had they received it).
- To infer causation, groups assumed to be similar are compared, with the only difference being the presence or absence of a 'treatment'.
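
In potential-outcomes notation (an assumption here; these symbols are not from the lecture), write $Y_i(1)$ for individual $i$'s outcome with the treatment and $Y_i(0)$ for the outcome without it. A sketch of the quantities involved:

$$
\tau_i = Y_i(1) - Y_i(0), \qquad \text{ATE} = \frac{1}{n}\sum_{i=1}^{n}\bigl[Y_i(1) - Y_i(0)\bigr]
$$

Only one of $Y_i(1)$ and $Y_i(0)$ is ever observed for a given individual, so $\tau_i$ cannot be computed directly. The average effect can still be estimated by comparing groups, provided the groups are comparable.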

Randomized Experiments

Core Principle: Rely on random assignment of treatment to ensure groups are comparable on average, in terms of both observed and unobserved pretreatment characteristics.
Estimating Treatment Effect: In randomized experiments, the average treatment effect can be simply estimated by calculating the difference between the means of the treatment and control groups.
- This difference in means provides a valid estimate because randomization ensures the groups are comparable.
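
A minimal simulation sketch of this idea, using invented numbers rather than any data from the course. The variable `trait`, the true effect of 2.0, and the sample size are all assumptions made for illustration. Because assignment is a coin flip that ignores `trait`, the simple difference in means should land close to the true effect even though `trait` is never measured.

```python
import random
from statistics import mean

def simulate_experiment(n=10_000, true_effect=2.0, seed=0):
    """Simulate a randomized experiment with an unobserved pretreatment trait."""
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n):
        trait = rng.gauss(0, 1)        # unobserved pretreatment characteristic
        assigned = rng.random() < 0.5  # coin-flip assignment, independent of trait
        outcome = 5.0 + 3.0 * trait + (true_effect if assigned else 0.0) + rng.gauss(0, 1)
        (treated if assigned else control).append(outcome)
    # Difference-in-means estimator of the average treatment effect
    return mean(treated) - mean(control)

estimate = simulate_experiment()
print(round(estimate, 2))  # should be close to the assumed true effect of 2.0
```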

Observational Data

Nature: Most real-world data is observational, not experimental, making it much 'messier'.
Challenges of Experiments: Randomized experiments are often difficult, unethical, or impossible to conduct (e.g., studying life course outcomes, where one cannot control every facet of an individual's life).
Definition: Observational data is collected from naturally occurring events (e.g., surveys, direct observation, physical measurements like temperature).
Lack of Control: Unlike in experiments, the researcher in an observational study does not assign subjects to treatment and control conditions. Therefore, one cannot assume the groups are comparable.
Metaphor:
- Experimental Data: Like two goldfish in separate aquariums, where every aspect (water temperature, environment, food) can be meticulously controlled, allowing for isolation of a single variable's effect.
- Observational Data: Like looking at the ocean, vast with many fish and complex variables that cannot be controlled or isolated for each individual subject. Data is richer, but control is lost.

Confounding Variables (Confounders)

Definition: In observational data, the treatment and control groups are generally not comparable because they differ in relevant ways. The variables behind these differences are called confounding variables or confounders.
- A confounding variable ($Z$) affects both the likelihood of receiving the treatment ($X$) and the outcome ($Y$).
- Diagrammatic Representation: $Z \rightarrow X$ and $Z \rightarrow Y$, alongside the possible effect $X \rightarrow Y$. This means the observed relationship between $X$ and $Y$ might be partly or entirely due to $Z$. (This is a simplified representation of the causal diagram discussed.)
Importance: These variables must be identified and controlled for to make groups comparable and draw valid conclusions.
Examples:
- Private Schools and Test Scores: Observing that private school students have higher test scores than public school students. A confounder could be parents' income (class).
  - Parents' income affects both the likelihood of attending private school and potentially influences test scores (e.g., through resources, tutoring, home environment).
  - Without controlling for income, one might mistakenly attribute better test scores directly to private schooling rather than the underlying socioeconomic factor.
- Ice Cream Sales and Drowning Deaths: A monthly plot might show ice cream sales increasing in April/May, followed by an increase in drowning deaths in May/June, suggesting a lagged causal relationship.
  - This is a classic example where correlation does not imply causation.
  - The confounder is average daily temperature (heat). Hot weather (summer) leads to more ice cream sales and more people swimming.
  - More people swimming, by virtue of numbers, leads to more drowning deaths, regardless of ice cream consumption.
  - To test this, one would control for average monthly temperature in a linear regression model.
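
The ice cream example can be sketched with simulated data. Everything below is hypothetical: the coefficients, the noise levels, and the variable names are invented, and in this simulation ice cream sales have no effect on drownings at all. Controlling for temperature is done here by residualization (the Frisch-Waugh-Lovell approach), which gives the same coefficient on ice cream sales as a multiple regression of drownings on ice cream sales and temperature.

```python
import random
from statistics import mean

def slope(x, y):
    """OLS slope from regressing y on x (with an intercept)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def residuals(x, y):
    """Residuals from regressing y on x (with an intercept)."""
    b = slope(x, y)
    a = mean(y) - b * mean(x)
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]

rng = random.Random(1)
n = 5_000
temp = [rng.uniform(0, 35) for _ in range(n)]                # confounder Z
ice_cream = [20 + 3.0 * t + rng.gauss(0, 10) for t in temp]  # X depends on Z only
drownings = [1 + 0.2 * t + rng.gauss(0, 2) for t in temp]    # Y depends on Z only

# Naive regression of Y on X: picks up the shared dependence on temperature.
naive = slope(ice_cream, drownings)

# Controlling for Z: remove temperature from both X and Y, then regress
# what is left of Y on what is left of X.
adjusted = slope(residuals(temp, ice_cream), residuals(temp, drownings))

print(f"naive slope:    {naive:.3f}")     # clearly positive
print(f"adjusted slope: {adjusted:.3f}")  # near zero
```

In this sketch the naive slope is positive even though the simulated effect is zero, and the slope shrinks toward zero once temperature is held fixed.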

Addressing Confounders

Control for Confounders: In the presence of confounders, simply looking at the difference in means between groups (as in experiments) is inadequate for a valid causal estimate.
Why Not a Concern in Randomized Experiments? Randomization breaks the link between potential confounders and treatment assignment.
- Hypothetical Example: To study the causal effect of private schools, a lottery system for admission would randomly assign students to private or public schools, thus breaking the link between familial wealth and private school attendance. This makes the groups comparable.
Requirement in Observational Data: Because random assignment is absent, all relevant confounding variables must be identified and controlled for in statistical models (e.g., using multiple linear regression).
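
A small simulation sketch of the lottery idea, with invented numbers. The income distribution and the rule linking income to private school attendance are assumptions for illustration only. Under self-selection, families in private school are richer on average; under a lottery, the income gap between the two groups should be close to zero.

```python
import random
from statistics import mean

rng = random.Random(2)
n = 10_000
income = [rng.gauss(60, 15) for _ in range(n)]  # hypothetical family income, in $1000s

# Self-selection: the chance of attending private school rises with income.
self_selected = [rng.random() < min(max((inc - 30) / 60, 0.0), 1.0) for inc in income]

# Lottery: a coin flip decides attendance, ignoring income.
lottery = [rng.random() < 0.5 for _ in income]

def income_gap(private):
    """Mean income of private-school families minus public-school families."""
    priv = [inc for inc, p in zip(income, private) if p]
    pub = [inc for inc, p in zip(income, private) if not p]
    return mean(priv) - mean(pub)

gap_self = income_gap(self_selected)
gap_lottery = income_gap(lottery)

print(f"income gap under self-selection: {gap_self:.1f}")
print(f"income gap under lottery:        {gap_lottery:.1f}")
```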

Linear Regression for Experimental Data

Versatility: Linear regression is a highly versatile tool for working with data.
Equivalence to Difference-in-Means: For a binary treatment variable in a randomized experiment, fitting a simple linear regression model where the outcome variable ($Y$) is regressed on the treatment variable ($X$) yields a slope coefficient ($\beta_1$, estimated as 'beta hat', $\hat{\beta}$) that is equivalent to the difference-in-means estimator.
- Example (Social Pressure Experiment):
  - The 'message' variable (yes/no) was recoded to a binary 'pressure' variable ( $0$ for no, $1$ for yes).
  - Difference-in-means estimator: Approximately $8$ percentage points.
  - Linear regression model: lm(voted ~ pressure)
  - Resulting equation: $\text{Voted} = 0.29664 + 0.08131 \times \text{Pressure}$.
  - The coefficient for pressure ( $\hat{\beta} = 0.08131$ ) is numerically the same as the difference-in-means estimator.
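
The equivalence can be checked on a tiny example. The ten observations below are invented for illustration and are not the experiment's data; the helper `ols` is a hand-written simple regression, not a library call.

```python
from statistics import mean

# Toy data, invented for illustration (not the social pressure experiment).
pressure = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
voted = [0, 1, 0, 0, 1, 1, 1, 0, 1, 1]

def ols(x, y):
    """Intercept and slope from a simple linear regression of y on x."""
    mx, my = mean(x), mean(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return my - slope * mx, slope

intercept, beta_hat = ols(pressure, voted)

treated_mean = mean([v for p, v in zip(pressure, voted) if p == 1])
control_mean = mean([v for p, v in zip(pressure, voted) if p == 0])
diff_in_means = treated_mean - control_mean

print(intercept, beta_hat, diff_in_means)
```

With a binary $X$, the fitted intercept equals the control group mean and the slope equals the treated mean minus the control mean, so the two estimators agree up to floating-point rounding.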
Interpretation of $\hat{\beta}$ (When X is a Treatment and Y is the Outcome):
- It represents the estimated average change in the outcome when the treatment status changes from the control group ($X=0$) to the treatment group ($X=1$). For a binary outcome, this is the change in the proportion (probability) of units with $Y=1$.
- Units: If the outcome ($Y$) is binary (e.g., voted/not voted), $\hat{\beta}$ is interpreted in percentage points after multiplying by $100$.
- Example Interpretation: Receiving the social pressure message (changing pressure from $0$ to $1$ ) is associated with an $8$ percentage point predicted increase in the probability of voting.
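
The arithmetic behind this interpretation, worked from the fitted equation reported above (the hat notation for predicted values is an addition here):

$$
\begin{aligned}
\widehat{\text{Voted}}\,(\text{Pressure}=0) &= 0.29664 \\
\widehat{\text{Voted}}\,(\text{Pressure}=1) &= 0.29664 + 0.08131 = 0.37795 \\
\text{difference} &= 0.37795 - 0.29664 = 0.08131 \approx 8 \text{ percentage points}
\end{aligned}
$$

The predicted probability of voting moves from about 29.7% to about 37.8%. That is an increase of about 8 percentage points, not 8 percent; relative to the baseline it is roughly a 27% increase ($0.08131 / 0.29664 \approx 0.27$).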
Causal vs. Predictive Language:
- Predictive Language: Uses phrases like