Non-Linear Associations in OLS Models and Linear Probability Models

Introduction to Multicollinearity

Definition of Multicollinearity: Multicollinearity occurs when two or more independent variables in a regression model are highly correlated, meaning they move together in a linear fashion.
Consequences: This high correlation can severely complicate the accurate estimation of individual coefficient effects in regression models:
- If independent variables are perfectly correlated (e.g., correlation coefficient of $r = \pm 1$ ), knowing the value of one variable perfectly predicts the value of the other. In such cases, the software cannot compute unique estimates for their effects simultaneously, rendering the model inestimable if both are included. One of the perfectly correlated variables must be excluded.
- In social sciences, perfect correlation often indicates that the variables are essentially measuring the same underlying concept but using different scales or slightly different operationalizations.
Strong Correlation: Even if the correlation is not perfect but merely strong (e.g., $|r| > 0.8$ ), highly correlated variables can lead to several adverse outcomes:
- Inflated Standard Errors: Standard errors for the affected coefficients become larger than they would be if the variables were less correlated. This is because it becomes difficult for the model to isolate the unique contribution of each variable.
- Reduced Statistical Significance: Inflated standard errors make it more difficult to achieve statistical significance (i.e., obtain small p-values), even if a true and meaningful association between the independent variable and the dependent variable exists. This increases the risk of Type II errors (failing to detect a true effect).
- Unstable Coefficient Estimates: The estimated coefficients can become highly sensitive to small changes in the model specification or the data. Adding or removing even a few observations can cause coefficients to shift dramatically, making them unreliable and difficult to interpret.
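
For the two-predictor case, the source of the inflation can be written out directly. The expression below is the standard textbook formula for the sampling variance of an OLS slope, added here as a supplement; $\sigma^2$ (the error variance) and $r_{12}$ (the sample correlation between $X_1$ and $X_2$) are notation introduced for this sketch:

$$\operatorname{Var}(\hat{\beta}_1) = \frac{\sigma^2}{\left(1 - r_{12}^2\right) \sum_i \left(X_{1i} - \bar{X}_1\right)^2}$$

As $r_{12}^2$ approaches 1, the denominator shrinks toward zero and the variance grows without bound. At $r_{12} = \pm 1$ the expression is undefined, which corresponds to the perfect-correlation case in which no unique estimate exists.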

Example of Multicollinearity Issues

Variables in a Regression: Consider an outcome variable (Y) of income and two independent variables, age (X1) and education (X2). The OLS regression equation would typically be: $\text{Income}_i = \beta_0 + \beta_1 \text{Age}_i + \beta_2 \text{Education}_i + \epsilon_i$
- The coefficient for age ( $\beta_1$ ) in this model represents the expected change in income for a one-unit increase in age, while holding education constant. Similarly, the coefficient for education ( $\beta_2$ ) represents the expected change in income for a one-unit increase in education, while holding age constant.
- If age and education are highly correlated (e.g., older individuals tend to have more years of education due to historical trends or career progression), it becomes challenging for the model to disentangle their separate effects. The estimates for their coefficients ( $\beta_1$ and $\beta_2$ ) become unreliable, and their standard errors increase significantly. This can potentially lead to non-significant results even if there is a genuine, albeit overlapping, effect of both variables on income.

Case Study: Commute Time and Test Scores

Research Scenario: Analyzing the relationship between commute time and test scores in a rural school district.
- Dependent Variable: Test scores (Y).
- Independent Variables: Commute time (X1, measured in minutes) and distance between home and school (X2, measured in miles).
- These two variables are almost certainly highly correlated. For most students, a longer commute time will directly correspond to a greater distance between their home and school. Including both in the same regression model would create severe multicollinearity. The coefficient estimates would remain unbiased, but they would be imprecise and unstable, making the individual effects hard to pin down.
Solution: To avoid multicollinearity problems in the regression analysis, it is prudent to select only one of these variables. The choice between commute time and distance should be guided by theoretical considerations (which variable is more directly presumed to affect test scores) and practical availability/quality of data.

Robustness of Regression Models

Sample Size Mitigation: With sufficiently large sample sizes (e.g., generally over 2000 observations), the impact of multicollinearity on the reliability of coefficient estimates can be somewhat mitigated. This is because a larger number of observations provides more unique variation, making it easier for the model to discern even subtle individual contributions of correlated predictors.
Conceptual Clarity: However, it remains crucial to avoid including variables that are conceptually redundant or essentially measure the exact same construct, regardless of sample size. For instance, including both GPA and cumulative academic average is often problematic. Diagnostic tools like the Variance Inflation Factor (VIF) can be used to detect multicollinearity, with VIF values above 5 or 10 generally indicating a problem. Even with large samples, high VIFs suggest that the individual coefficients are still much less precise than they could be.
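
As a rough illustration of the VIF, the sketch below handles only the special case of a model with exactly two predictors, where the VIF reduces to $1 / (1 - r^2)$ with $r$ the correlation between them. The data values and the function names (`pearson_r`, `vif_two_predictors`) are hypothetical and chosen for this example; with more than two predictors, each VIF comes from regressing one predictor on all the others.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def vif_two_predictors(x1, x2):
    """VIF shared by both predictors when the model has exactly two predictors."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)

# Hypothetical data: commute time (minutes) and home-school distance (miles).
commute_minutes = [10, 15, 22, 30, 34, 41, 50, 58]
distance_miles = [3, 5, 8, 11, 12, 16, 19, 23]

vif = vif_two_predictors(commute_minutes, distance_miles)
se_inflation = math.sqrt(vif)  # factor by which each standard error grows

print(f"VIF = {vif:.1f}, standard errors inflated by a factor of {se_inflation:.1f}")
```

The square root of the VIF is the factor by which a coefficient's standard error exceeds what it would be if the predictors were uncorrelated, which is why a VIF of 10 corresponds to standard errors roughly three times larger.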

Centering Variables Around the Mean

Purpose of Centering: Centering variables involves subtracting the mean of a variable from each of its observations. This practice serves two primary purposes:
- Interpretable Intercept: It can make the intercept (the value of Y when all X variables are 0) more meaningful, especially if 0 is an impossible or nonsensical value for an independent variable (e.g., age 0).
- Reduced Correlation for Interaction/Polynomial Terms: While it does not resolve multicollinearity between independent variables themselves, centering can significantly reduce collinearity between a main effect and an interaction term involving that main effect, or between a linear term and its quadratic term (e.g., age and age^2).
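
The second point can be checked numerically. The sketch below uses a hypothetical, evenly spread sample of ages (20 through 80) and compares the correlation between age and age^2 before and after centering. Because this made-up sample is perfectly symmetric around its mean, the centered correlation comes out at essentially zero; with real, skewed data it would typically be reduced rather than eliminated.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

ages = list(range(20, 81))         # hypothetical sample: one person at each age 20..80
mean_age = sum(ages) / len(ages)   # 50.0 for this sample

raw_squared = [a ** 2 for a in ages]
centered = [a - mean_age for a in ages]
centered_squared = [c ** 2 for c in centered]

r_raw = pearson_r(ages, raw_squared)
r_centered = pearson_r(centered, centered_squared)

print(f"corr(age, age^2), uncentered: {r_raw:.3f}")
print(f"corr(age, age^2), centered:   {abs(r_centered):.3f}")
```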
Centering Process: To center a variable, you subtract its mean from each observation. For example, to center age around its mean:
- If the mean age in your sample is 49: Centered age ( $X_{centered}$ ) = Original age (X) - 49.
Regression Impact: In a model with only linear terms (no interaction or polynomial terms), centering leaves the slopes (coefficients) of the independent variables exactly the same as they would be with uncentered variables. Only the intercept changes: its interpretation shifts from the predicted value of Y when X is 0 to the predicted value of Y when X is at its mean.
- Example Regression: Suppose an original regression yields: $\text{Predicted Income} = 17686 + 500 \times \text{Age}$
- Using original age: The intercept of $17686 represents the predicted income at age 0, which is not meaningful or within the plausible range of the data.
- Using centered age (where mean age is 49): If the regression is re-estimated with centered age, the coefficient for centered age will still be $500, but the intercept becomes $42186 (17686 + 500 × 49). This means the predicted income at the mean age of 49 is $42186, providing a much more interpretable baseline.
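
The effect of centering on the slope and intercept can be verified with a small sketch. The five observations below are hypothetical and the helper `simple_ols` is written for this example only; it fits a one-predictor regression using the usual closed-form formulas.

```python
def simple_ols(x, y):
    """Return (intercept, slope) from a one-predictor OLS fit."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical sample (not taken from any real dataset).
age = [25, 37, 49, 61, 73]
income = [30100, 36500, 41800, 48900, 53700]

mean_age = sum(age) / len(age)
centered_age = [a - mean_age for a in age]

b0_raw, b1_raw = simple_ols(age, income)
b0_ctr, b1_ctr = simple_ols(centered_age, income)

print(f"uncentered: intercept = {b0_raw:.1f}, slope = {b1_raw:.1f}")
print(f"centered:   intercept = {b0_ctr:.1f}, slope = {b1_ctr:.1f}")
```

The two slopes agree, and the centered intercept equals the uncentered intercept plus the slope times the mean age, which in a one-predictor model is also the sample mean of income.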

Non-Linear Relationships in Regression

OLS Assumptions: Ordinary Least Squares (OLS) regression assumes that the model is linear in its parameters, so entering X as a single linear term assumes that a one-unit change in X has the same effect on Y at every value of X. If the true relationship is non-linear but modeled linearly, the model is misspecified and the estimated slope misrepresents the relationship, leading to incorrect inferences.
Incorporating Non-Linearity: To allow for non-linear effects, explicit terms can be added to the model. A common approach is to include a quadratic term (e.g., include both age and age^2 in the model). The quadratic term is typically generated by squaring the original or, more commonly, the centered independent variable to reduce multicollinearity between the linear and quadratic terms.
- The regression equation becomes: $Y_i = \beta_0 + \beta_1 X_i + \beta_2 X_i^2 + \epsilon_i$
- The interpretation of $\beta_1$ now changes, as the effect of X on Y is no longer constant but rather dependent on the value of X itself ( $\partial Y / \partial X = \beta_1 + 2\beta_2 X$ ).
Graphing Non-Linear Relationships: Including quadratic terms results in a curved relationship (a parabola), which can be either concave (opening downwards, peaking) or convex (opening upwards, U-shaped), rather than a straight line. This reflects that the effect of X on Y changes in magnitude and possibly direction based on the current value of X.
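
The point at which the curve changes direction follows from setting the marginal effect of X to zero. The symbol $X^{*}$ for the turning point is notation introduced here:

$$\frac{\partial Y}{\partial X} = \beta_1 + 2\beta_2 X = 0 \quad \Longrightarrow \quad X^{*} = -\frac{\beta_1}{2\beta_2}$$

The curve is concave (a peak at $X^{*}$) when $\beta_2 < 0$ and convex (a trough at $X^{*}$) when $\beta_2 > 0$. If $X^{*}$ falls outside the observed range of X, the relationship bends but does not change direction within the data.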

Example of Non-Linear Regression Model

Income as Function of Age: The relationship between age and income is a classic example of a non-linear association. It's plausible that as age increases, income might rise steadily during early career stages, then grow more slowly, potentially peak around mid-career (e.g., around age 50-60), and subsequently plateau or even decrease during later career or retirement years. Such a pattern cannot be accurately captured by a simple linear term for age.
- A model including both age and age^2 can effectively capture this curvilinear pattern. The coefficient for age represents the slope at age 0 (or at the mean age, if age is centered), while the coefficient for age^2 determines how quickly this slope changes. A negative coefficient for age^2 allows the income trajectory to rise, peak, and potentially fall.
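
To make the changing slope concrete, the sketch below evaluates a quadratic income model at several ages. The coefficients are hypothetical values chosen so that the arithmetic is easy to follow; they are not estimates from any dataset.

```python
# Hypothetical coefficients for: income = b0 + b1*age + b2*age^2
b0, b1, b2 = -20000.0, 3000.0, -30.0

def predicted_income(age):
    """Predicted income from the quadratic model."""
    return b0 + b1 * age + b2 * age ** 2

def marginal_effect(age):
    """Slope of the fitted curve at a given age: b1 + 2*b2*age."""
    return b1 + 2 * b2 * age

peak_age = -b1 / (2 * b2)  # age at which the marginal effect equals zero

for a in (30, 50, 65):
    print(f"age {a}: predicted income = {predicted_income(a):.0f}, "
          f"marginal effect = {marginal_effect(a):.0f} per year")
print(f"peak age = {peak_age:.0f}")
```

With these values, an additional year of age is associated with higher income at 30, no change at 50, and lower income at 65, which is the rise, peak, and fall pattern described above.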

Use of Dummy Variables to Address Non-linearity

Creating Categorical Variables: Instead of forcing a polynomial functional form (like quadratic), researchers can create a set of dummy (binary) variables for different categories or ranges of a continuous independent variable. For instance, age could be categorized into age_group_20-29, age_group_30-39, age_group_40-49, etc., with one group chosen as the reference category.
Implications: Using dummy variables captures non-linear trends by estimating a distinct effect for each category relative to the reference group, without forcing a specific mathematical shape (like a parabola) on the data. This offers flexibility but increases the number of parameters to estimate, potentially reducing statistical power if categories are too fine-grained or sample size is small. The coefficient for each dummy variable represents the average difference in the dependent variable between that specific group and the reference group, holding other variables constant.
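
A minimal sketch of the dummy-variable approach follows, assuming a model whose only predictors are the age-group dummies. In that special case the OLS intercept equals the mean of the reference group and each dummy coefficient equals the difference between that group's mean and the reference group's mean, so the sketch computes group means directly rather than fitting a regression. The observations and the `age_group` helper are hypothetical.

```python
from collections import defaultdict

def age_group(age):
    """Map an age to a decade label such as '30-39'."""
    low = (age // 10) * 10
    return f"{low}-{low + 9}"

# Hypothetical (age, income) observations.
sample = [(23, 28000), (27, 32000), (34, 41000), (38, 45000),
          (45, 52000), (49, 56000), (56, 54000), (62, 47000), (67, 43000)]

incomes_by_group = defaultdict(list)
for age, income in sample:
    incomes_by_group[age_group(age)].append(income)

group_means = {g: sum(v) / len(v) for g, v in incomes_by_group.items()}

reference = "20-29"  # omitted category; every coefficient is relative to it
dummy_coefs = {g: m - group_means[reference]
               for g, m in sorted(group_means.items()) if g != reference}

print(f"intercept (mean income of {reference}): {group_means[reference]:.0f}")
for g, coef in dummy_coefs.items():
    print(f"age_group_{g}: {coef:+.0f}")
```

Once other covariates are added to the model, the dummy coefficients are adjusted differences and no longer equal the raw differences in group means.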

Testing for Non-Linear Relationships

When to consider non-linear relationships: The decision to model non-linearity should be guided by:
- Existing Theory: Strong theoretical reasons suggesting a non-linear association (e.g., economic diminishing returns, sociological concepts of age and social relationships peaking, psychological models of stress and performance). For instance, performance is often theorized to follow an inverted-U-shaped relationship with stress, peaking at moderate stress levels.
- Visual Inspection: Scatter plots of the dependent variable against independent variables can reveal obvious curvilinear patterns.
- Formal Statistical Tests: Tests like the Ramsey RESET test can formally evaluate whether non-linear terms (like powers of predicted values) are needed in the model. Comparing AIC or BIC values, or using F-tests to compare nested models (with and without non-linear terms), can also help.
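
The nested-model F-test mentioned above can be sketched as follows. The function name `nested_f_statistic` and the residual sums of squares are hypothetical; in practice the two RSS values would come from fitting the model with and without the non-linear term on the same data.

```python
def nested_f_statistic(rss_restricted, rss_full, n_added_terms, n_obs, n_params_full):
    """F statistic for testing whether the added terms improve the fit.

    n_params_full counts every coefficient in the full model, including the intercept.
    """
    numerator = (rss_restricted - rss_full) / n_added_terms
    denominator = rss_full / (n_obs - n_params_full)
    return numerator / denominator

# Hypothetical residual sums of squares from two fits on the same 103 observations:
#   restricted model: Y ~ X
#   full model:       Y ~ X + X^2   (3 coefficients including the intercept)
f_stat = nested_f_statistic(rss_restricted=1200.0, rss_full=1000.0,
                            n_added_terms=1, n_obs=103, n_params_full=3)

print(f"F = {f_stat:.1f} on (1, 100) degrees of freedom")
```

The statistic is compared against an F distribution with (number of added terms, residual degrees of freedom of the full model) degrees of freedom. Obtaining the p-value requires an F distribution function from a statistics library, which this sketch leaves out. A large F suggests that the quadratic term meaningfully reduces the unexplained variation.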

Types of Regression Variables Covered

Ordinary Least Squares (OLS): Primarily focuses on modeling relationships when the dependent variable is continuous and unbounded. For inference, the error terms are assumed to be approximately normally distributed (or the sample is large enough for the Central Limit Theorem to apply to the estimators).
Other Models: OLS is just one of many regression techniques, each suited for different types of dependent variables:
- Logistic Regression: Used for binary outcomes (e.g., 0/1, yes/no, success/failure). It models the probability of an event occurring using a logit link function.
- Ordered Logistic Regression: Designed for ordinal outcomes, where the categories have a meaningful order but the distances between them are not assumed to be equal (e.g., strongly disagree, disagree, neutral, agree, strongly agree).
- Multinomial Regression: Used for nominal outcomes, where there are more than two categories without any inherent order (e.g., political party affiliation, mode of transportation).
- Poisson Regression: Appropriate for count outcomes, which are non-negative integers (e.g., number of doctor visits, number of arrests), especially when the counts are approximately Poisson-distributed (variance roughly equal to the mean).
Linear Probability Model (LPM): This involves using OLS regression when the dependent variable is binary (0 or 1). While simple and easy to interpret, it has significant limitations that make its use debatable:
- Predicted Probabilities Outside [0,1]: The model can predict probabilities less than 0 or greater than 1, which are theoretically impossible for probabilities.
- Constant Marginal Effects: The effect of a one-unit change in an independent variable on the probability of the outcome is assumed to be constant across all values of the independent variables. This is often unrealistic; for example, going from a very low to a moderate income might have a different impact on the probability of buying a house than going from a high to a very high income.
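- Heteroskedasticity: The error terms in an LPM are inherently heteroskedastic, because the variance of a binary outcome, $p(1 - p)$, depends on the predicted probability. This violates an OLS assumption: the coefficient estimates remain unbiased but are inefficient, and the conventional standard errors are incorrect. Robust standard errors are typically required to address this.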
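
Two of these limitations can be seen in a small sketch that fits an LPM by applying OLS to a binary outcome. The ten observations and the helper `simple_ols` are hypothetical and written for this example only.

```python
def simple_ols(x, y):
    """Return (intercept, slope) from a one-predictor OLS fit."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical data: x is a predictor, y is a binary outcome (0 or 1).
x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

intercept, slope = simple_ols(x, y)
predicted = [intercept + slope * xi for xi in x]

for xi, p in zip(x, predicted):
    in_range = 0.0 <= p <= 1.0
    # Var(y | x) = p(1 - p) is only meaningful when p lies inside [0, 1].
    variance = f"{p * (1 - p):.3f}" if in_range else "undefined"
    flag = "" if in_range else "  <- outside [0, 1]"
    print(f"x = {xi}: predicted probability = {p:+.3f}, "
          f"implied error variance = {variance}{flag}")
```

The fitted line produces predictions below 0 at the smallest values of x and above 1 at the largest, and the implied error variance differs from one observation to the next, which is the heteroskedasticity described above.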

Case Example of a Linear Probability Model

Dependent Variable: A binary outcome representing support for a law (e.g., 1 = favor, 0 = oppose). The independent variables could include education level, income, age, etc.
Interpretation of Coefficients: For example, if the coefficient for a college degree dummy variable in an LPM is $0.07$ , it means individuals with a college degree are estimated to be 7 percentage points more likely to support gun permit laws compared to those without a college degree, holding other factors constant. Note that this is a change of 7 percentage points in the probability, not a 7% relative increase. This interpretation is straightforward, but it's crucial to remember the inherent limitations of the LPM, such as the possibility of predicted probabilities falling outside the $[0, 1]$ range, especially when dealing with extreme values of predictors or a dependent variable with probabilities close to 0 or 1. While the LPM provides an intuitive interpretation in terms of percentage-point changes in probability, its conceptual drawbacks lead many researchers to prefer logistic or probit models for binary outcomes.
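
The difference between a percentage-point change and a percent (relative) change can be shown with a short calculation. The baseline probability of 0.50 is a hypothetical value chosen for illustration; only the coefficient of 0.07 comes from the example above.

```python
baseline = 0.50      # hypothetical predicted probability without a college degree
coefficient = 0.07   # LPM coefficient on the college degree dummy

with_degree = baseline + coefficient

point_change = (with_degree - baseline) * 100                 # percentage points
relative_change = (with_degree - baseline) / baseline * 100   # percent of the baseline

print(f"{point_change:.0f} percentage points, "
      f"{relative_change:.0f}% relative increase")
```

The LPM coefficient always gives the percentage-point change. The relative change depends on the baseline probability and would be larger for a lower baseline.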
Overall Aim: The ultimate goal is to understand and appropriately employ various regression techniques to accurately capture relationships in social science research, carefully considering the types of variables, the functional form of relationships (linear vs. non