9/24: SOCI 252 - Linear Regression with Binary Outcomes
Review of Previous Lecture
Predicting Final Exam Grade (Non-Binary Outcome):
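Model used:
$\widehat{\text{final}} = \hat{\alpha} + \hat{\beta} \times \text{midterm}$, where $\widehat{\text{final}}$ is the predicted final exam grade.
If midterm grade is $x_1$: Predicted final exam grade is $\hat{\alpha} + \hat{\beta} x_1$. Unit of measurement is points.
If midterm grade is $x_2$: Predicted final exam grade is $\hat{\alpha} + \hat{\beta} x_2$. Unit of measurement is points.
Predicted Change: A $\Delta x$-point difference on the midterm ($\Delta x = x_2 - x_1$) leads to an expected difference of $\Delta \hat{y}$ points on the final exam ($\Delta \hat{y} = \hat{y}_2 - \hat{y}_1$ or $\hat{\beta} \times \Delta x$).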
Predicting Binary Outcomes: Earning an 'A'
Dataset Overview:
Same dataset as before, but the outcome variable changes.
midterm and final grades are non-binary numeric variables.
grade_a (whether a student got an 'A' or 'A-') is a binary variable (0 for no 'A', 1 for 'A').
tail() function: Used to view the last six observations, similar to head() for the first six.
Interpretation of the last observation involves describing the characteristics of that specific student across all variables.
Predictor and Outcome: We are using midterm as the predictor to determine grade_a.
Visualizing Binary Outcomes:
A histogram for midterm grades shows a typical distribution (skewed or centered, with outliers).
A histogram for grade_a (binary) only shows two bars, representing the count of students who got an 'A' (1) versus those who didn't (0). It is less useful for showing variation but indicates which outcome is more common (e.g., not getting an 'A' is more common in this dataset).
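As a rough illustration of the dataset overview above, the sketch below builds a small made-up data frame (the name data and all values are assumptions, not the class dataset), views its last rows with tail(), and counts the two outcome values that a histogram of grade_a would show as two bars.

```r
# Hypothetical stand-in for the class dataset; every value here is made up.
data <- data.frame(
  midterm = c(62, 71, 75, 80, 84, 88, 90, 93, 95, 98),
  final   = c(60, 70, 72, 79, 85, 86, 88, 90, 96, 97),
  grade_a = c(0, 0, 0, 0, 1, 0, 0, 1, 1, 1)
)

# tail() shows the last six rows by default; head() shows the first six.
tail(data)

# The last observation: one student described across all variables.
last_student <- tail(data, 1)
print(last_student)

# Counts of each outcome: the heights of the two bars for grade_a.
counts <- table(data$grade_a)
print(counts)
```

Calling hist(data$midterm) or hist(data$grade_a) on the same data frame would draw the corresponding histograms.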
Interpreting the Mean for Binary Outcomes:
The mean of a binary variable (coded 0 or 1) represents the proportion of observations that have the characteristic represented by 1.
For grade_a, the mean is 0.37 (or 37%).
Interpretation: Approximately 37% of students in this class ended up with an 'A' or 'A-'. The remaining 63% did not.
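The claim that the mean of a 0/1 variable is a proportion can be checked directly. This is a minimal sketch on a made-up vector (the values are an assumption and give 40%, not the 37% from the class data).

```r
# Hypothetical outcomes for 10 students: 1 = earned an 'A' or 'A-', 0 = did not.
grade_a <- c(0, 0, 0, 0, 1, 0, 0, 1, 1, 1)

# The mean of a 0/1 variable ...
prop_mean <- mean(grade_a)

# ... equals the count of 1s divided by the number of observations.
prop_count <- sum(grade_a == 1) / length(grade_a)

print(prop_mean)         # 0.4
print(prop_mean * 100)   # 40, read as "40% of these students got an A"
```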
Units of Measurement for Binary Outcomes (Crucial Distinction)
When the outcome variable (y) is binary (0 or 1), the units of measurement change significantly compared to non-binary outcomes.
Unit for Average of y: Percentage (e.g., 37% of students got an A).
Unit for Intercept ($\boldsymbol{\hat{\alpha}}$): Percentage.
Unit for Predicted y ($\boldsymbol{\hat{y}}$): Percentage (representing probability).
Unit for Change in y ($\boldsymbol{\Delta y}$): Percentage points.
This is distinct from percentage because subtracting percentages yields percentage points. For example, a change from 4% to 2% is a 2 percentage point decrease, not a 2% decrease (a 2% decrease would be 2% of 4%, or 0.08 percentage points; in relative terms, going from 4% to 2% is a 50% decrease).
Unit for Slope ($\boldsymbol{\hat{\beta}}$): Percentage points.
Unit for Change in $\boldsymbol{\hat{y}}$ ($\boldsymbol{\Delta \hat{y}}$): Percentage points.
Recap: Anytime we discuss a change in percentage, it must be expressed in percentage points to avoid confusion with a percentage of a percentage.
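The distinction between percentage points and percent can be made concrete with the 4% to 2% example. The sketch below is only arithmetic; the variable names are illustrative.

```r
p_before <- 0.04   # 4%
p_after  <- 0.02   # 2%

# Difference of two percentages: measured in percentage points.
pp_change <- (p_after - p_before) * 100
print(pp_change)        # -2 (a 2 percentage point decrease)

# Relative change: measured in percent of the starting value.
rel_change <- (p_after - p_before) / p_before * 100
print(rel_change)       # -50 (a 50% decrease)

# A true "2% decrease" from 4% would remove only 2% of 4%.
two_pct_of_start <- 0.02 * p_before * 100
print(two_pct_of_start) # 0.08 percentage points
```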
Scatterplot for Binary Outcomes
Appearance: With a binary outcome (0 or 1) on the y-axis and a continuous predictor (midterm) on the x-axis, the scatterplot looks like two horizontal lines of dots at y = 0 and y = 1. It does not form a 'cloud' that a single straight line easily fits.
Interpretation of Dots: Each dot represents an individual student's midterm score and whether they got an 'A'.
Relationship (Positive/Negative):
The relationship appears positive: higher midterm grades show a higher density of dots at y = 1 (got an 'A'), while lower midterm grades show a higher density at y = 0 (did not get an 'A').
Students who did not get an 'A' (y = 0) show a much wider range of midterm scores, including many lower scores.
Students who got an 'A' (y = 1) are typically concentrated in higher midterm score ranges, with fewer low scores.
Strength of Relationship: The relationship is stronger when the two groups (y = 0 and y = 1) are more distinct in their x-axis ranges. If their ranges overlap significantly (e.g., looking like parallel lines with lots of overlap across x-values), the relationship is weak. In this case, it's moderately strong because the 'A' group is fairly distinct from the 'no A' group in terms of typical midterm scores.
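One way to put numbers on how distinct the two groups are is to summarize midterm scores separately for y = 0 and y = 1. This is a sketch on made-up data (the data frame and its values are assumptions), not the class dataset.

```r
# Hypothetical data; values are made up for illustration.
data <- data.frame(
  midterm = c(62, 71, 75, 80, 84, 88, 90, 93, 95, 98),
  grade_a = c(0, 0, 0, 0, 1, 0, 0, 1, 1, 1)
)

# Average midterm score within each outcome group.
group_means <- tapply(data$midterm, data$grade_a, mean)
print(group_means)

# Range of midterm scores within each group; the more these overlap,
# the weaker the relationship looks on the scatterplot.
group_ranges <- tapply(data$midterm, data$grade_a, range)
print(group_ranges)
```

Calling plot(data$midterm, data$grade_a) on the same data would draw the two horizontal bands of dots described above.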
Correlation (r)
Calculation: The correlation between grade_a and midterm is 0.64.
Interpretation: A positive correlation of 0.64 indicates a moderate-to-strong positive linear relationship. The correlation itself is not the share of variation explained; squaring it gives that share ($0.64^2 \approx 0.41$, so about 41% of the variation in whether a student gets an 'A' or 'A-' is explained by midterm grades). This is fairly high, particularly given the large range of midterm scores for students not getting an 'A' and the smaller range for those who did.
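The correlation between a binary outcome and a numeric predictor is computed the same way as for two numeric variables. The sketch below uses cor() on made-up data (an assumption), so the result will not match the class value exactly.

```r
# Hypothetical data; values are made up for illustration.
data <- data.frame(
  midterm = c(62, 71, 75, 80, 84, 88, 90, 93, 95, 98),
  grade_a = c(0, 0, 0, 0, 1, 0, 0, 1, 1, 1)
)

# Pearson correlation between the 0/1 outcome and the predictor.
r <- cor(data$grade_a, data$midterm)
print(r)

# Argument order does not matter for a correlation.
r_swapped <- cor(data$midterm, data$grade_a)
```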
Fitting the Linear Model
Method: Using the lm() function, similar to non-binary outcomes: lm(data$grade_a ~ data$midterm).
Results:
Intercept ($\boldsymbol{\hat{\alpha}}$): $-1.34$
Slope ($\boldsymbol{\hat{\beta}}$): $0.02$
Fitted Line Formula:
$\widehat{\text{grade\_a}} = -1.34 + 0.02 \times \text{midterm}$
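A minimal sketch of the fitting step, assuming a made-up data frame named data. The coefficients it produces will differ from the class results, since the real dataset is not reproduced here.

```r
# Hypothetical data; values are made up for illustration.
data <- data.frame(
  midterm = c(62, 71, 75, 80, 84, 88, 90, 93, 95, 98),
  grade_a = c(0, 0, 0, 0, 1, 0, 0, 1, 1, 1)
)

# Fit the linear model: binary outcome on the left, predictor on the right.
fit <- lm(data$grade_a ~ data$midterm)

# coef() returns the intercept first, then the slope.
alpha_hat <- unname(coef(fit)[1])
beta_hat  <- unname(coef(fit)[2])
print(alpha_hat)
print(beta_hat)
```

Note that R is case sensitive, so the function is lm(), not LM().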
Interpreting Coefficients for Binary Outcomes
Intercept ($\boldsymbol{\hat{\alpha}} = -1.34$, i.e., $-134\%$):
Mathematical interpretation: If a student scored 0 on the midterm, the predicted probability of them earning an 'A' is -134%.
Sensibility: This is nonsensical because probabilities cannot be negative. This occurs because x = 0 (a midterm score of zero) is outside the observed range of midterm grades in the dataset (the minimum observed score was well above 0).
Unit: Percentage.
Despite being nonsensical in this context, the intercept is still mathematically necessary for the model and useful for calculations within the observed data range.
Slope ($\boldsymbol{\hat{\beta}} = 0.02$, i.e., 2 percentage points):
Interpretation: For every one-point increase in a student's midterm exam grade, the predicted probability of them earning an 'A' or 'A-' in the class increases by 2 percentage points, on average.
Unit: Percentage points.
Practical example: Scoring one point better on the midterm makes a student 2 percentage points more likely to get an 'A'. Scoring one point lower makes them 2 percentage points less likely.
Practical Predictions for Binary Outcomes
Predicting Probability of 'A' for a Midterm of 80: $\widehat{\text{grade\_a}} = -1.34 + 0.02 \times 80 = 0.26$
Interpretation: A student who scored 80 on the midterm has a 26% probability of earning an 'A' or 'A-' in the class. (Roughly a 1 in 4 chance.)
Predicting Probability of 'A' for a Midterm of 90: $\widehat{\text{grade\_a}} = -1.34 + 0.02 \times 90 = 0.46$
Interpretation: A student who scored 90 on the midterm has a 46% probability of earning an 'A' or 'A-' in the class. (Roughly a 1 in 2 chance.)
Predicting Change: An increase in midterm score of 10 points ($\Delta x = 10$) is associated with an increase in the predicted probability of earning an 'A' or 'A-' of 20 percentage points on average ($\Delta \hat{y} = \hat{\beta} \times \Delta x = 0.02 \times 10 = 0.20$).
This is a substantial increase, but it's important to note that the midterm is only one factor contributing to a final 'A' grade.
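The predictions above can be reproduced by plugging values into the fitted line. This sketch hard-codes the rounded coefficients from these notes (-1.34 and 0.02). The midterm scores 80 and 90 are an inference: they are the scores that give exactly 26% and 46% under those coefficients. The helper name predict_prob_a is hypothetical.

```r
# Rounded coefficients reported in the notes.
alpha_hat <- -1.34
beta_hat  <- 0.02

# Hypothetical helper: predicted probability of an 'A' for a given midterm score.
predict_prob_a <- function(midterm) {
  alpha_hat + beta_hat * midterm
}

p80 <- predict_prob_a(80)
p90 <- predict_prob_a(90)
print(p80)   # 0.26
print(p90)   # 0.46

# Change in predicted probability for a 10-point difference, in percentage points.
delta_pp <- (p90 - p80) * 100
print(delta_pp)   # 20

# Extrapolating to a midterm of 0 gives the intercept, which is not a valid probability.
p0 <- predict_prob_a(0)
print(p0)    # -1.34
```

The last line illustrates why the intercept is not interpretable here: a linear model does not keep predictions between 0 and 1 outside the observed range of the predictor.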
R-squared ($\boldsymbol{r^2}$) for Model Fit
Definition: R-squared is a measure of how well the regression model fits the observed data. It represents the proportion of the variance in the dependent variable (y) that is predictable from the independent variable (x).
Not related to R software: The name is similar but it's a statistical concept, not tied to the R programming language.
Formula: For simple linear regression, $r^2 = (\text{correlation})^2$.
Interpretation of Values:
$r^2 = 1$: Indicates a perfect fit. The model explains 100% of the variation in y. There is no error between predicted y and actual y (e.g., when correlation = 1 or -1).
$r^2 = 0$: Indicates no fit. The model explains 0% of the variation in y. There is a large mismatch between predicted y and actual y (e.g., when correlation = 0).
Calculating $r^2$ for the 'Grade A' Model:
Correlation was 0.64, so $r^2 = 0.64^2 \approx 0.41$.
Interpretation: The model using midterm grades explains about 41% of the variation in whether a student gets an 'A' or 'A-'. This means midterm grades are a meaningful predictor, but they do not explain the majority of the variation, implying other factors are also very important.
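For simple linear regression, squaring the correlation and reading R-squared off the fitted model give the same number. The sketch below checks this on made-up data (an assumption), and separately squares the 0.64 reported in the notes.

```r
# Hypothetical data; values are made up for illustration.
data <- data.frame(
  midterm = c(62, 71, 75, 80, 84, 88, 90, 93, 95, 98),
  grade_a = c(0, 0, 0, 0, 1, 0, 0, 1, 1, 1)
)

# Route 1: square the correlation.
r <- cor(data$grade_a, data$midterm)
r2_from_cor <- r^2

# Route 2: read R-squared from the model summary.
fit <- lm(data$grade_a ~ data$midterm)
r2_from_lm <- summary(fit)$r.squared

print(r2_from_cor)
print(r2_from_lm)

# The class correlation of 0.64, squared.
r2_class <- 0.64^2
print(r2_class)   # 0.4096, about 41%
```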
Comparing with Non-Binary Outcome Model:
The previous model (midterm predicting final exam grade) had its own $r^2$, computed for a different outcome variable.
Important Caveat: You should only compare R-squared values to other models using the same outcome variable. Different outcomes have different inherent levels of predictability, so comparing across different outcome types can be misleading.
Missing Data: When calculating correlations or $r^2$ with missing data, specify an option that excludes missing values from the calculation without permanently removing them from the dataset: na.rm = TRUE for functions such as mean(), or use = "complete.obs" for cor().
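A short sketch of how missing values behave in these calculations, using made-up vectors with NA entries (the values are assumptions). In base R, mean() accepts na.rm = TRUE, while cor() handles missing values through its use argument instead.

```r
# Hypothetical vectors with missing values.
midterm <- c(70, 85, NA, 92, 78, 88)
grade_a <- c(0, 1, 0, NA, 0, 1)

# Without an NA option, the result is NA.
mean_raw <- mean(grade_a)
print(mean_raw)     # NA

# na.rm = TRUE drops missing values for this calculation only.
mean_clean <- mean(grade_a, na.rm = TRUE)
print(mean_clean)   # 0.4

# cor() uses the `use` argument; "complete.obs" keeps only rows
# where both variables are observed.
r <- cor(grade_a, midterm, use = "complete.obs")
print(r^2)

# The original vectors are unchanged: the NAs are still there.
n_obs <- length(grade_a)
n_missing <- sum(is.na(grade_a))
```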
Conclusion
For predicting outcomes using linear models, aim for predictor (x) variables that are highly correlated with the outcome (y).
A higher correlation leads to a higher $r^2$, indicating a better model fit and more reliable predictions from a simple linear model.
Next lecture will focus on causal effects using observational studies.