9/24: SOCI 252 - Linear Regression with Binary Outcomes

Review of Previous Lecture
  • Predicting Final Exam Grade (Non-Binary Outcome):

    • Model used: $\text{predicted final exam grade} = -6.01 + 0.97 \times \text{midterm}$.

    • If midterm grade is 80: Predicted final exam grade is $0.97 \times 80 - 6.01 = 77.6 - 6.01 = 71.59$. Unit of measurement is points.

    • If midterm grade is 90: Predicted final exam grade is $0.97 \times 90 - 6.01 = 87.3 - 6.01 = 81.29$. Unit of measurement is points.

    • Predicted Change: A 10-point difference on the midterm ($90 - 80$) leads to an expected difference of 9.7 points on the final exam ($81.29 - 71.59$ or $0.97 \times 10$).
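
The review arithmetic can be reproduced in a few lines of R. This is a minimal sketch that hard-codes the two coefficients quoted above; the helper name predict_final is hypothetical and not from the lecture.

```r
# Coefficients quoted in the notes for the final-exam model
alpha_hat <- -6.01
beta_hat  <- 0.97

# Hypothetical helper: predicted final exam grade, in points
predict_final <- function(midterm) alpha_hat + beta_hat * midterm

pred_80 <- predict_final(80)   # 71.59 points
pred_90 <- predict_final(90)   # 81.29 points
diff_10 <- pred_90 - pred_80   # 9.7 points, the same as beta_hat * 10
```

Because the model is a straight line, the difference between any two predictions depends only on the slope and the gap in midterm scores; the intercept cancels out.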

Predicting Binary Outcomes: Earning an 'A'
  • Dataset Overview:

    • Same dataset as before, but the outcome variable changes.

    • midterm and final grades are non-binary numeric.

    • grade_a (whether a student got an 'A' or 'A-') is a binary variable (0 for no 'A', 1 for 'A').

    • tail() function: Used to view the last six observations, similar to head() for the first six.

    • Interpretation of the last observation involves describing the characteristics of that specific student across all variables.

  • Predictor and Outcome: We are using midterm as the predictor to determine grade_a.

  • Visualizing Binary Outcomes:

    • A histogram for midterm grades shows typical distribution (skewed or centered, with outliers).

    • A histogram for grade_a (binary) only shows two bars, representing the count of students who got an 'A' (1) versus those who didn't (0). It is less useful for showing variation but indicates which outcome is more common (e.g., not getting an 'A' is more common in this dataset).

  • Interpreting the Mean for Binary Outcomes:

    • The mean of a binary variable (coded 0 or 1) represents the proportion of observations that have the characteristic represented by 1.

    • For grade_a, the mean is 0.368 (or 37%).

    • Interpretation: Approximately 37% of students in this class ended up with an 'A' or 'A-'. The remaining 63% did not.
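
A short R sketch of why the mean of a 0/1 variable is a proportion. The vector below is made-up toy data, not the class dataset, so its mean (0.3) differs from the 0.368 reported above.

```r
# Hypothetical toy data: 1 = 'A' or 'A-', 0 = anything else
grade_a <- c(1, 0, 0, 1, 0, 0, 0, 1, 0, 0)

counts <- table(grade_a)       # the two counts a histogram would show as bars
prop_a <- mean(grade_a)        # mean of a 0/1 variable
share  <- sum(grade_a == 1) / length(grade_a)   # proportion of 1s, computed directly

isTRUE(all.equal(prop_a, share))   # the two quantities agree
```

Adding up 0s and 1s just counts the 1s, so dividing by the number of observations gives the share of students coded 1.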

Units of Measurement for Binary Outcomes (Crucial Distinction)
  • When the outcome variable (y) is binary (0 or 1), the units of measurement change significantly compared to non-binary outcomes.

  • Unit for Average of y: Percentage (e.g., 37% of students got an A).

  • Unit for Intercept ($\boldsymbol{\hat{\alpha}}$): Percentage.

  • Unit for Predicted y ($\boldsymbol{\hat{y}}$): Percentage (representing probability).

  • Unit for Change in y ($\boldsymbol{\Delta y}$): Percentage points.

    • This is distinct from percentage because subtracting percentages yields percentage points. For example, a change from 4% to 2% is a 2 percentage point decrease, not a 2% decrease. A 2% decrease would be 2% of 4%, or 0.08 percentage points, leaving 3.92%.

  • Unit for Slope ($\boldsymbol{\hat{\beta}}$): Percentage points.

  • Unit for Change in $\boldsymbol{\hat{y}}$ ($\boldsymbol{\Delta \hat{y}}$): Percentage points.

    • Recap: Anytime we discuss a change in percentage, it must be expressed in percentage points to avoid confusion with a percentage of a percentage.
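
The percentage versus percentage-point distinction can be checked numerically. This is an illustrative sketch using the 4% to 2% example from the notes; the variable names are hypothetical.

```r
p_before <- 0.04   # 4%
p_after  <- 0.02   # 2%

# Absolute change, expressed in percentage points
pp_change <- (p_after - p_before) * 100                 # -2 percentage points

# Relative change, expressed as a percent of the starting value
rel_change <- (p_after - p_before) / p_before * 100     # -50 percent

# What an actual 2% (relative) decrease from 4% would leave
p_two_pct_lower <- p_before * (1 - 0.02)                # 0.0392, i.e. 3.92%
```

The same change is a 2 percentage point drop and a 50% relative drop, which is why the unit has to be stated explicitly.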

Scatterplot for Binary Outcomes
  • Appearance: With a binary outcome (0 or 1) on the y-axis and a continuous predictor (midterm) on the x-axis, the scatterplot looks like two horizontal lines of dots at $y=0$ and $y=1$. It does not form a 'cloud' that a single linear line easily fits.

  • Interpretation of Dots: Each dot represents an individual student's midterm score and whether they got an 'A'.

  • Relationship (Positive/Negative):

    • The relationship appears positive: at higher midterm grades the dots are denser at $y=1$ (got an 'A'), while at lower midterm grades they are denser at $y=0$ (did not get an 'A').

    • Students who did not get an 'A' ($y=0$) show a much wider range of midterm scores, including many lower scores.

    • Students who got an 'A' ($y=1$) are typically concentrated in higher midterm score ranges, with fewer low scores.

  • Strength of Relationship: The relationship is stronger when the two groups ($y=0$ and $y=1$) are more distinct in their x-axis ranges. If their ranges overlap significantly (e.g., looking like parallel lines with lots of overlap across x-values), the relationship is weak. In this case, it's moderately strong because the 'A' group is fairly distinct from the 'no A' group in terms of typical midterm scores.

Correlation ($r$)
  • Calculation: The correlation between grade_a and midterm is 0.64.

  • Interpretation: A positive correlation of 0.64 indicates a moderate-to-strong positive linear relationship. Note that $r$ itself is not the share of variation explained; that share is $r^2$ ($0.64^2 \approx 0.41$, or about 41%). A correlation of 0.64 is fairly high, particularly given the large range of midterm scores for students not getting an 'A' and the smaller range for those who did.

Fitting the Linear Model
  • Method: Using the lm() function (lowercase; R is case-sensitive), similar to non-binary outcomes: lm(data$grade_a ~ data$midterm).

  • Results:

    • Intercept ($\boldsymbol{\hat{\alpha}}$): $-1.34$

    • Slope ($\boldsymbol{\hat{\beta}}$): $0.02$

  • Fitted Line Formula: $\text{estimated grade\_a} = -1.34 + 0.02 \times \text{midterm}$
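
A minimal sketch of the fitting step in R. The class dataset is not reproduced in these notes, so the data frame below is simulated and every value in it is made up; only the lm() call mirrors the lecture. The fitted coefficients will therefore not match the values reported above.

```r
set.seed(252)   # arbitrary seed so the simulation is reproducible
n <- 200

# Simulated stand-in for the class data (hypothetical values)
midterm <- round(runif(n, min = 40, max = 100))
p_true  <- pmin(pmax(-1.34 + 0.02 * midterm, 0), 1)   # clip to a valid probability
grade_a <- rbinom(n, size = 1, prob = p_true)         # 1 = 'A' or 'A-', 0 = otherwise
data    <- data.frame(midterm, grade_a)

# Binary outcome regressed on a numeric predictor
fit <- lm(grade_a ~ midterm, data = data)
coefs <- coef(fit)   # named vector: "(Intercept)" and "midterm"
```

Passing data = data is equivalent here to writing lm(data$grade_a ~ data$midterm), and it keeps the coefficient names short.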

Interpreting Coefficients for Binary Outcomes
  • Intercept ($\boldsymbol{\hat{\alpha}} = -1.34$, i.e. -134%):

    • Mathematical interpretation: If a student scored 0 on the midterm, the predicted probability of them earning an 'A' is -134%.

    • Sensibility: This is nonsensical because probabilities cannot be negative. This occurs because $x=0$ (a midterm score of zero) is outside the observed range of midterm grades in the dataset (the minimum observed was closer to 40).

    • Unit: Percentage.

    • Despite being nonsensical in this context, the intercept is still mathematically necessary for the model and useful for calculations within the observed data range.

  • Slope ($\boldsymbol{\hat{\beta}} = 0.02$, i.e. 2 percentage points):

    • Interpretation: For every one-point increase in a student's midterm exam grade, the predicted probability of them earning an 'A' or 'A-' in the class increases by 2 percentage points, on average.

    • Unit: Percentage points.

    • Practical example: Scoring one point better on the midterm raises a student's predicted probability of getting an 'A' by 2 percentage points. Scoring one point lower reduces it by 2 percentage points.

Practical Predictions for Binary Outcomes
  • Predicting Probability of 'A' for a Midterm of 80:

    • $\text{Predicted Grade A} = -1.34 + (0.02 \times 80) = -1.34 + 1.6 = 0.26$

    • Interpretation: A student who scored 80 on the midterm has a 26% probability of earning an 'A' or 'A-' in the class. (Roughly 1 in 4 chance).

  • Predicting Probability of 'A' for a Midterm of 90:

    • $\text{Predicted Grade A} = -1.34 + (0.02 \times 90) = -1.34 + 1.8 = 0.46$

    • Interpretation: A student who scored 90 on the midterm has a 46% probability of earning an 'A' or 'A-' in the class. (Roughly 1 in 2 chance).

  • Predicting Change: An increase in midterm score of 10 points ($90 - 80$) is associated with an increase in the predicted probability of earning an 'A' or 'A-' by 20 percentage points on average ($0.02 \times 10 = 0.20$).

    • This is a substantial increase, but it's important to note that the midterm is only one factor contributing to a final 'A' grade.
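
The same predictions can be sketched in R by hard-coding the coefficients reported for the grade_a model. The helper name predict_prob_a is hypothetical.

```r
# Coefficients quoted in the notes for the grade_a model
alpha_hat <- -1.34
beta_hat  <- 0.02

# Hypothetical helper: predicted probability of an 'A' or 'A-'
predict_prob_a <- function(midterm) alpha_hat + beta_hat * midterm

p80 <- predict_prob_a(80)        # 0.26, i.e. 26%
p90 <- predict_prob_a(90)        # 0.46, i.e. 46%
pp_diff <- (p90 - p80) * 100     # 20 percentage points

p0 <- predict_prob_a(0)          # -1.34: not a valid probability
```

The last line shows why the intercept is not interpretable here: a straight line is not constrained to stay between 0 and 1 once the predictor moves outside the observed range.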

R-squared ($\boldsymbol{r^2}$) for Model Fit
  • Definition: R-squared is a measure of how well the regression model fits the observed data. It represents the proportion of the variance in the dependent variable (y) that is predictable from the independent variable (x).

  • Not related to R software: The name is similar but it's a statistical concept, not tied to the R programming language.

  • Formula: For simple linear regression, $r^2 = (\text{correlation})^2$.

  • Interpretation of Values:

    • $r^2 = 1$: Indicates a perfect fit. The model explains 100% of the variation in y. There is no error between predicted y and actual y (e.g., when correlation = 1 or -1).

    • $r^2 = 0$: Indicates no fit. The model explains 0% of the variation in y. There is a large mismatch between predicted y and actual y (e.g., when correlation = 0).

  • Calculating $r^2$ for the 'Grade A' Model:

    • Correlation was 0.64.

    • $r^2 = (0.64)^2 \approx 0.4096 \approx 0.41$

    • Interpretation: The model using midterm grades explains about 41% of the variation in whether a student gets an 'A' or 'A-'. This means midterm grades are a significant predictor, but they do not explain the majority of the variation, implying other factors are also very important.

  • Comparing $r^2$ with Non-Binary Outcome Model:

    • The previous model (midterm predicting final exam grade) had an $r^2 \approx 0.51$.

    • Important Caveat: You should only compare R-squared values to other models using the same outcome variable. Different outcomes have different inherent levels of predictability, so comparing $r^2$ across different outcome types can be misleading.

  • Missing Data: When calculating correlations or $r^2$ with missing data, specify an option that excludes the missing values, such as na.rm = TRUE (for functions like mean()) or use = "complete.obs" (for cor()). This leaves them out of the calculation without permanently removing them from the dataset.
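
A small R sketch tying the last two points together: $r^2$ as the squared correlation, and how missing values are excluded. The vectors are made-up toy data with one missing midterm score, so the numbers will not match the class results.

```r
# Hypothetical toy data; NA marks a missing midterm score
midterm <- c(55, 62, 70, NA, 81, 88, 93, 97)
grade_a <- c(0, 0, 0, 0, 0, 1, 1, 1)

avg_mid <- mean(midterm, na.rm = TRUE)              # mean() skips NAs via na.rm

# cor() has no na.rm argument; it uses `use` to drop incomplete pairs
r <- cor(grade_a, midterm, use = "complete.obs")
r_squared <- r^2

# lm() drops rows with missing values by default (na.action = na.omit)
fit <- lm(grade_a ~ midterm)
same <- isTRUE(all.equal(r_squared, summary(fit)$r.squared))
```

With a single predictor, the squared correlation and the R-squared reported by summary() are the same quantity, so `same` serves as a quick consistency check. The original vectors still contain the NA afterwards.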

Conclusion
  • For predicting outcomes using linear models, aim for predictor (x) variables that are highly correlated with the outcome (y).

  • A higher correlation leads to a higher $r^2$, indicating a better model fit and more reliable predictions from a simple linear model.

  • Next lecture will focus on causal effects using observational studies.