Logistic Regression

Introduction to Logistic Regression

  • Logistic regression is used when the dependent variable (Y) is binary (e.g., Yes/No, Success/Failure).

  • The dependent variable can be recoded into a binary format such that Y can be 0 or 1.

  • The objective of logistic regression is to create a model that forecasts the probability of Y occurring.

  • Unlike linear regression, logistic regression models have a non-linear structure.

  • The interpretation of this model is distinct from linear regression models.

Mathematical Foundation of Logistic Regression

  • Logistic regression involves a more complex mathematical framework than a linear model due to its S-shaped curve.

  • The response of logistic regression resembles an "S" shaped curve, indicating probabilities ranging from 0% to 100% (or 0 to 1).

    • 0 represents "no chance" of the event occurring.

    • 1 represents "100% certainty" of the event occurring.

  • After fitting the model with one or more predictor variables (X), a cutoff value is chosen to turn predicted probabilities into classes.

    • A common cutoff value is 0.5.

    • Predictions at or above this cutoff are classified as Y=1 (event occurring); predictions below it are classified as Y=0.
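
The S-shaped curve and the cutoff rule can be sketched in a few lines of R. This is a minimal illustration, not part of the course code: the function name `logistic` and the input values in `z` are assumptions chosen for the example.

```r
# Logistic (sigmoid) function: maps any real number into the interval (0, 1).
# R's built-in plogis() computes the same quantity.
logistic <- function(z) 1 / (1 + exp(-z))

# Hypothetical values of the linear predictor, chosen only for illustration.
z <- c(-4, -1, 0, 1, 4)
p <- logistic(z)

# Apply the common 0.5 cutoff: at or above the cutoff is classified as Y = 1.
cutoff <- 0.5
y_hat <- ifelse(p >= cutoff, 1, 0)

print(round(p, 3))
print(y_hat)
```

Moving the cutoff above or below 0.5 trades one kind of misclassification for the other, which is why 0.5 is described as common rather than required.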

Logistic Regression Equation and Odds

  • The predicted value (Ŷ) is a probability, so it must remain between 0 and 1.

  • Consequently, the logistic function is used to forecast probabilities:
    P(Y=1) = \frac{e^{(b_0 + b_1 x)}}{1 + e^{(b_0 + b_1 x)}}

  • The odds of event occurrence (Y) can be calculated as follows:
    \text{Odds of } Y = \frac{P(Y=1)}{P(Y=0)} = \frac{P(Y=1)}{1 - P(Y=1)}

  • In linear regression, the slope (b₁) represents the change in Y for a one-unit increase in X. In logistic regression, b₁ is instead the change in the log-odds of Y=1 for a one-unit increase in X, so the odds of Y occurring are multiplied by e^{b₁}.
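
The link between the slope and the odds follows from the logistic function and the odds formula in this section. A short derivation, using the same symbols b_0, b_1, and x:

```latex
\begin{aligned}
\text{Odds}(x) &= \frac{P(Y=1)}{1 - P(Y=1)} = e^{b_0 + b_1 x} \\
\ln \text{Odds}(x) &= b_0 + b_1 x \\
\frac{\text{Odds}(x+1)}{\text{Odds}(x)} &= \frac{e^{b_0 + b_1 (x+1)}}{e^{b_0 + b_1 x}} = e^{b_1}
\end{aligned}
```

The log-odds are linear in x, and each one-unit increase in x multiplies the odds by the constant factor e^{b_1}.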

Application of Logistic Regression: Case Study on Menarche

  • Dataset involving teenage girls regarding whether their menstrual periods have begun (Menarche).

  • Response Variable:

    • Y=1 indicates menstruating.

    • Y=0 indicates not menstruating.

  • Predictor Variables:

    • BMI

    • Body Fat (measured in mm via skinfold thickness)

    • Participation in Sports (1 for Yes, 0 for No)

  • Example of the first 6 observations:

    • 1: Menarche = 1, BMI = 19.3, Fat = 23.9, Sports = 1

    • 2: Menarche = 1, BMI = 23.0, Fat = 28.8, Sports = 1

    • 3: Menarche = 1, BMI = 27.8, Fat = 32.4, Sports = 0

    • 4: Menarche = 1, BMI = 20.9, Fat = 25.8, Sports = 0

    • 5: Menarche = 0, BMI = 20.4, Fat = 22.5, Sports = 0

    • 6: Menarche = 1, BMI = 20.4, Fat = 22.1, Sports = 0
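
For experimenting with the layout of the data, the six rows listed above can be entered by hand. The name `menarche_head` is hypothetical; the full dataset used for the fitted models in these notes is not reproduced here, and six rows are far too few to fit a meaningful model.

```r
# The first six observations from the notes, entered manually.
menarche_head <- data.frame(
  Menarche = c(1, 1, 1, 1, 0, 1),                    # 1 = menstruating, 0 = not
  BMI      = c(19.3, 23.0, 27.8, 20.9, 20.4, 20.4),
  Fat      = c(23.9, 28.8, 32.4, 25.8, 22.5, 22.1),  # skinfold thickness in mm
  Sports   = c(1, 1, 0, 0, 0, 0)                     # 1 = Yes, 0 = No
)

str(menarche_head)
table(menarche_head$Menarche)
```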

Model Fitting and Interpretation of Results

  • Logistic Regression Model Creation:

    • Code:
      model <- glm(Menarche ~ ., data = Menarche, family = "binomial")
      summary(model)

  • Sample Output Summary:

    • Call: glm(formula = Menarche ~ ., family = "binomial", data = Menarche)

    • Coefficients:

      • Intercept: Estimate = -2.64967, Std. Error = 1.42254, z value = -1.863, Pr(>|z|) = 0.06251

      • BMI: Estimate = 0.39786, Std. Error = 0.13420, z value = 2.965, Pr(>|z|) = 0.00303 (significant)

      • Fat: Estimate = -0.17985, Std. Error = 0.07174, z value = -2.507, Pr(>|z|) = 0.01218 (significant)

      • Sports: Estimate = -0.36924, Std. Error = 0.72512, z value = -0.509, Pr(>|z|) = 0.61060 (not significant)

    • Significance codes: ‘***’ (p < 0.001), ‘**’ (p < 0.01), ‘*’ (p < 0.05), ‘.’ (p < 0.1), and ‘ ’ (p ≥ 0.1, non-significant).

Residuals and Model Comparison

  • Null deviance: 120.09 on 91 degrees of freedom

  • Residual deviance: 108.61 on 88 degrees of freedom

  • AIC = 116.61

  • The Sports variable has a p-value of 0.61 (> 0.05); thus, it is not a significant predictor and the model can be re-estimated without it.
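
The deviance and AIC figures above can be checked by hand. The sketch below assumes an ungrouped 0/1 response, for which AIC equals the residual deviance plus twice the number of estimated coefficients. That assumption matches the reported numbers. The variable names are illustrative.

```r
# Values copied from the summary output above.
null_dev  <- 120.09; null_df  <- 91
resid_dev <- 108.61; resid_df <- 88

# Likelihood-ratio test of the full model against the intercept-only model:
# the drop in deviance is compared with a chi-square distribution.
lr_stat <- null_dev - resid_dev   # drop in deviance
lr_df   <- null_df - resid_df     # number of predictors added
lr_p    <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

# AIC check: residual deviance + 2 * (number of coefficients).
k   <- 4                          # intercept, BMI, Fat, Sports
aic <- resid_dev + 2 * k

print(c(lr_stat = lr_stat, lr_df = lr_df, lr_p = lr_p, aic = aic))
```

A small p-value here says the three predictors together improve on the intercept-only model, which is a separate question from whether each individual predictor is needed.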

Re-fitted Model Without Non-Significant Variable

  • Logistic regression without Sports:

    • Code:
      model <- glm(Menarche ~ BMI + Fat, data = Menarche, family = "binomial")
      summary(model)

    • Output Summary:

    • Intercept: Estimate = -2.73099, Std. Error = 1.40712, z value = -1.941, Pr(>|z|) = 0.05228

    • BMI: Estimate = 0.38946, Std. Error = 0.13201, z value = 2.950, Pr(>|z|) = 0.00318 (significant)

    • Fat: Estimate = -0.17208, Std. Error = 0.06945, z value = -2.478, Pr(>|z|) = 0.01322 (significant)

    • Null deviance remains at 120.09 on 91 degrees of freedom

    • Residual deviance: 108.86 on 89 degrees of freedom

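    • AIC falls from 116.61 to 114.86. The residual deviance barely changes (108.61 to 108.86), so the lower AIC indicates a better balance of fit and model complexity, and the smaller model is preferred.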
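
Because the model without Sports is nested in the model with Sports, the two can also be compared through the change in residual deviance. This is a sketch built from the reported deviances only. With the fitted objects available, `anova(reduced, full, test = "Chisq")` performs the same comparison, where `reduced` and `full` are hypothetical names for the two fits.

```r
# Residual deviances and degrees of freedom reported for the two fits.
dev_full    <- 108.61; df_full    <- 88   # BMI + Fat + Sports
dev_reduced <- 108.86; df_reduced <- 89   # BMI + Fat

# Likelihood-ratio test for dropping Sports.
lr_stat <- dev_reduced - dev_full
lr_df   <- df_reduced - df_full
lr_p    <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

print(round(c(lr_stat = lr_stat, lr_df = lr_df, lr_p = lr_p), 3))
```

The large p-value agrees with the Wald test for Sports in the first summary: removing the variable costs almost nothing in fit.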

Coefficient Interpretation

  • Interpretation of logistic regression coefficients:

    • Ignoring the intercept, each 1-unit increase in BMI multiplies the odds of Menarche by e^{0.38946} ≈ 1.476, a 47.6% increase (1.476 - 1 = 0.476), holding Fat constant.

    • For Fat (measured in mm), each 1 mm increase multiplies the odds of Menarche by e^{-0.17208} ≈ 0.842, a 15.8% decrease (1 - 0.842 = 0.158), holding BMI constant.
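
The odds ratios quoted above come from exponentiating the coefficients of the re-fitted model. A minimal sketch using the reported estimates directly; with the fitted object in hand, `exp(coef(model))` gives the same values.

```r
# Estimates copied from the re-fitted model summary.
coefs <- c(BMI = 0.38946, Fat = -0.17208)

# Odds ratio for a one-unit increase in each predictor.
odds_ratios <- exp(coefs)

# Percent change in the odds: positive is an increase, negative a decrease.
pct_change <- 100 * (odds_ratios - 1)

print(round(odds_ratios, 3))
print(round(pct_change, 1))
```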

Making Predictions Using the Model

  • Prediction for new data based on BMI and Fat levels:

    • Code:
      newdata <- data.frame(BMI = c(20, 26), Fat = c(30, 22))
      predict.glm(model, newdata = newdata, type = "response")

  • Expected Output for Predictions:

    • Each outcome reflects the predicted probability based on the supplied BMI and Fat levels.

  • Example hand calculation for subject with BMI=26 and Fat=22:

    • Calculate:
      P(Y=1) = \frac{e^{-2.73 + 0.389(26) - 0.172(22)}}{1 + e^{-2.73 + 0.389(26) - 0.172(22)}}

    • Resulting probability: 0.973 (or 97.3%).
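
The hand calculation can be reproduced in R without the fitted object. The sketch uses the same rounded coefficients as the calculation above, so results may differ from `predict` in the last decimal place. The vector name `b` is illustrative.

```r
# Rounded coefficients from the re-fitted model, as used in the hand calculation.
b <- c(intercept = -2.73, BMI = 0.389, Fat = -0.172)

# The two new subjects from the prediction example.
newdata <- data.frame(BMI = c(20, 26), Fat = c(30, 22))

# Linear predictor (log-odds), then the logistic transform to a probability.
eta <- b[["intercept"]] + b[["BMI"]] * newdata$BMI + b[["Fat"]] * newdata$Fat
p   <- plogis(eta)   # plogis(x) = exp(x) / (1 + exp(x))

print(round(p, 3))
```

With a 0.5 cutoff, the second subject would be classified as Y=1, while the first falls just below the cutoff.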

Search Algorithms for Logistic Models

  • Logistic regression can employ stepwise algorithms for model selection when numerous predictors exist.

  • Reference to NHANES dataset for an analysis of Diabetes (Yes/No):

    • Code:
      library(tidyverse)
      library(NHANES)
      data(NHANES)
      NHdata <- NHANES %>%
        dplyr::select(BMI, Age, Gender, MaritalStatus, HomeOwn, Pulse, BPSysAve,
                      BPDiaAve, TotChol, Diabetes, SleepTrouble, PhysActive, HardDrugs)
      NHdata2 <- na.omit(NHdata)  # Removes rows that contain missing values
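
It is worth checking how many observations `na.omit` discards, since it drops every row that has a missing value in any selected column. A small sketch on a made-up data frame; `toy` and its values are invented for illustration and are not NHANES data.

```r
# Toy data frame with missing values in two different columns.
toy <- data.frame(
  BMI      = c(22.1, NA, 30.4, 27.5),
  Pulse    = c(70, 64, NA, 82),
  Diabetes = factor(c("No", "No", "Yes", "No"))
)

# na.omit() keeps only rows with no missing values; all columns are retained.
toy_complete <- na.omit(toy)

print(dim(toy))           # rows and columns before
print(dim(toy_complete))  # rows and columns after
```

Comparing `nrow` before and after on the real data shows how much of the sample the later models are actually fitted to.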

Stepwise Algorithm Implementation

  • Full Model Creation:
    full <- glm(Diabetes ~ ., data = NHdata2, family = "binomial")
    stepwise <- step(full, direction = "both", trace = 0)
    summary(stepwise)

  • The summary lists the predictors retained by the stepwise search, such as BMI, Age, and Gender, with their estimates, z-values, and p-values. Note that step() selects terms by AIC rather than by p-value, so a retained predictor is not necessarily significant.
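
The same stepwise workflow can be tried on simulated data, which runs without installing the NHANES package. Everything in this sketch is an assumption made for illustration: the variable names, the true coefficients, and the sample size. The variable `noise` is generated with no relationship to the outcome.

```r
set.seed(1)
n <- 500

# Simulated predictors: x1 and x2 drive the outcome, noise does not.
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
sim$y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.5 * sim$x1 - 1.0 * sim$x2))

# Full model, then stepwise search in both directions (selection by AIC).
full     <- glm(y ~ ., data = sim, family = "binomial")
stepwise <- step(full, direction = "both", trace = 0)

print(formula(stepwise))
print(c(AIC_full = AIC(full), AIC_stepwise = AIC(stepwise)))
```

The selected model never has a higher AIC than the starting model, because `step` only accepts a change when it lowers AIC.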

Predictions from the Final Model

  • Model Simplification:

    • Non-significant predictors can be eliminated iteratively.

    • The remaining significant predictors describe how each factor changes the odds of Diabetes.

Interpretation of Coefficients for Diabetes Prediction

  • Applying the exponential function to coefficients can yield odds ratios:

    • E.g., the odds ratio for HardDrugs indicates a 38.7% increase in the odds of Diabetes, while each unit increase in Total Cholesterol is associated with a 19.6% decrease in the odds.
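
Percent changes in the odds and model coefficients are two views of the same number, and it can help to convert in both directions. The helper names below are hypothetical, and the coefficients shown are only those implied by the percentages quoted above, not values read from the model output.

```r
# Convert a percent change in the odds to a coefficient, and back.
pct_to_coef <- function(pct) log(1 + pct / 100)
coef_to_pct <- function(b) 100 * (exp(b) - 1)

# Coefficients implied by the quoted percent changes (illustrative only).
implied <- pct_to_coef(c(HardDrugs = 38.7, TotChol = -19.6))

print(round(implied, 3))
print(round(coef_to_pct(implied), 1))  # recovers the original percentages
```

These figures describe changes in the odds, which are not the same as changes in probability: the effect on the predicted probability depends on the values of the other predictors.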

Practice Problems for Understanding and Application

  • Task using NHANES dataset regarding SleepTrouble:

    1. Implement stepwise to ascertain the predictive model.

    2. Identify which predictor positively impacts having Sleep Trouble (e.g. HardDrugsYes indicates a 122.9% increase in odds).

    3. Determine which factor most reduces the odds of having Sleep Trouble (Gendermale indicates a reduction of approximately 40% in the odds).

  • Provide summary statistics and model interpretations for educational purposes.

Reminders

  • A quiz covering Weeks 9 and 10 material has been announced. Students may use an 8.5 x 11-inch sheet of notes and a calculator, but not cell phones, during the quiz.