Logistic Regression
Introduction to Logistic Regression
Logistic regression is used when the dependent variable (Y) is binary (e.g., Yes/No, Success/Failure).
The dependent variable can be recoded into a binary format such that Y can be 0 or 1.
The objective of logistic regression is to build a model that estimates the probability that the event occurs (Y = 1).
Unlike linear regression, a logistic regression model is non-linear: the predicted probability is a curved function of the predictors.
As a result, its coefficients are interpreted differently from those of a linear regression model.
Mathematical Foundation of Logistic Regression
Logistic regression has a more involved mathematical form than a linear model because the predicted response follows an S-shaped curve rather than a straight line.
The curve is bounded, so predicted probabilities range from 0 to 1 (0% to 100%).
0 represents "no chance" of the event occurring.
1 represents "100% certainty" of the event occurring.
After fitting the model with one or more predictor variables (X), a cutoff value is chosen to turn predicted probabilities into classes.
A common cutoff value is 0.5.
Predictions at or above this cutoff are classified as Y=1 (event occurring); predictions below it are classified as Y=0.
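The S-shaped curve and the cutoff rule can be sketched in a few lines of R. This is a minimal illustration, not part of the original analysis: the log-odds values in `eta` are made up, and `plogis()` is base R's logistic function.

```r
# Hypothetical linear-predictor (log-odds) values; not taken from any dataset
eta <- c(-4, -1, 0, 1, 4)

# plogis(eta) computes 1 / (1 + exp(-eta)), which always lies between 0 and 1
p_hat <- plogis(eta)

# Apply the common 0.5 cutoff: at or above -> 1, below -> 0
cutoff <- 0.5
y_pred <- ifelse(p_hat >= cutoff, 1, 0)

print(round(p_hat, 3))
print(y_pred)
```

A different cutoff may be preferable when the two kinds of misclassification have unequal costs; 0.5 is only the conventional default.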
Logistic Regression Equation and Odds
The predicted value (Ŷ) is a probability, so it must remain between 0 and 1.
Consequently, the logistic function is used to forecast probabilities:
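In its standard form, assuming a single predictor X with intercept B₀ and slope B₁ (the symbol B₀ is an assumption here; only B₁ is named in these notes), the logistic function can be written as:

```latex
\hat{Y} = \frac{e^{B_0 + B_1 X}}{1 + e^{B_0 + B_1 X}} = \frac{1}{1 + e^{-(B_0 + B_1 X)}}
```

With several predictors, the exponent becomes B₀ + B₁X₁ + B₂X₂ + ... in the same way.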
The odds of event occurrence (Y) can be calculated as follows:
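Using the standard definition of odds as the probability of the event divided by the probability of no event, and assuming the single-predictor logistic model with intercept B₀ and slope B₁, the odds and log-odds take the form:

```latex
\text{odds} = \frac{\hat{Y}}{1 - \hat{Y}} = e^{B_0 + B_1 X}
\qquad\Longleftrightarrow\qquad
\ln\!\left(\frac{\hat{Y}}{1 - \hat{Y}}\right) = B_0 + B_1 X
```

The second form shows that the model is linear on the log-odds scale, even though it is non-linear on the probability scale.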
In linear regression, the slope (B₁) represents the change in Y for a one-unit increase in X. In logistic regression, B₁ is instead the change in the log-odds of Y occurring for a one-unit increase in X; equivalently, the odds are multiplied by e^B₁.
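A quick numeric check can make the role of the slope concrete. The sketch below uses made-up coefficients `b0` and `b1` (hypothetical, chosen only for illustration) and shows that the ratio of odds for a one-unit increase in X is the same at every starting value of X.

```r
# Hypothetical coefficients for illustration only (not from these notes)
b0 <- -1.5
b1 <- 0.8

# Odds implied by a logistic model at a given x: exp(b0 + b1 * x)
odds_at <- function(x) exp(b0 + b1 * x)

# Ratio of odds for a one-unit increase in x, at several starting points
x <- c(0, 1, 5)
odds_ratio <- odds_at(x + 1) / odds_at(x)

print(odds_ratio)  # identical at every x
print(exp(b1))     # matches the ratios above
```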
Application of Logistic Regression: Case Study on Menarche
Dataset of teenage girls recording whether their menstrual periods have begun (Menarche).
Response Variable:
Y=1 indicates menstruating.
Y=0 indicates not menstruating.
Predictor Variables:
BMI
Body Fat (measured in mm via skinfold thickness)
Participation in Sports (1 for Yes, 0 for No)
Example of the first 6 observations:
1: Menarche = 1, BMI = 19.3, Fat = 23.9, Sports = 1
2: Menarche = 1, BMI = 23.0, Fat = 28.8, Sports = 1
3: Menarche = 1, BMI = 27.8, Fat = 32.4, Sports = 0
4: Menarche = 1, BMI = 20.9, Fat = 25.8, Sports = 0
5: Menarche = 0, BMI = 20.4, Fat = 22.5, Sports = 0
6: Menarche = 1, BMI = 20.4, Fat = 22.1, Sports = 0
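For readers who want to experiment, the six rows listed above can be entered directly as a data frame. The name `menarche_head` is hypothetical, and six rows are far too few to fit the model; the full dataset used in the following sections is not reproduced here.

```r
# First six observations as listed above (the full dataset has more rows)
menarche_head <- data.frame(
  Menarche = c(1, 1, 1, 1, 0, 1),
  BMI      = c(19.3, 23.0, 27.8, 20.9, 20.4, 20.4),
  Fat      = c(23.9, 28.8, 32.4, 25.8, 22.5, 22.1),
  Sports   = c(1, 1, 0, 0, 0, 0)
)

# Inspect the column types and values
str(menarche_head)

# Proportion of these six girls with Menarche = 1
print(mean(menarche_head$Menarche))
```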
Model Fitting and Interpretation of Results
Logistic Regression Model Creation:
Code:
    # R: fit a logistic regression of Menarche on all other columns
    model <- glm(Menarche ~ ., data = Menarche, family = "binomial")
    summary(model)
Sample Output Summary:
Call: glm(formula = Menarche ~ ., family = "binomial", data = Menarche)
Coefficients:
Intercept: Estimate = -2.64967, Std. Error = 1.42254, z value = -1.863, Pr(>|z|) = 0.06251
BMI: Estimate = 0.39786, Std. Error = 0.13420, z value = 2.965, Pr(>|z|) = 0.00303 (significant)
Fat: Estimate = -0.17985, Std. Error = 0.07174, z value = -2.507, Pr(>|z|) = 0.01218 (significant)
Sports: Estimate = -0.36924, Std. Error = 0.72512, z value = -0.509, Pr(>|z|) = 0.61060 (not significant)
Significance codes: ‘***’ p < 0.001, ‘**’ p < 0.01, ‘*’ p < 0.05, ‘.’ p < 0.1, ‘ ’ p ≥ 0.1 (not significant).
Residuals and Model Comparison
Null deviance: 120.09 on 91 degrees of freedom
Residual deviance: 108.61 on 88 degrees of freedom
AIC = 116.61
The Sports variable has a p-value of 0.61 (> 0.05), so it is not a significant predictor and the model can be refit without it.
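The deviance and AIC figures above are linked by simple arithmetic, which the sketch below reproduces from the reported numbers. It assumes an ungrouped 0/1 response, for which the residual deviance equals -2 times the log-likelihood, and it adds a likelihood-ratio test against the intercept-only model that is not part of the original output.

```r
# Values copied from the summary output above
null_dev  <- 120.09; null_df  <- 91
resid_dev <- 108.61; resid_df <- 88
k <- 4  # estimated parameters: intercept, BMI, Fat, Sports

# For a 0/1 response, AIC = residual deviance + 2 * (number of parameters)
aic <- resid_dev + 2 * k
print(aic)

# Likelihood-ratio test: does the fitted model improve on the intercept-only model?
lr_stat <- null_dev - resid_dev
lr_df   <- null_df - resid_df
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
print(c(statistic = lr_stat, df = lr_df, p = p_value))
```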
Re-fitted Model Without Non-Significant Variable
Logistic regression without Sports:
Code:
    # R: refit the model with BMI and Fat only
    model <- glm(Menarche ~ BMI + Fat, data = Menarche, family = "binomial")
    summary(model)
Output Summary:
Intercept: Estimate = -2.73099, Std. Error = 1.40712, z value = -1.941, Pr(>|z|) = 0.05228
BMI: Estimate = 0.38946, Std. Error = 0.13201, z value = 2.950, Pr(>|z|) = 0.00318 (significant)
Fat: Estimate = -0.17208, Std. Error = 0.06945, z value = -2.478, Pr(>|z|) = 0.01322 (significant)
Null deviance remains at 120.09 on 91 degrees of freedom
Residual deviance: 108.86 on 89 degrees of freedom
AIC = 114.86, lower than the 116.61 of the model that included Sports, indicating a better balance of fit and complexity.
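The two fits can also be compared with a likelihood-ratio test on the difference in residual deviance. This is a sketch using only the reported numbers; with the fitted objects available, `anova(reduced, full, test = "Chisq")` would give the same comparison (the names `reduced` and `full` are hypothetical).

```r
# Residual deviance and degrees of freedom reported for each fit
full_dev    <- 108.61; full_df    <- 88  # BMI + Fat + Sports
reduced_dev <- 108.86; reduced_df <- 89  # BMI + Fat

# Dropping Sports raises the deviance only slightly
lr_stat <- reduced_dev - full_dev
lr_df   <- reduced_df - full_df
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
print(c(statistic = lr_stat, df = lr_df, p = p_value))

# AIC for each model (0/1 response): residual deviance + 2 * parameters
aic_full    <- full_dev + 2 * 4
aic_reduced <- reduced_dev + 2 * 3
print(c(full = aic_full, reduced = aic_reduced))
```

A large p-value here agrees with the Wald test for Sports: removing the variable does not significantly worsen the fit.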
Coefficient Interpretation
Interpretation of logistic regression coefficients:
Holding Fat constant, each 1-unit increase in BMI multiplies the odds of Menarche by e^0.38946 ≈ 1.476, an increase of 47.6% (1.476 - 1 = 0.476).
Holding BMI constant, each 1 mm increase in Fat multiplies the odds of Menarche by e^(-0.17208) ≈ 0.842, a decrease of 15.8% (1 - 0.842 = 0.158).
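The percentages above come from exponentiating the coefficients of the refitted model. A minimal sketch of that arithmetic, using the estimates reported in the output summary:

```r
# Coefficient estimates from the refitted model (BMI + Fat)
coefs <- c(BMI = 0.38946, Fat = -0.17208)

# Exponentiating a coefficient gives the odds ratio for a one-unit increase
odds_ratios <- exp(coefs)

# Convert each odds ratio to a percent change in the odds
pct_change <- 100 * (odds_ratios - 1)

print(round(odds_ratios, 3))
print(round(pct_change, 1))
```

With the fitted object in hand, `exp(coef(model))` returns the same odds ratios without retyping the estimates.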
Making Predictions Using the Model
Prediction for new data based on BMI and Fat levels:
Code:
    # R: two new subjects; type = "response" returns probabilities, not log-odds
    newdata <- data.frame(BMI = c(20, 26), Fat = c(30, 22))
    predict(model, newdata = newdata, type = "response")
Expected Output for Predictions:
Each outcome reflects the predicted probability based on the supplied BMI and Fat levels.
Example hand calculation for subject with BMI=26 and Fat=22:
Calculate: log-odds = -2.73099 + 0.38946(26) - 0.17208(22) = 3.609, so Ŷ = e^3.609 / (1 + e^3.609).
Resulting probability: 0.9736 (about 97.4%).
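The hand calculation can be checked in R by plugging the reported coefficients of the refitted model into the logistic function. This sketch uses the rounded estimates from the output summary, so results may differ from `predict()` in the last decimal places; for the second subject the probability comes out at about 0.97.

```r
# Rounded coefficient estimates from the refitted model (BMI + Fat)
b <- c(intercept = -2.73099, BMI = 0.38946, Fat = -0.17208)

# The same two subjects used in the prediction example
newdata <- data.frame(BMI = c(20, 26), Fat = c(30, 22))

# Step 1: linear predictor (log-odds) for each subject
log_odds <- b[["intercept"]] + b[["BMI"]] * newdata$BMI + b[["Fat"]] * newdata$Fat

# Step 2: logistic function turns log-odds into a probability
prob <- exp(log_odds) / (1 + exp(log_odds))

print(round(log_odds, 3))
print(round(prob, 3))
```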
Search Algorithms for Logistic Models
Logistic regression can employ stepwise algorithms for model selection when numerous predictors exist.
Reference to NHANES dataset for an analysis of Diabetes (Yes/No):
Code:
    # R: load the packages and the NHANES data
    library(tidyverse)
    library(NHANES)
    data(NHANES)
    # Keep the outcome (Diabetes) and the candidate predictors
    NHdata <- NHANES %>%
      dplyr::select(BMI, Age, Gender, MaritalStatus, HomeOwn, Pulse, BPSysAve,
                    BPDiaAve, TotChol, Diabetes, SleepTrouble, PhysActive, HardDrugs)
    NHdata2 <- na.omit(NHdata)  # Removes rows with any missing values
Stepwise Algorithm Implementation
Full Model Creation:
    # R: fit the full model, then run stepwise selection by AIC in both directions
    full <- glm(Diabetes ~ ., data = NHdata2, family = "binomial")
    stepwise <- step(full, direction = "both", trace = 0)
    summary(stepwise)
The results include significant predictors such as BMI, Age, and Gender, among others, each with an estimate and a z-value indicating the strength of its association with Diabetes.
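The stepwise workflow can be tried without installing NHANES. The sketch below runs the same `glm()` and `step()` calls on simulated stand-in data; the data frame `sim`, its columns, and the true coefficients are all hypothetical and chosen only so that two predictors matter and one does not.

```r
set.seed(1)
n <- 500

# Simulated stand-in data (hypothetical; not NHANES)
sim <- data.frame(
  x1    = rnorm(n),  # truly related to the outcome
  x2    = rnorm(n),  # truly related to the outcome
  noise = rnorm(n)   # unrelated to the outcome
)
sim$y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * sim$x1 - 0.8 * sim$x2))

# Full model with every predictor, then AIC-based stepwise selection
full <- glm(y ~ ., data = sim, family = "binomial")
stepwise <- step(full, direction = "both", trace = 0)

# Inspect which terms were kept and their estimates
print(formula(stepwise))
print(summary(stepwise)$coefficients)
```

`step()` selects by AIC rather than by p-values, so a retained term is not guaranteed to be significant at the 0.05 level.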
Predictions from the Final Model
Model Simplification:
Non-significant predictors can be eliminated iteratively.
Remaining significant predictors demonstrate the effects on Diabetes.
Interpretation of Coefficients for Diabetes Prediction
Applying the exponential function to coefficients can yield odds ratios:
E.g., the odds ratio for HardDrugs indicates a 38.7% increase in the odds of Diabetes, while each unit increase in Total Cholesterol is associated with a 19.6% decrease in the odds.
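Odds ratios can be read straight off a fitted model object. The sketch below uses simulated data (the data frame `sim` and the columns `exposed` and `marker` are hypothetical stand-ins, not NHANES variables) and adds Wald confidence intervals via `confint.default()`, which the original notes do not show.

```r
set.seed(2)
n <- 400

# Simulated stand-in data (hypothetical; not NHANES)
sim <- data.frame(
  exposed = rbinom(n, size = 1, prob = 0.3),  # binary predictor
  marker  = rnorm(n)                          # continuous predictor
)
sim$y <- rbinom(n, size = 1, prob = plogis(-1 + 0.3 * sim$exposed - 0.2 * sim$marker))

fit <- glm(y ~ exposed + marker, data = sim, family = "binomial")

# Odds ratios with Wald 95% confidence intervals (exponentiated coefficients)
or_table <- cbind(OR = exp(coef(fit)), exp(confint.default(fit)))

# Percent change in the odds per one-unit increase in each predictor
pct_change <- 100 * (or_table[, "OR"] - 1)

print(round(or_table, 3))
print(round(pct_change, 1))
```

An interval that contains 1 corresponds to a predictor whose effect on the odds is not distinguishable from zero at that confidence level.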
Practice Problems for Understanding and Application
Task using NHANES dataset regarding SleepTrouble:
Implement stepwise to ascertain the predictive model.
Identify which predictor increases the odds of having Sleep Trouble (e.g., HardDrugsYes indicates a 122.9% increase in odds).
Determine which factor most reduces the odds of having Sleep Trouble (e.g., Gendermale indicates a reduction of approximately 40% in odds).
Provide summary statistics and model interpretations for educational purposes.
Reminders
A quiz covering Weeks 9 and 10 material has been announced. Students may use an 8.5 x 11-inch sheet of notes and a calculator during the quiz, but not cell phones.