Logistic Regression

Logistic regression is used when the dependent variable (Y) is binary (e.g., Yes/No, Success/Failure).
The dependent variable can be recoded into a binary format such that Y can be 0 or 1.
The objective of logistic regression is to build a model that estimates the probability that the event occurs, P(Y=1).
Unlike linear regression, logistic regression models have a non-linear structure.
The interpretation of this model is distinct from linear regression models.

The logistic model is mathematically more involved than a linear model because its response follows an S-shaped curve rather than a straight line.
The curve gives probabilities ranging from 0 to 1 (0% to 100%):
- 0 represents "no chance" of the event occurring.
- 1 represents "100% certainty" of the event occurring.
After fitting the model with one or more predictor variables (X), a cutoff value is chosen to turn predicted probabilities into classes.
- A common cutoff value is 0.5.
- Predictions at or above this cutoff are classified as Y=1 (event occurring); predictions below it are classified as Y=0.
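The cutoff rule above can be sketched in R. The probabilities in `p_hat` are made-up values for illustration, not output from any model in these notes.

```r
# Hypothetical predicted probabilities, for illustration only
p_hat <- c(0.12, 0.50, 0.73, 0.49)
cutoff <- 0.5

# At or above the cutoff -> class 1, otherwise class 0
y_pred <- ifelse(p_hat >= cutoff, 1, 0)
y_pred  # 0 1 1 0
```

Note that 0.50 is classified as 1 because the rule is "at or above" the cutoff.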

The predicted value (Ŷ) is a probability, so it must stay between 0 and 1.
Consequently, the logistic function is used to model the probability:
$P(Y=1) = \frac{e^{(b_0 + b_1 x)}}{1 + e^{(b_0 + b_1 x)}}$
The odds of event occurrence (Y) can be calculated as follows:
$\text{Odds of } Y = \frac{P(Y=1)}{P(Y=0)} = \frac{P(Y=1)}{1 - P(Y=1)}$
In linear regression, the slope ($b_1$) is the change in Y for a one-unit increase in X. In logistic regression, $b_1$ is the change in the log-odds of Y, so a one-unit increase in X multiplies the odds of Y by $e^{b_1}$.
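As a short worked derivation of why the slope acts on the odds: substituting the logistic function into the odds formula, with the same symbols $b_0$, $b_1$, and $x$ as above, gives

```latex
\begin{aligned}
\text{Odds}(x) &= \frac{P(Y=1)}{1 - P(Y=1)} = e^{b_0 + b_1 x} \\
\ln \text{Odds}(x) &= b_0 + b_1 x \\
\frac{\text{Odds}(x+1)}{\text{Odds}(x)} &= \frac{e^{b_0 + b_1 (x+1)}}{e^{b_0 + b_1 x}} = e^{b_1}
\end{aligned}
```

So the log-odds are linear in x, and $e^{b_1}$ is the odds ratio for a one-unit increase in x.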

Example: a dataset of teenage girls recording whether their menstrual periods have begun (menarche).
Response Variable:
- Y=1 indicates menstruating.
- Y=0 indicates not menstruating.
Predictor Variables:
- BMI
- Body Fat (measured in mm via skinfold thickness)
- Participation in Sports (1 for Yes, 0 for No)
Example of the first 6 observations:
- 1: Menarche = 1, BMI = 19.3, Fat = 23.9, Sports = 1
- 2: Menarche = 1, BMI = 23.0, Fat = 28.8, Sports = 1
- 3: Menarche = 1, BMI = 27.8, Fat = 32.4, Sports = 0
- 4: Menarche = 1, BMI = 20.9, Fat = 25.8, Sports = 0
- 5: Menarche = 0, BMI = 20.4, Fat = 22.5, Sports = 0
- 6: Menarche = 1, BMI = 20.4, Fat = 22.1, Sports = 0

Logistic Regression Model Creation:
- Code (R):
    model <- glm(Menarche ~ ., data = Menarche, family = "binomial")
    summary(model)
Sample Output Summary:
- Call: glm(formula = Menarche ~ ., family = "binomial", data = Menarche)
- Coefficients:
  - Intercept: Estimate = -2.64967, Std. Error = 1.42254, z value = -1.863, Pr(>|z|) = 0.06251
  - BMI: Estimate = 0.39786, Std. Error = 0.13420, z value = 2.965, Pr(>|z|) = 0.00303 (significant)
  - Fat: Estimate = -0.17985, Std. Error = 0.07174, z value = -2.507, Pr(>|z|) = 0.01218 (significant)
  - Sports: Estimate = -0.36924, Std. Error = 0.72512, z value = -0.509, Pr(>|z|) = 0.61060 (not significant)
- Significance codes: ‘***’ (p < 0.001), ‘**’ (p < 0.01), ‘*’ (p < 0.05), ‘.’ (p < 0.1), and ‘ ’ (p ≥ 0.1, not significant).

Null deviance: 120.09 on 91 degrees of freedom
Residual deviance: 108.61 on 88 degrees of freedom
AIC = 116.61
The Sports variable has a p-value above 0.05 (0.61), so it is not a significant predictor, and the model can be re-estimated without it.
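The deviance figures above can be cross-checked with a little arithmetic. The sketch below uses only the numbers reported in the summary; the variable names are illustrative. It relies on the fact that, for a 0/1 response, the residual deviance equals -2 times the log-likelihood, so AIC = deviance + 2k, where k is the number of estimated coefficients.

```r
# Values copied from the summary output above
null_dev  <- 120.09; null_df  <- 91
resid_dev <- 108.61; resid_df <- 88
k <- 4  # intercept + BMI + Fat + Sports

# AIC for a binary (0/1) response: residual deviance + 2 * number of coefficients
aic <- resid_dev + 2 * k            # 116.61, matching the reported AIC

# Likelihood-ratio test of the fitted model against the intercept-only model
lr_stat <- null_dev - resid_dev     # 11.48
lr_df   <- null_df - resid_df       # 3
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
p_value
```

A small p-value here suggests that the three predictors together improve on the intercept-only model, even though Sports alone is not significant.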

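Interpretation of logistic regression coefficients (the percentages below correspond to the coefficients 0.389 for BMI and -0.172 for Fat used in the hand calculation further down, which is consistent with the model re-estimated without Sports; the intercept is not interpreted):
- Holding Fat constant, each 1-unit increase in BMI multiplies the odds of Menarche by $e^{0.389} \approx 1.476$, a 47.6% increase (1.476 - 1 = 0.476).
- Holding BMI constant, each 1 mm increase in Fat multiplies the odds of Menarche by $e^{-0.172} \approx 0.842$, a 15.8% decrease (1 - 0.842 = 0.158).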
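The percentages above can be reproduced by exponentiating the coefficients. This is a minimal sketch that types in the rounded values 0.389 and -0.172 by hand; with a fitted model object one would typically apply `exp()` to `coef()` of that model instead.

```r
# Rounded coefficients entered by hand (assumed to come from the model without Sports)
b <- c(BMI = 0.389, Fat = -0.172)

odds_ratio <- exp(b)                   # multiplicative change in odds per one-unit increase
pct_change <- (odds_ratio - 1) * 100   # the same change expressed as a percentage

round(odds_ratio, 3)   # BMI 1.476, Fat 0.842
round(pct_change, 1)   # BMI 47.6, Fat -15.8
```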

Prediction for new data based on BMI and Fat levels:
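- Code (R):
    # Re-estimate without Sports so that predictions need only BMI and Fat
    # (model2 is an illustrative name)
    model2 <- glm(Menarche ~ BMI + Fat, data = Menarche, family = "binomial")
    newdata <- data.frame(BMI = c(20, 26), Fat = c(30, 22))
    predict(model2, newdata = newdata, type = "response")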
Expected Output for Predictions:
- Each outcome reflects the predicted probability based on the supplied BMI and Fat levels.
Example hand calculation for a subject with BMI=26 and Fat=22, using the rounded coefficients -2.73 (intercept), 0.389 (BMI), and -0.172 (Fat):
- Calculate:
  $P(Y=1) = \frac{e^{-2.73 + 0.389(26) - 0.172(22)}}{1 + e^{-2.73 + 0.389(26) - 0.172(22)}}$
- Resulting probability: 0.973 (or 97.3%).
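The hand calculation can be checked in R. The sketch below plugs in the same rounded coefficients for both new subjects (BMI=20, Fat=30 and BMI=26, Fat=22); because the coefficients are rounded, the result may differ slightly in later decimals from what `predict()` returns on the fitted model. The variable names are illustrative.

```r
# Rounded coefficients from the hand calculation above
b0 <- -2.73; b_bmi <- 0.389; b_fat <- -0.172

bmi <- c(20, 26)
fat <- c(30, 22)

eta <- b0 + b_bmi * bmi + b_fat * fat   # linear predictor (log-odds): -0.11, 3.60
p   <- exp(eta) / (1 + exp(eta))        # logistic function

# plogis() is base R's built-in logistic function and gives the same result
stopifnot(isTRUE(all.equal(p, plogis(eta))))

round(p, 3)  # 0.473 0.973
```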

When there are many candidate predictors, stepwise algorithms can be used to select a logistic regression model.
Example: the NHANES dataset, with Diabetes (Yes/No) as the response:
- Code (R):
    library(tidyverse)
    library(NHANES)
    data(NHANES)
    NHdata <- NHANES %>%
      dplyr::select(BMI, Age, Gender, MaritalStatus, HomeOwn, Pulse, BPSysAve,
                    BPDiaAve, TotChol, Diabetes, SleepTrouble, PhysActive, HardDrugs)
    NHdata2 <- na.omit(NHdata)  # Drops rows that contain any missing value

Full Model Creation:
- Code (R):
    full <- glm(Diabetes ~ ., data = NHdata2, family = "binomial")
    stepwise <- step(full, direction = "both", trace = 0)
    summary(stepwise)
The summary of the stepwise model lists the retained predictors (such as BMI, Age, and Gender) with their estimates, z values, and p-values, which indicate how strongly each characteristic is associated with Diabetes.
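To see what `step()` does without needing the NHANES package, here is a small self-contained sketch on simulated data. Everything in it (the data frame `sim`, the predictors `x1`, `x2`, `noise`, and the true coefficients) is invented for illustration and is not part of the NHANES analysis.

```r
set.seed(1)
n <- 200

# Simulated predictors: x1 and x2 affect the outcome, noise does not
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
sim$y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * sim$x1 - 0.8 * sim$x2))

# Fit the full model, then let step() add or drop terms to lower the AIC
full_sim <- glm(y ~ ., data = sim, family = "binomial")
step_sim <- step(full_sim, direction = "both", trace = 0)

formula(step_sim)               # the model that stepwise selection settled on
round(exp(coef(step_sim)), 3)   # odds ratios for the retained terms
```

Because `step()` only accepts a move that lowers the AIC, the selected model never has a higher AIC than the full model it started from.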

Model Simplification:
- Non-significant predictors can be eliminated iteratively.
- Remaining significant predictors demonstrate the effects on Diabetes.

Applying the exponential function to the coefficients yields odds ratios:
- E.g., an odds ratio of 1.387 for HardDrugs indicates 38.7% higher odds of Diabetes, while an odds ratio of 0.804 for Total Cholesterol indicates that each one-unit increase is associated with 19.6% lower odds.

Task using NHANES dataset regarding SleepTrouble:
1. Use stepwise selection to determine the predictive model.
2. Identify which predictor increases the odds of having Sleep Trouble (e.g., HardDrugsYes indicates a 122.9% increase in odds).
3. Determine which factor most reduces the odds of having Sleep Trouble (Gendermale indicates a reduction of approximately 40%).
Provide summary statistics and model interpretations for educational purposes.

A quiz covering Weeks 9 and 10 material has been announced. Students may use one 8.5 x 11-inch sheet of notes and a calculator, but not cell phones, during the quiz.