In Depth Notes on Logistic Regression Analysis
Introduction to Logistic Regression
- Logistic Regression is a supervised machine learning algorithm used for binary classification, determining the relationship between a binary dependent variable and independent variables.
- Goal: Identify a well-fitting model that describes the relationship between a binary dependent variable (Y) and a set of independent variables.
- Y = 1 (event/success) or Y = 0 (no event/failure)
Basic Concepts
- Dependent Variable: A binary outcome (e.g., success/failure).
- Independent Variables: Variables that explain the dependent variable.
- Example Scenario: Examine the relationship between device type and blood pressure measurements to classify whether a reading comes from a GP surgery or a home device.
Logistic Regression Basics
- Key Ideas:
- Probability: Instead of predicting 0 or 1, we predict the probability of an event happening (e.g., P(Y=1)).
- Odds: The ratio of the probability that the event occurs to the probability that it does not.
- Computation: Odds (O) = P/(1 - P)
- Log Odds (Logit): To map probability from the bounded interval (0, 1) onto the whole real line (-∞, +∞), we use log odds:
- Logit(P) = log(P/(1-P))
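The odds and logit transformations above can be sketched as small helper functions (a minimal illustration; the function names are ours, not from any particular library):

```python
import math

def odds(p):
    """Convert a probability p in (0, 1) to odds P / (1 - P)."""
    return p / (1 - p)

def logit(p):
    """Log odds: maps a probability in (0, 1) onto (-inf, +inf)."""
    return math.log(odds(p))

# A probability of 0.8 gives odds of 4 (4-to-1 in favour of the event),
# while p = 0.5 gives odds of 1 and a logit of 0.
```

Note that the logit is symmetric around p = 0.5: probabilities below 0.5 map to negative log odds, probabilities above 0.5 to positive ones.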
Modeling with Logistic Regression
- Logistic Regression Equation:
- logit(P(X)) = β0 + β1x1 + β2x2 + … + βpxp
- P(X) = e^(β0 + β1x1 + β2x2 + … + βpxp) / (1 + e^(β0 + β1x1 + β2x2 + … + βpxp))
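The second equation (the inverse logit, or sigmoid) turns the linear predictor back into a probability. A minimal sketch, with hypothetical coefficient and feature values:

```python
import math

def predict_prob(beta0, betas, xs):
    """P(Y=1 | x) from the logistic regression equation:
    the sigmoid of the linear predictor beta0 + sum(beta_i * x_i)."""
    z = beta0 + sum(b * x for b, x in zip(betas, xs))
    return math.exp(z) / (1 + math.exp(z))

# With all coefficients zero the model is indifferent: P(Y=1) = 0.5.
# A positive linear predictor pushes the probability above 0.5.
```

Because the sigmoid output always lies strictly between 0 and 1, the model yields valid probabilities for any values of the predictors.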
- Maximum Likelihood Estimation (MLE): Used to estimate the parameters. It finds the parameter values that maximize the likelihood of the observed data.
- Log-Likelihood Function: Logarithm of the likelihood function simplifies the estimation task, allowing optimization of parameters.
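The log-likelihood being maximized can be written out directly: for binary y, each observation contributes y·log(p) + (1−y)·log(1−p). A minimal sketch (an illustration of the objective, not a fitting routine):

```python
import math

def log_likelihood(beta0, betas, X, y):
    """Log-likelihood of binary outcomes y given candidate parameters.

    X is a list of feature vectors, y a list of 0/1 outcomes; each
    observation contributes y*log(p) + (1 - y)*log(1 - p), where p is
    the model's predicted probability for that observation."""
    ll = 0.0
    for xs, yi in zip(X, y):
        z = beta0 + sum(b * x for b, x in zip(betas, xs))
        p = 1 / (1 + math.exp(-z))
        ll += yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return ll
```

MLE searches (typically via Newton-Raphson or gradient-based optimizers) for the parameters that make this quantity as large as possible; an intercept-only model with β0 = 0 assigns p = 0.5 to every observation, contributing log(0.5) each.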
Interpretation of Outcomes
- Coefficients (β):
- β0: Log odds of the outcome when all independent variables are zero (the baseline).
- βi: Change in the log odds for a one-unit increase in xi, holding the other variables fixed.
- e^(βi): Odds ratio indicating how changes in the independent variable influence the odds of the outcome.
- Example: For a one-week increase in gestational age, the odds of normal birth weight are multiplied by e^β, the odds ratio for that coefficient.
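The odds-ratio interpretation above can be made concrete with a hypothetical coefficient (the value 0.35 is illustrative, not taken from any fitted model):

```python
import math

# Hypothetical fitted coefficient for gestational age (per week)
beta_gest = 0.35

# The odds ratio: how the odds of the outcome multiply
# for each one-week increase in gestational age.
odds_ratio = math.exp(beta_gest)

# e^0.35 is roughly 1.42, i.e. each extra week multiplies the odds
# of normal birth weight by about 1.42 (a ~42% increase in the odds),
# holding the other predictors fixed.
```

Note the distinction: the coefficient acts additively on the log-odds scale, but multiplicatively on the odds scale.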
Goodness of Fit in Logistic Regression
- Goodness-of-Fit Tests: Assess how well models predict outcomes compared to observed data. Includes:
- Chi-Square Tests, Deviance, Hosmer-Lemeshow Tests.
- Pseudo R-Squared (e.g., McFadden's R²): Quantifies how much the fitted logistic model improves on the null (intercept-only) model, analogous to R² in linear regression.
- Classification Performance Metrics:
- Accuracy, Sensitivity, Specificity, Precision, F1 Score, and ROC Curve analysis for evaluating model performance.
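The classification metrics listed above all follow from the four confusion-matrix counts. A minimal sketch (function name and return format are our own choices):

```python
def classification_metrics(tp, fp, tn, fn):
    """Common classification metrics from confusion-matrix counts:
    true positives, false positives, true negatives, false negatives."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)   # recall / true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)     # positive predictive value
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "specificity": specificity, "precision": precision, "f1": f1}
```

The ROC curve extends this picture by plotting sensitivity against 1 − specificity as the classification threshold on the predicted probability is varied.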
Conclusion and Further Reading
- Logistic regression is a powerful tool for predicting binary outcomes. It balances interpretability and predictive ability effectively.
- Recommended Reading: Washington et al. (2020). Statistical and Econometric Methods for Transportation Data Analysis.
- Final Note: Understanding logistic regression involves conceptualizing how probability transforms into odds and log odds, making it a robust method for classification tasks.