Logistic regression is a statistical method used for predictive analytics, particularly when the response variable is categorical [1, 2]. It is a type of classification algorithm that estimates the probability of a binary outcome [3, 4]. Although it shares some similarities with linear regression, a key difference is that logistic regression is designed to handle categorical response variables, while linear regression is used to estimate a continuous numerical variable [3].
Core Concepts
Categorical Response Variable: Logistic regression is used when the outcome or response variable is categorical, meaning it falls into distinct categories [3]. Typically, this is a binary or binomial variable (e.g., yes/no, pass/fail), though modified versions can handle multi-class outputs [1, 3].
Probability-Based: Logistic regression models the probability of an event occurring; the predicted probability always lies between 0 and 1 [1, 5].
Logit Transformation: Logistic regression uses the natural logarithm of the odds of the response variable to create a continuous criterion [1]. The logit transformation is the link function that connects the linear combination of predictors to the probability of the outcome [1, 6].
Logistic Function: The logistic function, also known as the sigmoid function, is central to logistic regression. This S-shaped function can only take values between 0 and 1 and is used to predict the probability of the outcome [5, 7].
The function can be written as p(x) = 1 / (1 + e^-(b0 + b1x)), where b0 is the intercept and b1 is the slope [5]. A short sketch of this function follows this list.
Maximum Likelihood Estimation: The coefficients (the b's) in logistic regression are estimated by maximum likelihood estimation (MLE). Unlike linear regression, logistic regression has no closed-form solution, so an iterative process is used to find the coefficient values that maximize the likelihood function [8].
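To make the relationship between the logit and the sigmoid concrete, here is a minimal R sketch; the values of b0, b1, and x below are arbitrary, chosen purely for illustration:

    sigmoid <- function(z) 1 / (1 + exp(-z))   # the logistic (S-shaped) function

    b0 <- -1.5   # hypothetical intercept
    b1 <-  0.8   # hypothetical slope
    x  <-  2.0   # an example predictor value

    eta <- b0 + b1 * x      # linear predictor: the log-odds
    p   <- sigmoid(eta)     # predicted probability, always strictly between 0 and 1
    log(p / (1 - p))        # applying the logit recovers eta (here, 0.1)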
Similarities and Differences with Linear Regression
Similarities:
Both methods aim to model the relationship between a response variable and explanatory variables using a sample of past observations [3].
Both linear and logistic regression use a linear combination of predictors, represented as η(x) = β_0 + β_1 x_1 + β_2 x_2 + ... + β_(p-1) x_(p-1) [9, 10].
Differences:
Response Variable: Linear regression is used for continuous numerical response variables, whereas logistic regression is used for categorical response variables [3].
Output: The output of linear regression is a numerical variable, whereas the output of logistic regression is a class or probability [1, 3].
Link Function: Logistic regression uses a link function (the logit) to relate the linear combination of predictors to the probability of the outcome; linear regression models the mean of the response directly (equivalently, it uses the identity link) [1].
Error Distribution: Logistic regression assumes the response follows a Bernoulli distribution, whereas linear regression assumes normally distributed errors [2, 6].
Parameter Estimation: Logistic regression coefficients are estimated using maximum likelihood estimation, while linear regression typically uses ordinary least squares [8, 11, 12].
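A minimal R sketch makes the contrast concrete. The data frame my_data and its columns are hypothetical, assumed only for illustration:

    # Continuous response: fitted by ordinary least squares with lm()
    fit_linear <- lm(income ~ age + education, data = my_data)

    # Binary response: fitted by maximum likelihood with glm() and a logit link
    fit_logistic <- glm(default ~ age + education, family = binomial, data = my_data)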
Model Development
Data: The data for logistic regression consists of observations, each with a set of attributes (explanatory variables) and a corresponding categorical response variable [13].
Model Equation: The logistic regression model is expressed as log(p(x) / (1 - p(x))) = β_0 + β_1 x_1 + ... + β_(p-1) x_(p-1), where p(x) is the probability of the outcome and the β's are the coefficients. Rearranging gives the probability directly: p(x) = e^(β_0 + β_1 x_1 + ... + β_(p-1) x_(p-1)) / (1 + e^(β_0 + β_1 x_1 + ... + β_(p-1) x_(p-1))) [14, 15]. A worked sketch follows this list.
Parameter Estimation: The parameters of the logistic model are estimated using maximum likelihood estimation [8, 12].
Model Assessment: Once fitted, the model's fit and validity are assessed; because the restrictive assumptions of linear regression do not hold here, likelihood-based tests and classification performance measures are typically used [16].
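As a worked sketch of the model equation above, the following R lines compute p(x) from made-up coefficient values (β_0 = -2, β_1 = 0.05) for a single predictor:

    beta0 <- -2.0; beta1 <- 0.05; x1 <- 50   # hypothetical coefficients and input
    eta <- beta0 + beta1 * x1                # log-odds: log(p(x) / (1 - p(x))) = 0.5
    p   <- exp(eta) / (1 + exp(eta))         # rearranged formula: p ≈ 0.62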
Logistic Regression in Predictive Analytics
Probabilistic Models: Logistic regression is used to develop probabilistic models between one or more explanatory variables and a class/response variable [1]. These explanatory variables can be a mix of continuous and categorical [1].
Classification: Logistic regression is used to predict categorical outcomes by treating the response variable as the outcome of a Bernoulli trial [1].
Odds Ratio: Logistic regression can be used to estimate the odds ratio for a variable. The odds ratio is the ratio of the odds of an event occurring in one group compared to another [17].
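In R, odds ratios are typically obtained by exponentiating the fitted coefficients, since each coefficient is a change in log-odds per unit increase in its predictor. A minimal sketch, where fit is assumed to be a model returned by glm(..., family = binomial):

    exp(coef(fit))   # odds ratios; e.g., 1.5 would mean 50% higher odds per unit increase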
R Implementation
The glm() function in R is used to fit generalized linear models, including logistic regression, which is specified with the argument family = binomial [18]. A runnable sketch follows this list.
The predict() function is used to make predictions from a fitted logistic regression model; predictions can be on the link scale (log-odds) or the response scale (probabilities) [19].
The formula syntax used in linear regression can be used with logistic regression [20].
The summary() function provides details about the parameter estimates, standard errors, z-values and p-values of the logistic regression model [21].
The anova() function can be used to perform likelihood-ratio tests between nested logistic regression models [22].
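The sketch below exercises these functions on R's built-in mtcars data; the model itself (predicting transmission type am from weight wt) is illustrative only:

    fit0 <- glm(am ~ 1,  family = binomial, data = mtcars)   # null (intercept-only) model
    fit1 <- glm(am ~ wt, family = binomial, data = mtcars)   # adds weight as a predictor

    summary(fit1)                          # estimates, standard errors, z- and p-values
    predict(fit1, type = "link")[1:3]      # predictions on the log-odds scale
    predict(fit1, type = "response")[1:3]  # predictions on the probability scale
    anova(fit0, fit1, test = "Chisq")      # likelihood-ratio test of the nested models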
Model Evaluation
Likelihood-Ratio Test: Used to compare nested logistic regression models by comparing the log-likelihoods of the models [23, 24].
Wald Test: Used to test if a single parameter in the model is equal to zero [21, 25].
Misclassification Rate: Used to evaluate the overall performance of a classifier by determining the proportion of incorrect classifications [26].
Confusion Matrix: A confusion matrix is used to evaluate the performance of a classification model, which shows counts of true positives, true negatives, false positives, and false negatives [27].
Sensitivity and Specificity: Sensitivity (or true positive rate) measures the proportion of actual positives that are correctly identified, and specificity (or true negative rate) measures the proportion of actual negatives that are correctly identified [28].
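These quantities are straightforward to compute by hand in R. In this minimal sketch, the predicted and actual class vectors are made up for illustration:

    actual    <- c(1, 0, 1, 1, 0, 0, 1, 0)   # hypothetical true classes
    predicted <- c(1, 0, 0, 1, 0, 1, 1, 0)   # hypothetical predicted classes

    table(Predicted = predicted, Actual = actual)          # confusion matrix
    mean(predicted != actual)                              # misclassification rate (0.25)
    sum(predicted == 1 & actual == 1) / sum(actual == 1)   # sensitivity (0.75)
    sum(predicted == 0 & actual == 0) / sum(actual == 0)   # specificity (0.75)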
Classification
Logistic regression can be used for classification by assigning observations to a class based on the predicted probability. A common approach is to classify an observation as 1 if the predicted probability is greater than 0.5, and 0 otherwise [29].
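A minimal sketch of this rule in R, where fit is assumed to be a fitted binomial glm() model and new_data a hypothetical data frame of observations to classify:

    probs   <- predict(fit, newdata = new_data, type = "response")  # predicted probabilities
    classes <- ifelse(probs > 0.5, 1, 0)                            # 0.5 cutoff rule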
Limitations
Logistic regression assumes a linear relationship between the predictors and the log-odds of the outcome, which may not always be the case [10, 30].
The model is sensitive to outliers, which can disproportionately influence the estimated coefficients [31].
Like other regression models, logistic regression does not establish causality [31].
Logistic regression is a versatile tool for classification tasks and predictive modeling, particularly when dealing with binary or categorical outcomes. It provides a framework for estimating probabilities, identifying significant predictors, and making informed decisions based on data.