9 - Logistic Regression

Recap of Linear Regression

  • Step 1: Choose a mapping function with unknown parameters.

    • Example Mapping:

      • Price = W₀ + W₁ * sqft + W₂ * year + W₃ * location.

  • Step 2: Define a loss function, e.g., Mean Squared Error (MSE), that measures how far the predictions are from the true values.

  • Step 3: Optimize loss function using training data to obtain the best values of parameters.

    • Examples of optimized weights:

      • W₀ = 10.2, W₁ = 8.8, W₂ = -6.2, W₃ = 1.4 (see the sketch below).
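
  As a quick illustration, here is a minimal Python sketch that evaluates the recap's mapping function with the optimized weights from Step 3. The feature values are made up for demonstration:

    # Minimal sketch of the linear regression mapping from the recap.
    # Weights are the illustrative values above; feature values are invented.
    def predict_price(sqft, year, location):
        w0, w1, w2, w3 = 10.2, 8.8, -6.2, 1.4
        return w0 + w1 * sqft + w2 * year + w3 * location

    print(predict_price(sqft=1500.0, year=2005.0, location=3.0))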

Introduction to Logistic Regression

  • Definition: A supervised machine learning algorithm for binary classification tasks.

  • Despite the name, it is a classification algorithm, not a regression algorithm.

  • Fundamental in more complex machine learning models, such as deep learning.

  • Predicts the probability of a binary outcome (e.g., positive or negative, 0 or 1).

Notation in Logistic Regression

  • n: Number of training examples.

  • m: Number of features.

  • xᵢ: Input feature vector for the i-th training example.

  • yᵢ: Label of the i-th training example (a binary value, 0 or 1).

  • w: Parameters of the logistic regression model (coefficients).

    • w = [w₀, w₁, …, wₘ].

  • h_w(x): Mapping (hypothesis) function representing the predicted value.

Sigmoid Function

  • Mapping Function:

    • P(y = 1|x) = sigmoid(z)

    • z = w₀ + w₁x₁ + w₂x₂ + … + wₘxₘ

    • Sigmoid function: f(z) = 1 / (1 + e^−z)

  • Output of logistic regression is a probability between 0 and 1.

  • It transforms the output of the linear function into a value within the range (0, 1).

  • Different parameter values w give different mapping functions.

  • Which w is best? That depends on the training data, i.e., on the relationship between the features and the labels.

  • The goal of training is to find the optimal w so that the predicted probability p(y = 1 | x) is as accurate as possible (see the sketch below).
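
  To make the mapping concrete, here is a minimal Python sketch of the sigmoid and the resulting probability. The coefficient and feature values are hypothetical:

    import numpy as np

    def sigmoid(z):
        # f(z) = 1 / (1 + e^(-z)): maps any real z into (0, 1)
        return 1.0 / (1.0 + np.exp(-z))

    def predict_proba(w, x):
        # z = w0 + w1*x1 + ... + wm*xm, with w = [w0, w1, ..., wm]
        z = w[0] + np.dot(w[1:], x)
        return sigmoid(z)

    w = np.array([-1.0, 2.0, 0.5])   # hypothetical coefficients [w0, w1, w2]
    x = np.array([0.8, -0.3])        # hypothetical feature vector [x1, x2]
    print(predict_proba(w, x))       # P(y = 1 | x), a value in (0, 1)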

Log Loss

  • For binary classification:

    • N: Number of instances.

    • pᵢ: Predicted probability of class 1 for the i-th instance.

    • yᵢ: True label for the i-th instance.

    • Log Loss = -(1/N) Σᵢ [yᵢ log(pᵢ) + (1 - yᵢ) log(1 - pᵢ)].

Cross-Entropy Loss Function

  • For binary classification problems, the cross-entropy loss (log loss) is typically used: the optimal coefficients are those that minimize the difference between the model's predicted probability distribution and the true labels.

  • For optimization during training:

    • Loss for binary classification:

      • Loss(yᵢ, p(yᵢ = 1 | xᵢ)) = - log(p(yᵢ = 1 | xᵢ)) if yᵢ = 1.

      • Loss(yᵢ, p(yᵢ = 1 | xᵢ)) = - log(1 - p(yᵢ = 1 | xᵢ)) if yᵢ = 0.

  • Overall loss: Loss = Σᵢ (loss for the i-th instance), i.e., the sum (often the average) over all N training instances (see the sketch below).
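
  The per-instance losses above combine into the log loss defined earlier. A minimal numpy sketch, with made-up labels and probabilities:

    import numpy as np

    def log_loss(y, p, eps=1e-12):
        # Binary cross-entropy averaged over N instances:
        # -(1/N) * sum(y_i*log(p_i) + (1 - y_i)*log(1 - p_i))
        p = np.clip(p, eps, 1 - eps)  # avoid log(0)
        return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

    y = np.array([1, 0, 1])          # true labels (made up)
    p = np.array([0.9, 0.2, 0.6])    # predicted probabilities for class 1 (made up)
    print(log_loss(y, p))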

Optimization Process

  • Similar to linear regression, we aim to find the best-fitting model (fit to training data) by minimizing the cross-entropy loss function.

  • The loss penalizes the model more for predicted probabilities that are far from the true labels.

  • Gradient Descent Process (often applied):

    • Randomly initialize the weights w.

    • Iteratively update the weights using the gradient of the loss function with respect to the weights, adjusting them in the direction that minimizes the loss.

    • This process continues until convergence is reached, meaning the changes in the loss function are sufficiently small.

    • At this point, the final weights can be used to make predictions on new data, allowing us to classify instances based on the learned model (a minimal training-loop sketch follows below).
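
  Here is a minimal sketch of that gradient descent loop in numpy, assuming a small made-up dataset. The gradient of the average log loss with respect to w works out to Xᵀ(p - y) / N:

    import numpy as np

    def train_logistic_regression(X, y, lr=0.1, n_iters=1000):
        # X: (N, m) feature matrix; y: (N,) binary labels.
        # A column of ones folds the intercept w0 into the weight vector.
        Xb = np.hstack([np.ones((X.shape[0], 1)), X])
        w = np.random.randn(Xb.shape[1]) * 0.01   # random initialization
        for _ in range(n_iters):
            p = 1.0 / (1.0 + np.exp(-Xb @ w))     # predicted probabilities
            grad = Xb.T @ (p - y) / len(y)        # gradient of the average log loss
            w -= lr * grad                        # step in the direction that lowers the loss
        return w

    # Hypothetical tiny dataset: one feature, labels split around x = 0.
    X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
    y = np.array([0.0, 0.0, 1.0, 1.0])
    print(train_logistic_regression(X, y))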

Interpretability of Logistic Regression

  • Sign of Coefficients:

    • Positive: The feature increases the probability of the outcome being 1.

    • Negative: The feature decreases the probability of the outcome being 1.

  • Magnitude of Coefficients:

    • A large coefficient (in absolute value) indicates a strong impact on the outcome probability (assuming the features are on comparable scales).

  • Example: In credit card fraud detection:

    • Transaction amount coefficient = 2.5 (positive impact);

    • Transaction location coefficient = -1.2 (negative impact);

    • Time of day coefficient = 3.8 (positive impact); see the sketch below.
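
  A minimal sketch of how the signs play out, using the hypothetical fraud coefficients above plus an assumed intercept (all values are illustrative, not from a real model):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Hypothetical coefficients from the example; the intercept w0 is assumed.
    w0, w_amount, w_location, w_time = -4.0, 2.5, -1.2, 3.8

    def fraud_proba(amount, location, time_of_day):
        z = w0 + w_amount * amount + w_location * location + w_time * time_of_day
        return sigmoid(z)

    # Increasing a feature with a positive coefficient raises P(fraud = 1):
    print(fraud_proba(1.0, 0.5, 0.5))
    print(fraud_proba(2.0, 0.5, 0.5))  # larger amount -> higher probability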

Decision Boundary of Logistic Regression

  • Logistic regression learns a linear decision boundary.

  • Class predictions are made by thresholding the predicted probability (typically at 0.5), which corresponds to the linear equation z = 0 in feature space.

  • Points exactly on the decision boundary have predicted probability 0.5, so the model is maximally uncertain between the two classes (see the sketch below).
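
  A minimal sketch checking these facts with hypothetical weights: the boundary is the set of points where z = 0, i.e., where the predicted probability is exactly 0.5:

    import numpy as np

    w = np.array([-1.0, 2.0, 1.0])   # hypothetical [w0, w1, w2]

    def z_score(x):
        return w[0] + np.dot(w[1:], x)

    def classify(x, threshold=0.5):
        p = 1.0 / (1.0 + np.exp(-z_score(x)))
        return int(p >= threshold)

    x_on_boundary = np.array([0.5, 0.0])     # w0 + 2*0.5 + 1*0.0 = 0, so p = 0.5
    print(z_score(x_on_boundary))            # 0.0
    print(classify(np.array([1.0, 1.0])))    # z > 0 -> class 1
    print(classify(np.array([-1.0, -1.0])))  # z < 0 -> class 0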

Summary of Logistic Regression

  • A fundamental model in machine learning.

  • Acts as a building block for more complex models.

  • Pros:

    • Simple and fast.

    • Easy to interpret.

  • Cons:

    • Limited to binary classification (in its basic form; softmax regression, below, extends it to multiple classes).

    • Sensitive to outliers.

    • Less accurate compared to more advanced models.

Multi-Class Classification

  • Predicting a class among three or more categories. The model must predict a probability distribution over all possible classes for each example.

  • Examples include:

    • Classifying product types (e.g., electronics, clothing);

    • Classifying images or news articles.

Softmax Regression

  • A machine learning model used for multi-class classification.

  • Generalizes logistic regression from binary classification to multi-class classification using the softmax function to model the relationship between the input features and the probability of each class.

  • Each class k has its own linear score: zₖ = w₀ₖ + w₁ₖx₁ + ... + wₘₖxₘ.

  • Predicts a probability distribution over the K classes: P(y = k | x) = e^(zₖ) / Σⱼ e^(zⱼ).

  • The loss (categorical cross-entropy) is calculated similarly to logistic regression (see the sketch below).
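
  A minimal numpy sketch of softmax regression's prediction step, with hypothetical weights for K = 3 classes and m = 2 features:

    import numpy as np

    def softmax(z):
        # Numerically stable softmax: subtract the max before exponentiating.
        z = z - np.max(z)
        e = np.exp(z)
        return e / e.sum()

    def predict_proba_multiclass(W, b, x):
        # W: (K, m) weight matrix, one row of coefficients per class;
        # b: (K,) intercepts. z_k = b_k + W[k] . x for each class k.
        z = W @ x + b
        return softmax(z)   # probability distribution over the K classes

    W = np.array([[1.0, -0.5], [0.2, 0.3], [-1.0, 0.8]])  # hypothetical weights
    b = np.array([0.1, -0.2, 0.0])                        # hypothetical intercepts
    x = np.array([0.5, 1.5])
    p = predict_proba_multiclass(W, b, x)
    print(p, p.sum())   # the probabilities sum to 1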