9 - Logistic Regression

Recap of Linear Regression

  • Step 1: Choose a mapping function with unknown parameters.

    • Example Mapping:

      • Price = W₀ + W₁ * sqft + W₂ * year + W₃ * location.

  • Step 2: Define a loss function, e.g., Mean Squared Error (MSE), that measures how far the predictions are from the true values.

  • Step 3: Optimize loss function using training data to obtain the best values of parameters.

    • Examples of optimized weights:

      • W₀ = 10.2, W₁ = 8.8, W₂ = -6.2, W₃ = 1.4 (see the sketch below).
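
  As a quick illustration, here is a minimal Python sketch that evaluates the recap's mapping function with the optimized weights from Step 3. The feature values are made up for demonstration:

    # Minimal sketch of the linear regression mapping from the recap.
    # Weights are the illustrative values above; feature values are invented.
    def predict_price(sqft, year, location):
        w0, w1, w2, w3 = 10.2, 8.8, -6.2, 1.4
        return w0 + w1 * sqft + w2 * year + w3 * location

    print(predict_price(sqft=1500.0, year=2005.0, location=3.0))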

Introduction to Logistic Regression

  • Definition: A supervised machine learning algorithm for binary classification tasks.

  • Despite the name, it is a classification algorithm, not a regression algorithm.

  • Fundamental in more complex machine learning models, such as deep learning.

  • Predicts the probability of a binary outcome (e.g., positive or negative, 0 or 1).

Notation in Logistic Regression

  • n: Number of training examples.

  • m: Number of features.

  • xᵢ: Input feature vector for the i-th training example.

  • yᵢ: Label of the i-th training example (a binary value, 0 or 1).

  • w: Parameters of the logistic regression model (coefficients).

    • w = [w₀, w₁, …, wₘ].

  • h_w(x): Mapping (hypothesis) function representing the predicted value.

Sigmoid Function

  • Mapping Function:

    • P(y = 1|x) = sigmoid(z)

    • z = w₀ + w₁x₁ + w₂x₂ + … + wₘxₘ

    • Sigmoid function: f(z) = 1 / (1 + e^−z)

  • Output of logistic regression is a probability between 0 and 1.

  • It transforms the output of the linear function into a value within the range (0, 1).

  • Different parameter values w give different mapping functions.

  • Which w is best? That depends on the training data, i.e., on the relationship between the features and the labels.

  • The goal of training is to find the optimal w so that the predicted probability p(y = 1 | x) is as accurate as possible (see the sketch below).
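
  To make the mapping concrete, here is a minimal Python sketch of the sigmoid and the resulting probability. The coefficient and feature values are hypothetical:

    import numpy as np

    def sigmoid(z):
        # f(z) = 1 / (1 + e^(-z)): maps any real z into (0, 1)
        return 1.0 / (1.0 + np.exp(-z))

    def predict_proba(w, x):
        # z = w0 + w1*x1 + ... + wm*xm, with w = [w0, w1, ..., wm]
        z = w[0] + np.dot(w[1:], x)
        return sigmoid(z)

    w = np.array([-1.0, 2.0, 0.5])   # hypothetical coefficients [w0, w1, w2]
    x = np.array([0.8, -0.3])        # hypothetical feature vector [x1, x2]
    print(predict_proba(w, x))       # P(y = 1 | x), a value in (0, 1)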

Log Loss

  • For binary classification:

    • N: Number of instances.

    • pᵢ: Predicted probability of class 1 for the i-th instance.

    • yᵢ: True label for the i-th instance.

    • Log Loss = -(1/N) Σᵢ [yᵢ log(pᵢ) + (1 - yᵢ) log(1 - pᵢ)].

Cross-Entropy Loss Function

  • For binary classification problems, the cross-entropy loss (log loss) is typically used: the optimal coefficients are those that minimize the difference between the model's predicted probability distribution and the true labels.

  • For optimization during training:

    • Loss for binary classification:

      • Loss(yᵢ, p(yᵢ = 1 | xᵢ)) = - log(p(yᵢ = 1 | xᵢ)) if yᵢ = 1.

      • Loss(yᵢ, p(yᵢ = 1 | xᵢ)) = - log(1 - p(yᵢ = 1 | xᵢ)) if yᵢ = 0.

  • Overall loss: Loss = Σᵢ (loss for the i-th instance), i.e., the sum (often the average) over all N training instances (see the sketch below).
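
  The per-instance losses above combine into the log loss defined earlier. A minimal numpy sketch, with made-up labels and probabilities:

    import numpy as np

    def log_loss(y, p, eps=1e-12):
        # Binary cross-entropy averaged over N instances:
        # -(1/N) * sum(y_i*log(p_i) + (1 - y_i)*log(1 - p_i))
        p = np.clip(p, eps, 1 - eps)  # avoid log(0)
        return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

    y = np.array([1, 0, 1])          # true labels (made up)
    p = np.array([0.9, 0.2, 0.6])    # predicted probabilities for class 1 (made up)
    print(log_loss(y, p))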

Optimization Process

  • Similar to linear regression, we aim to find the best-fitting model (fit to training data) by minimizing the cross-entropy loss function.

  • The loss penalizes the model more for predicted probabilities that are far from the true labels.

  • Gradient Descent Process (often applied):

    • Randomly initialize the weights w.

    • Iteratively update the weights using the gradient of the loss function with respect to the weights, adjusting them in the direction that minimizes the loss.

    • This process continues until convergence is reached, meaning the changes in the loss function are sufficiently small.

    • At this point, the final weights can be used to make predictions on new data, allowing us to classify instances based on the learned model (a minimal training-loop sketch follows below).
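
  Here is a minimal sketch of that gradient descent loop in numpy, assuming a small made-up dataset. The gradient of the average log loss with respect to w works out to Xᵀ(p - y) / N:

    import numpy as np

    def train_logistic_regression(X, y, lr=0.1, n_iters=1000):
        # X: (N, m) feature matrix; y: (N,) binary labels.
        # A column of ones folds the intercept w0 into the weight vector.
        Xb = np.hstack([np.ones((X.shape[0], 1)), X])
        w = np.random.randn(Xb.shape[1]) * 0.01   # random initialization
        for _ in range(n_iters):
            p = 1.0 / (1.0 + np.exp(-Xb @ w))     # predicted probabilities
            grad = Xb.T @ (p - y) / len(y)        # gradient of the average log loss
            w -= lr * grad                        # step in the direction that lowers the loss
        return w

    # Hypothetical tiny dataset: one feature, labels split around x = 0.
    X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
    y = np.array([0.0, 0.0, 1.0, 1.0])
    print(train_logistic_regression(X, y))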

Interpretability of Logistic Regression

  • Sign of Coefficients:

    • Positive: The feature increases the probability of the outcome being 1.

    • Negative: The feature decreases the probability of the outcome being 1.

  • Magnitude of Coefficients:

    • A large coefficient (in absolute value) indicates a strong impact on the outcome probability (assuming the features are on comparable scales).

  • Example: In credit card fraud detection:

    • Transaction amount coefficient = 2.5 (positive impact);

    • Transaction location coefficient = -1.2 (negative impact);

    • Time of day coefficient = 3.8 (positive impact); see the sketch below.
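
  A minimal sketch of how the signs play out, using the hypothetical fraud coefficients above plus an assumed intercept (all values are illustrative, not from a real model):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Hypothetical coefficients from the example; the intercept w0 is assumed.
    w0, w_amount, w_location, w_time = -4.0, 2.5, -1.2, 3.8

    def fraud_proba(amount, location, time_of_day):
        z = w0 + w_amount * amount + w_location * location + w_time * time_of_day
        return sigmoid(z)

    # Increasing a feature with a positive coefficient raises P(fraud = 1):
    print(fraud_proba(1.0, 0.5, 0.5))
    print(fraud_proba(2.0, 0.5, 0.5))  # larger amount -> higher probability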

Decision Boundary of Logistic Regression

  • Logistic regression learns a linear decision boundary.

  • Class predictions are made by thresholding the predicted probability (typically at 0.5), which corresponds to the linear equation z = 0 in feature space.

  • Points exactly on the decision boundary have predicted probability 0.5, so the model is maximally uncertain between the two classes (see the sketch below).
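
  A minimal sketch checking these facts with hypothetical weights: the boundary is the set of points where z = 0, i.e., where the predicted probability is exactly 0.5:

    import numpy as np

    w = np.array([-1.0, 2.0, 1.0])   # hypothetical [w0, w1, w2]

    def z_score(x):
        return w[0] + np.dot(w[1:], x)

    def classify(x, threshold=0.5):
        p = 1.0 / (1.0 + np.exp(-z_score(x)))
        return int(p >= threshold)

    x_on_boundary = np.array([0.5, 0.0])     # w0 + 2*0.5 + 1*0.0 = 0, so p = 0.5
    print(z_score(x_on_boundary))            # 0.0
    print(classify(np.array([1.0, 1.0])))    # z > 0 -> class 1
    print(classify(np.array([-1.0, -1.0])))  # z < 0 -> class 0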

Summary of Logistic Regression

  • A fundamental model in machine learning.

  • Acts as a building block for more complex models.

  • Pros:

    • Simple and fast.

    • Easy to interpret.

  • Cons:

    • Limited to binary classification (in its basic form; softmax regression, below, extends it to multiple classes).

    • Sensitive to outliers.

    • Less accurate compared to more advanced models.

Multi-Class Classification

  • Predicting a class among three or more categories. The model must predict a probability distribution over all possible classes for each example.

  • Examples include:

    • Classifying product types (e.g., electronics, clothing);

    • Classifying images or news articles.

Softmax Regression

  • A machine learning model used for multi-class classification.

  • Generalizes logistic regression from binary classification to multi-class classification using the softmax function to model the relationship between the input features and the probability of each class.

  • Each class k has its own linear score: zₖ = w₀ₖ + w₁ₖx₁ + ... + wₘₖxₘ.

  • Predicts a probability distribution over the K classes: P(y = k | x) = e^(zₖ) / Σⱼ e^(zⱼ).

  • The loss (categorical cross-entropy) is calculated similarly to logistic regression (see the sketch below).
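
  A minimal numpy sketch of softmax regression's prediction step, with hypothetical weights for K = 3 classes and m = 2 features:

    import numpy as np

    def softmax(z):
        # Numerically stable softmax: subtract the max before exponentiating.
        z = z - np.max(z)
        e = np.exp(z)
        return e / e.sum()

    def predict_proba_multiclass(W, b, x):
        # W: (K, m) weight matrix, one row of coefficients per class;
        # b: (K,) intercepts. z_k = b_k + W[k] . x for each class k.
        z = W @ x + b
        return softmax(z)   # probability distribution over the K classes

    W = np.array([[1.0, -0.5], [0.2, 0.3], [-1.0, 0.8]])  # hypothetical weights
    b = np.array([0.1, -0.2, 0.0])                        # hypothetical intercepts
    x = np.array([0.5, 1.5])
    p = predict_proba_multiclass(W, b, x)
    print(p, p.sum())   # the probabilities sum to 1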