Step 1: Choose mapping function with unknown parameters.
Step 2: Define a loss function, e.g., Mean Squared Error (MSE), that measures how far the model's predictions are from the true values.
Example Mapping:
Price = W₀ + W₁ * sqft + W₂ * year + W₃ * location.
Step 3: Optimize loss function using training data to obtain the best values of parameters.
Examples of optimized weights:
W₀ = 10.2, W₁ = 8.8, W₂ = -6.2, W₃ = 1.4.
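A minimal sketch of Steps 1-3 using scikit-learn on made-up housing data (the feature values, and therefore the fitted weights, are illustrative and will not match the example weights above):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy training data: [sqft, year, location_score] -> price (all values made up)
X = np.array([[1400, 1995, 3.0],
              [2100, 2010, 4.5],
              [ 900, 1980, 2.0],
              [1750, 2005, 3.8]])
y = np.array([240_000, 410_000, 150_000, 320_000])

# Step 1: choose the mapping Price = W0 + W1*sqft + W2*year + W3*location
# Steps 2-3: LinearRegression minimizes the MSE loss to obtain the weights
model = LinearRegression().fit(X, y)
print("W0 (intercept):", model.intercept_)
print("W1..W3:", model.coef_)
```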
Definition: A supervised machine learning algorithm for binary classification tasks.
Despite the name, it is a classification algorithm, not a regression.
A fundamental building block of more complex machine learning models, such as deep neural networks.
Predicts the probability of a binary outcome (e.g., positive or negative, 0 or 1).
n: Number of training examples.
m: Number of features.
xᵢ: Input feature vector for the i-th training example.
yᵢ: Label of the i-th training example (binary value, 0 or 1).
w: Parameters of the logistic regression model (coefficients).
w = [𝑤₀, 𝑤₁, …, 𝑤ₘ].
h_w(x): Mapping function representing the predicted value.
Mapping Function:
P(y = 1|x) = sigmoid(z)
z = w₀ + w₁x₁ + w₂x₂ + … + wₘxₘ
Sigmoid function: f(z) = 1 / (1 + e^−z)
Output of logistic regression is a probability between 0 and 1.
The sigmoid transforms the output of the linear function into values within the range [0, 1].
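A small sketch of the mapping function, assuming a 2-feature example with made-up weights:

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Made-up weights w = [w0, w1, w2] and one input x = [x1, x2]
w = np.array([-1.0, 0.8, 0.3])
x = np.array([2.0, 1.5])

z = w[0] + w[1] * x[0] + w[2] * x[1]   # linear part
p = sigmoid(z)                          # P(y = 1 | x)
print(f"z = {z:.2f}, P(y=1|x) = {p:.3f}")
```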
Different values of w (different parameter values) give different mapping functions.
Which w is the best? This depends on the training data, i.e., the relationship between the features and the labels.
The purpose of training is to find the optimal w so that the predicted probability P(y = 1 | x) is as accurate as possible.
For binary classification:
N: Number of instances.
pᵢ: Predicted probability for class 1 for the i-th instance.
yᵢ: True label for the i-th instance.
For binary classification problem, cross-entropy loss (log-loss) is usually used to obtain optimal coefficients that minimize the difference between the predicted probability distribution of the model and the true label.
For optimization during training:
Loss for binary classification:
Loss(yᵢ, p(yᵢ = 1 | xᵢ)) = - log(p(yᵢ = 1 | xᵢ)) if yᵢ = 1.
Loss(yᵢ, p(yᵢ = 1 | xᵢ)) = - log(1 - p(yᵢ = 1 | xᵢ)) if yᵢ = 0.
Overall loss: Loss = Σᵢ₌₁ᴺ Loss(yᵢ, p(yᵢ = 1 | xᵢ)).
Similar to linear regression, we aim to find the best-fitting model (fit to training data) by minimizing the cross-entropy loss function.
The loss penalizes the model more for predicted probabilities that are far from the true labels.
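A minimal sketch of the binary cross-entropy (log-loss) computation, with made-up labels and predicted probabilities:

```python
import numpy as np

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    # Clip to avoid log(0); the loss grows as predictions drift from the true labels
    p = np.clip(p_pred, eps, 1 - eps)
    return -np.sum(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# Made-up example: 4 instances, true labels and predicted P(y = 1 | x)
y = np.array([1, 0, 1, 0])
p = np.array([0.9, 0.2, 0.6, 0.4])
print("total log-loss:", binary_cross_entropy(y, p))
```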
Gradient Descent Process (often applied):
Random initialization of w
Iteratively update the weights using the gradient of the loss function with respect to the weights, adjusting them in the direction that minimizes the loss.
This process continues until convergence is reached, meaning the changes in the loss function are sufficiently small.
At this point, the final weights can be used to make predictions on new data, allowing us to classify instances based on the learned model.
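A rough sketch of this gradient descent loop for logistic regression; the learning rate, iteration count, tolerance, and toy data are illustrative choices, not values prescribed by the notes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, lr=0.1, n_iters=1000, tol=1e-6):
    n, m = X.shape
    Xb = np.hstack([np.ones((n, 1)), X])      # prepend a column of 1s for w0
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.01, size=m + 1)    # random initialization of w
    prev_loss = np.inf
    for _ in range(n_iters):
        p = sigmoid(Xb @ w)                   # predicted P(y = 1 | x)
        grad = Xb.T @ (p - y) / n             # gradient of the mean log-loss w.r.t. w
        w -= lr * grad                        # adjust weights against the gradient
        loss = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
        if abs(prev_loss - loss) < tol:       # convergence: loss barely changes
            break
        prev_loss = loss
    return w

# Tiny made-up dataset: one feature, label is 1 when the feature is large
X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])
w = train_logistic_regression(X, y)
print("learned weights [w0, w1]:", w)
```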
Sign of Coefficients:
Positive: Feature positively impacts probability of outcome being 1.
Negative: Feature negatively impacts probability of outcome.
Magnitude of Coefficients:
Large coefficient indicates strong impact on outcome probability.
Example: In credit card fraud detection:
Transaction amount coefficient = 2.5 (positive impact);
Transaction location coefficient = -1.2 (negative impact);
Time of day coefficient = 3.8 (positive impact).
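A small sketch of reading off the sign and magnitude of coefficients, using the hypothetical fraud-detection coefficients above; the intercept and the (assumed normalized) feature values are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Coefficients from the fraud example; the intercept w0 is an assumed value
w0, w_amount, w_location, w_time = -4.0, 2.5, -1.2, 3.8

def fraud_probability(amount, location_risk, time_of_day):
    z = w0 + w_amount * amount + w_location * location_risk + w_time * time_of_day
    return sigmoid(z)

# Increasing a feature with a positive coefficient raises the predicted probability
print(fraud_probability(0.5, 0.2, 0.3))   # baseline
print(fraud_probability(1.5, 0.2, 0.3))   # larger amount -> higher probability
# Increasing a feature with a negative coefficient lowers it
print(fraud_probability(0.5, 1.2, 0.3))   # larger location value -> lower probability
```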
Logistic regression learns a linear decision boundary.
Applying a decision threshold (typically 0.5) to the predicted probability yields a class label; the resulting boundary is linear in the features.
Points exactly on the decision boundary have predicted probability 0.5, so the model cannot decide between the two classes.
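A brief sketch of the linear decision boundary in two dimensions, with made-up weights; the boundary is the set of points where z = 0, i.e., P(y = 1 | x) = 0.5:

```python
import numpy as np

# Made-up weights: z = w0 + w1*x1 + w2*x2
w0, w1, w2 = -1.0, 2.0, 1.0

def predict_class(x1, x2, threshold=0.5):
    z = w0 + w1 * x1 + w2 * x2
    p = 1.0 / (1.0 + np.exp(-z))
    return int(p >= threshold)

# The boundary is the line w0 + w1*x1 + w2*x2 = 0, i.e., x2 = -(w0 + w1*x1) / w2
x1 = 0.3
x2_on_boundary = -(w0 + w1 * x1) / w2
print(predict_class(x1, x2_on_boundary + 0.1))  # just above the line -> class 1
print(predict_class(x1, x2_on_boundary - 0.1))  # just below the line -> class 0
```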
A fundamental model in machine learning.
Acts as a building block for more complex models.
Pros:
Simple and fast.
Easy to interpret.
Cons:
Limited to binary classification.
Sensitive to outliers.
Less accurate compared to more advanced models.
Predicting classes among three or more categories. The model needs to predict a probability distribution over all possible classes for each example.
Examples include:
Classifying product types (e.g., electronics, clothing);
Classifying images or news articles.
A machine learning model used for multi-class classification (softmax regression, also called multinomial logistic regression).
Generalizes logistic regression from binary classification to multi-class classification using the softmax function to model the relationship between the input features and the probability of each class.
For each class k, a linear score zₖ = w₀ₖ + w₁ₖx₁ + ... + wₘₖxₘ is computed.
The softmax function converts these scores into a probability distribution over the K classes: P(y = k | x) = e^(zₖ) / Σⱼ e^(zⱼ).
The loss is calculated analogously to logistic regression, using the cross-entropy over all K classes.
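A minimal sketch of the softmax mapping for K classes, assuming made-up per-class weights and a single input:

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the output sums to 1
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Made-up example: m = 2 features, K = 3 classes
# Row W[k] = [w0k, w1k, w2k] holds the weights for class k (intercept first)
W = np.array([[ 0.1,  1.2, -0.5],
              [-0.3,  0.4,  0.9],
              [ 0.2, -0.8,  0.3]])
x = np.array([1.0, 0.5, 2.0])   # [1, x1, x2] with the bias term prepended

z = W @ x                        # one score z_k per class
p = softmax(z)                   # probability distribution over the K classes
print("class scores:", z)
print("class probabilities:", p, "sum =", p.sum())
```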