8-Linear Regression

Regression Overview
  • Definition: Regression models in supervised learning predict a continuous numerical label based on a set of features.

  • Examples of Regression Applications:

    • Predicting customer credit card activities from demographics and historical data.

    • Estimating driving/pickup time from point A to point B.

  • Classification vs Regression Models:

    • Classification models predict probabilities of categories (e.g., class A, B).

    • Regression models predict numerical values.

  • Note: Decision trees can also be used for regression tasks, with specific methods for determining root and leaf nodes based on variance. If the label you are predicting is categorical (non-numerical), the task calls for a classification model; if the label is numerical, it calls for a regression model.

Linear Regression
  • Definition: A supervised learning algorithm used to predict continuous values (label) based on input features.

  • Functionality: Models the relationship between input features (X) and output values (Y) as a linear function.

  • Importance: Considered fundamental in machine learning; serves as a basis for more complex models (e.g., deep learning).

Supervised Learning Components
  • Core Elements:

    • Parameters: Coefficients that define the mapping function.

    • Features (X): Input variables used for prediction.

    • Outputs: The predicted continuous values.

    • Labels: Actual outcomes associated with the training data.

    • Cost and Loss: Metrics used for evaluating model performance.

Notation for Linear Regression
  • Variables:

    • n: Number of training examples

    • m: Number of features

    • xi: Feature vector of the ith training example

    • yi: Label of the ith training example

    • w: Set of parameters (coefficients) of the regression model ([w0, w1, …, wm]).

    • hw(x): Predicted value based on the mapping function.

Learning Process in Linear Regression
  1. Choose a Mapping Function:

    • The parameters are unknown and are initialized with random values.

  2. Define Loss Function:

    • Assess the difference between predicted and actual values.

  3. Optimize the Loss Function:

    • Minimize the loss to obtain the best parameter values.

    • Example Equation: Price = w0 + w1 * sqft + w2 * year + w3 * location.

Step 1: Mapping Function
  • For linear regression, the mapping function is a linear combination of the features: hw(x) = w0 + w1*x1 + w2*x2 + … + wm*xm (a minimal sketch follows below).
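
  • Sketch (illustrative, not from the notes): a minimal NumPy version of this mapping function, using the house-price example from the learning-process steps; the coefficient and feature values are made up.

      import numpy as np

      def h(w, x):
          """Mapping function hw(x) = w0 + w1*x1 + ... + wm*xm."""
          return w[0] + np.dot(w[1:], x)

      # Illustrative: Price = w0 + w1*sqft + w2*year + w3*location_score
      w = np.array([50_000.0, 120.0, 30.0, 10_000.0])  # [w0, w1, w2, w3], made-up values
      x = np.array([1_500.0, 1995.0, 0.8])             # one example's features
      print(h(w, x))                                   # predicted price for this example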

Step 2: Loss Function in Linear Regression
  • Functionality: Measures the error between predicted values and actual values.

  • Objective: Minimize the sum of squared errors to find the best-fitting hyperplane (optimal w).

  • Loss Function Formula: Loss(w) = sum over i = 1..n of (hw(xi) - yi)^2, the sum of squared differences between the predicted values hw(xi) and the true labels yi (a short sketch follows).
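
  • Sketch (illustrative): the same loss in NumPy, assuming X holds one training example per row and y the true labels; the tiny dataset below is made up.

      import numpy as np

      def sse_loss(w, X, y):
          """Loss(w) = sum over i of (hw(xi) - yi)^2."""
          predictions = w[0] + X @ w[1:]   # hw(x) for every training example
          errors = predictions - y
          return np.sum(errors ** 2)

      # Made-up data: 3 examples, 2 features each
      X = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0]])
      y = np.array([5.0, 4.0, 8.0])
      w = np.array([1.0, 1.0, 1.0])        # [w0, w1, w2]
      print(sse_loss(w, X, y))             # 11.0 for these made-up values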

Step 3: Optimization with Gradient Descent
  • Gradient descent: Fundamental optimization algorithm to minimize the loss function in machine learning and find the optimal parameters for a model.

  • Widely used in training neural networks, linear regression, and logistic regression.

  • Steps Involved:

    • Compute the gradient of the loss function with respect to parameters.

    • Update parameters: adjust parameters in the direction that reduces the loss.

    • Repeat until convergence. Evaluate the model performance on a validation dataset to ensure that it generalizes well to unseen data.

Gradient Descent Steps
  1. Pick an initial value for w.

  2. Calculate the gradient of the loss function with respect to the parameters.

  3. Update parameters using:

    • w = w - learning_rate * gradient

  4. Repeat until a stopping criterion is met (see the sketch after this list).
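
  • Sketch (illustrative): gradient descent on the sum-of-squared-errors loss for linear regression; the learning rate, iteration count, and data below are assumptions for the example.

      import numpy as np

      def gradient_descent(X, y, lr=0.001, n_iters=5000):
          """Fit [w0, w1, ..., wm] by gradient descent on the sum of squared errors."""
          n, m = X.shape
          Xb = np.hstack([np.ones((n, 1)), X])   # prepend a column of 1s for the intercept w0
          w = np.random.randn(m + 1) * 0.01      # step 1: pick an initial (random) value for w
          for _ in range(n_iters):
              errors = Xb @ w - y                # hw(xi) - yi for every example
              grad = 2 * Xb.T @ errors           # step 2: gradient of the loss w.r.t. w
              w = w - lr * grad                  # step 3: w = w - learning_rate * gradient
          return w                               # step 4: here the stopping criterion is a fixed iteration count

      # Made-up data generated from y = 3 + 2*x plus noise
      rng = np.random.default_rng(0)
      X = rng.uniform(0, 1, size=(100, 1))
      y = 3 + 2 * X[:, 0] + rng.normal(0, 0.1, size=100)
      print(gradient_descent(X, y))              # should land near [3, 2]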

Gradient Descent Worked Example
  • Starting point: learning rate (lr) is set to 0.1, initial w = -10.

  • Toy loss function: loss = w^2 + 2, so the gradient simplifies to d(loss)/dw = 2w (a short sketch of the updates follows).
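
  • Sketch of this example: with loss = w^2 + 2 each update multiplies w by (1 - 0.1*2) = 0.8, so w shrinks toward the minimum at w = 0, where the loss is 2.

      w, lr = -10.0, 0.1          # initial w and learning rate from the example
      for step in range(10):
          grad = 2 * w            # d(loss)/dw for loss = w**2 + 2
          w = w - lr * grad       # gradient descent update: w = w - lr * gradient
          print(step, round(w, 4), round(w**2 + 2, 4))
      # w approaches 0, where the loss reaches its minimum value of 2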

Implications of Learning Rate
  • Effects:

    • Too small: Slow convergence.

    • Too high: May overshoot the minimum, preventing convergence.

Linear Regression Interpretability
  • Coefficient Interpretation:

    • Coefficients (w1, w2, ..., wm) indicate the effect of features on the predicted label (y).

    • Example Interpretation: A coefficient (e.g., w1 = 2) suggests a one-unit increase in x1 leads to a two-unit increase in y, holding the other features constant.

    • The sign (+ or -) of the coefficients indicates the direction of the relationship between the features and the label.

  • Normalization: When features are normalized, the magnitude of a coefficient indicates the importance of that feature for the label: a feature with a larger (absolute) coefficient has a greater impact on the prediction of y than features with smaller coefficients (see the sketch below).
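
  • Sketch (illustrative, scikit-learn): fitting on standardized features and reading off the coefficients; the dataset and feature names are made up.

      import numpy as np
      from sklearn.linear_model import LinearRegression
      from sklearn.preprocessing import StandardScaler

      rng = np.random.default_rng(0)
      X = rng.normal(size=(200, 3))                 # made-up features x1, x2, x3
      y = 2 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(0, 0.1, size=200)   # x3 is irrelevant

      X_scaled = StandardScaler().fit_transform(X)  # normalize so magnitudes are comparable
      model = LinearRegression().fit(X_scaled, y)

      for name, coef in zip(["x1", "x2", "x3"], model.coef_):
          print(f"{name}: {coef:+.3f}")             # sign = direction, magnitude ~ importance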

Model Complexity in Machine Learning
  • Machine learning model is fitted to a training dataset to learn the model parameters.

  • Aim: Avoid underfitting (too simplistic) and overfitting (too complex).

  • Overfitting: When a model learns the training data too well instead of generalizing from it. Causes: excessive model complexity, too many features, insufficient training data.

  • Underfitting: When a model is too simple to capture the underlying patterns in the data, leading to poor performance on both training and test datasets. Causes: Insufficient model complexity, lack of relevant features, or overly restrictive assumptions.

  • To strike a balance, it's essential to select the right model complexity and apply techniques such as cross-validation to ensure the model generalizes well (a short sketch follows).
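
  • Sketch (illustrative, scikit-learn): 5-fold cross-validation of a linear regression; the data below is made up, and the mean held-out R^2 is used as a rough check of generalization.

      import numpy as np
      from sklearn.linear_model import LinearRegression
      from sklearn.model_selection import cross_val_score

      rng = np.random.default_rng(0)
      X = rng.normal(size=(100, 2))
      y = 1.5 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.2, size=100)

      # Average R^2 across the 5 held-out folds estimates how well the model generalizes
      scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
      print(scores.mean())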

Strategies to Avoid Overfitting
  • Increase Data: Add more training examples (rows) rather than more features (columns). Obtaining more training data may not always be feasible.

  • Feature Selection: Choose the most relevant features, potentially using domain knowledge or filter methods.

  • Regularization: Add a penalty term to the loss function (e.g., Lasso regression).

Lasso Regression
  • Concept: Lasso regression (L1 regularization) encourages a sparse model with only a few non-zero coefficients. Features that are not important/relevant end up with a coefficient of exactly 0. This aids feature selection, as it effectively reduces the number of variables in the model, leading to improved interpretability and reduced overfitting.

  • Rule of Thumb: Ensure n (number of examples) is greater than 10 times m (number of features).

  • Points:

    • Irrelevant/uninformative features will have a zero coefficient.

    • A sparse regression model is also good for explanation: it helps identify the informative features.

    • Lasso regression is a simple technique to reduce model complexity and prevent the overfitting that may result from simple linear regression (see the sketch below).
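
  • Sketch (illustrative, scikit-learn): Lasso on made-up data where only 2 of 10 features are informative; the penalty strength alpha is an assumption. Most coefficients come out exactly 0 (sparse).

      import numpy as np
      from sklearn.linear_model import Lasso

      rng = np.random.default_rng(0)
      X = rng.normal(size=(200, 10))                 # 10 features, only the first 2 are informative
      y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.1, size=200)

      lasso = Lasso(alpha=0.1).fit(X, y)             # alpha sets the strength of the L1 penalty
      print(np.round(lasso.coef_, 3))                # most coefficients are exactly 0
      print("non-zero features:", np.flatnonzero(lasso.coef_))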

Linear Regression Summary
  • Applications: House price prediction, sales forecasting, stock price prediction.

  • Pros: Simple, fast, interpretable.

  • Cons: Assumes linear relationships, sensitive to outliers, not suitable for complex/non-linear patterns.