Machine Learning Module 4: Supervised Learning - Regression

Manar Mohaisen
Department of Computer Science


Table of Contents

  • Supervised Learning

  • Regression

  • Linear Regression

  • Overfitting and Regularization

    • Ridge, Lasso, Elastic Net Regularizations

  • Polynomial Regression

  • Batch Gradient Descent, Minibatch Gradient Descent, Stochastic Gradient Descent

  • Questions and Feedback


Supervised Learning

  • The dataset is represented as pairs

    • Notation: (X, y)

    • Where:

      • N = dataset size (number of rows of X)

      • M = number of features (number of columns of X)

    • Each feature vector (a row of X) corresponds to one label in y, as illustrated in the sketch below.
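
For concreteness, a minimal NumPy sketch of this layout (the data values are made up purely for illustration):

    import numpy as np

    # X: N = 4 examples (rows), M = 2 features (columns); values are illustrative
    X = np.array([[1.0, 2.0],
                  [2.0, 0.5],
                  [3.0, 1.5],
                  [4.0, 3.0]])
    # y: one continuous label per row of X
    y = np.array([3.5, 2.1, 4.8, 7.2])

    print(X.shape)  # (4, 2) -> N = 4, M = 2
    print(y.shape)  # (4,)   -> one label per feature vector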


Regression

  • Definition: Regression is a method that models an approximate relationship between a continuous dependent variable and one or more independent variables.


Linear Regression

  • Simple Linear Regression:

    • Involves a single dependent variable and a single independent variable.

    • Commonly fitted using ordinary least squares (OLS), which minimizes the sum of squared errors.

    • The objective is to find the best-fitting line through the data points.


Finding Weight and Bias in OLS

  • To determine the weight and bias in OLS, minimize the total error between the actual outputs and the model’s outputs:

    • The Mean Squared Error (MSE) is the quantity being minimized.

    • Setting the partial derivatives with respect to the coefficients to zero yields the optimal values (a worked sketch follows the formulation below).

Mathematical Formulation

  • The regression model can be expressed as: $y_i = w_0 + w_1 x_i$

    • Minimize $\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \big( y_i - (w_0 + w_1 x_i) \big)^2$
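
Setting those partial derivatives to zero gives the well-known closed-form estimates $w_1 = \sum_i (x_i - \bar{x})(y_i - \bar{y}) / \sum_i (x_i - \bar{x})^2$ and $w_0 = \bar{y} - w_1 \bar{x}$. A minimal NumPy sketch (the data values are illustrative):

    import numpy as np

    # toy data (illustrative values only)
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

    x_bar, y_bar = x.mean(), y.mean()
    # closed-form OLS estimates obtained by zeroing the partial derivatives
    w1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    w0 = y_bar - w1 * x_bar

    mse = np.mean((y - (w0 + w1 * x)) ** 2)
    print(w0, w1, mse)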


Overfitting and Regularization

  • Overfitting:

    • Occurs when a model fits the training data too closely and fails to generalize to unseen data, producing high variance.

    • Solutions:

      • Regularization

      • Reducing model complexity

Regularization Techniques

  • Ridge Regression (L2 Regularization):

    • Adds a penalty proportional to the square of the coefficients.

    • Formulated as:
      $W^* = \arg\min_W \; \|XW - y\|^2 + \lambda \sum_i w_i^2$

  • Lasso Regression (L1 Regularization):

    • Adds a penalty proportional to the absolute value of the coefficients.

    • Formulated as:
      $W^* = \arg\min_W \; \|XW - y\|^2 + \lambda \sum_i |w_i|$

  • Elastic Net Regularization:

    • Combines the Lasso and Ridge penalties: $W^* = \arg\min_W \; \|XW - y\|^2 + \lambda_1 \sum_i |w_i| + \lambda_2 \sum_i w_i^2$; all three penalties are sketched in code below.
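
A minimal sketch of all three penalties, assuming scikit-learn is available (the alpha and l1_ratio values are illustrative, not recommendations):

    import numpy as np
    from sklearn.linear_model import Ridge, Lasso, ElasticNet

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))                       # 100 examples, 5 features
    true_w = np.array([1.5, 0.0, -2.0, 0.0, 0.5])       # illustrative ground truth
    y = X @ true_w + rng.normal(scale=0.1, size=100)

    for model in (Ridge(alpha=1.0),                     # L2 penalty: shrinks weights
                  Lasso(alpha=0.1),                     # L1 penalty: zeroes weights out
                  ElasticNet(alpha=0.1, l1_ratio=0.5)): # mix of L1 and L2
        model.fit(X, y)
        print(type(model).__name__, model.coef_.round(2))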


Polynomial Regression

  • Definition: A special case of linear regression where the model includes polynomial terms up to a specified order.

  • Examples:

    • Order 2 with one feature: $y = w_0 + w_1 x + w_2 x^2$

    • Order 2 with two features: $y = w_0 + w_1 a + w_2 b + w_3 a^2 + w_4 b^2 + w_5 ab$ (expanded in the sketch below)
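
A short sketch, assuming scikit-learn, of how the order-2 expansion with two features produces exactly the terms above (the feature names a and b mirror the example):

    import numpy as np
    from sklearn.preprocessing import PolynomialFeatures

    X = np.array([[2.0, 3.0]])                     # one example with a = 2, b = 3
    poly = PolynomialFeatures(degree=2, include_bias=False)
    X_poly = poly.fit_transform(X)

    print(poly.get_feature_names_out(["a", "b"]))  # ['a' 'b' 'a^2' 'a b' 'b^2']
    print(X_poly)                                  # [[2. 3. 4. 6. 9.]]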

Regularization in Polynomial Regression

  • Polynomial regression can benefit from regularization to improve generalization, especially with noisy data; a pipeline sketch follows below.
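
One way to combine the two, again assuming scikit-learn: chain the polynomial expansion with a Ridge (L2) penalty in a pipeline. The degree, alpha, and synthetic data below are illustrative:

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import Ridge

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(50, 1))           # one feature, noisy quadratic target
    y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 0] ** 2 + rng.normal(scale=0.2, size=50)

    # order-2 polynomial features followed by an L2-regularized linear fit
    model = make_pipeline(PolynomialFeatures(degree=2), Ridge(alpha=1.0))
    model.fit(X, y)
    print(model.predict([[1.5]]))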


Gradient Descent Algorithm

  • Definition: A common optimization algorithm used to train machine learning models by minimizing the cost function.

  • Cost Function: Represents the difference between actual and predicted outputs; also referred to as the loss function or optimization criterion.

Gradient Descent Process

  1. Initialization: Start with weights at time t = 0.

  2. For each iteration, compute the gradient with respect to each parameter.

  3. Update parameters based on the computed gradients:

    • $w(t+1) = w(t) - \eta \, \nabla E(w)$, where $\eta$ is the learning rate (a full loop is sketched below)
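
A from-scratch NumPy sketch of these three steps for the simple linear model above (eta, the epoch count, and the data are illustrative choices):

    import numpy as np

    def batch_gradient_descent(x, y, eta=0.01, epochs=1000):
        w0, w1 = 0.0, 0.0                         # step 1: initialize weights at t = 0
        n = len(x)
        for _ in range(epochs):
            err = (w0 + w1 * x) - y               # model output minus actual output
            grad_w0 = (2.0 / n) * np.sum(err)     # step 2: dMSE/dw0
            grad_w1 = (2.0 / n) * np.sum(err * x) # step 2: dMSE/dw1
            w0 -= eta * grad_w0                   # step 3: w(t+1) = w(t) - eta * grad
            w1 -= eta * grad_w1
        return w0, w1

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
    print(batch_gradient_descent(x, y))           # approaches the OLS solution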

Variants of Gradient Descent
  1. Batch Gradient Descent: Utilizes the entire dataset for each iteration, leading to higher computational costs.

  2. Stochastic Gradient Descent (SGD): Updates weights using a single sample at random, speeding up learning.

  3. Minibatch Gradient Descent: A subset of the training data is used for each update, balancing efficiency and accuracy; the sketch below selects the variant via the batch size.
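
A hedged sketch showing how a single loop covers all three variants, selected by the batch size (batch_size = len(x) gives batch GD, 1 gives SGD, anything in between gives minibatch; all values here are illustrative):

    import numpy as np

    def gradient_descent(x, y, eta=0.01, epochs=200, batch_size=2):
        w0, w1 = 0.0, 0.0
        n = len(x)
        rng = np.random.default_rng(0)
        for _ in range(epochs):
            idx = rng.permutation(n)              # visit samples in random order
            for start in range(0, n, batch_size):
                b = idx[start:start + batch_size] # current minibatch indices
                err = (w0 + w1 * x[b]) - y[b]
                w0 -= eta * (2.0 / len(b)) * np.sum(err)
                w1 -= eta * (2.0 / len(b)) * np.sum(err * x[b])
        return w0, w1

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
    print(gradient_descent(x, y, batch_size=len(x)))  # batch gradient descent
    print(gradient_descent(x, y, batch_size=1))       # stochastic gradient descent
    print(gradient_descent(x, y, batch_size=2))       # minibatch gradient descent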

Tuning the Learning Rate
  • The learning rate affects convergence speed.

    • Large learning rate: Risk of overshooting the global minimum.

    • Small learning rate: Slower convergence.

    • Solutions: tuning the learning rate, using variable (scheduled) learning rates, or implementing momentum-based methods (sketched below).
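
As one example of a momentum-based method, a minimal sketch of classical momentum (mu and eta are illustrative hyperparameters, not recommendations):

    import numpy as np

    def momentum_descent(grad_fn, w, eta=0.01, mu=0.9, steps=200):
        v = np.zeros_like(w)                  # velocity: decaying sum of past gradients
        for _ in range(steps):
            v = mu * v - eta * grad_fn(w)     # momentum damps oscillations across steps
            w = w + v                         # move along the accumulated direction
        return w

    # example: minimize E(w) = (w - 3)^2, whose gradient is 2 * (w - 3)
    print(momentum_descent(lambda w: 2.0 * (w - 3.0), np.array([0.0])))  # -> ~3.0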


Conclusion

  • Regularization techniques and optimization algorithms like gradient descent are vital for improving the performance of machine learning models.

  • Understanding and applying these concepts leads to better model generalization and prediction accuracy. By carefully managing hyperparameters such as the regularization strength and the learning rate, practitioners can improve a model's ability to learn from data while avoiding overfitting, ensuring robustness across applications.