Regression Notes

Regression

Regression Overview

  • Regression is an important machine learning problem that serves as a good starting point for understanding the field.

  • Regression models are used to predict a continuous value.

  • It is a statistical method to model the relationship between a dependent variable (target) and one or more independent variables (predictors).

  • Example: Predicting house prices based on features like size.

Types of Regression

  • Simple Linear Regression

  • Multiple Linear Regression

  • Polynomial Regression

  • Logistic Regression

  • Ridge Regression

  • Lasso Regression

  • Elastic Net Regression

  • Support Vector Regression

  • Decision Tree Regression

  • And many more…

Linear Regression

  • Useful for finding the relationship between two continuous variables: a predictor (independent) and a response (dependent) variable.

  • It looks for a statistical, not deterministic, relationship.

Deterministic vs. Statistical Relationship
  • Deterministic Relationship: One variable can be expressed exactly in terms of the other. For example, converting temperature from Celsius to Fahrenheit.

  • Statistical Relationship: One variable cannot be determined exactly from the other; the relationship holds only as a trend with scatter around it. For example, the relationship between height and weight.

Core Idea
  • Obtain a straight line or hyperplane that best fits a set of data points.

  • The best fit line minimizes the total prediction error (residual error), where each residual is the vertical distance between a data point and the regression line.

Example: Cricket Chirps and Temperature
  • Crickets chirp more frequently on hotter days.

  • Goal: Learn a model to predict this relationship using collected data on chirps-per-minute and temperature.

Visualizing the Data
  • Plot the data to examine the relationship between chirps and temperature.

  • Determine if the relationship appears linear.

Regression Line/Hyperplane
  • Represents the linear relationship between the variables.

  • Data points are scattered around this line.

  • Residual error is the difference between actual and predicted values.

Equation of a Linear Model
  • In machine learning, the equation is written as: $y' = w_1 x_1 + b$

    • $y'$ is the predicted label (output).

    • $w_1$ is the weight of feature 1 (same as the slope).

    • $x_1$ is feature 1 (known input).

    • $b$ is the bias (y-intercept), sometimes referred to as $w_0$.

Using the Model for Prediction
  • To predict the temperature ($y'$) for a new chirps-per-minute value ($x_1$), substitute the value into the model.
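A minimal sketch of that substitution. The weight and bias values are hypothetical placeholders chosen only for illustration; they are not fitted to any cricket data.

```python
def predict(x1, w1, b):
    """Return the predicted label y' = w1 * x1 + b."""
    return w1 * x1 + b

# Hypothetical parameters (assumed for illustration, not learned from data).
w1 = 0.25   # assumed weight: temperature change per chirp-per-minute
b = 4.0     # assumed bias (y-intercept)

chirps_per_minute = 80
predicted_temperature = predict(chirps_per_minute, w1, b)
print(predicted_temperature)  # 24.0
```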

Models with Multiple Features
  • More sophisticated models can use multiple features, each with a separate weight ($w_1, w_2$, etc.).

  • Example: Predicting house price based on living area, number of bedrooms, and number of bathrooms:
    $y' = w_1 x_1 + w_2 x_2 + w_3 x_3 + b$
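The three-feature house price model can be sketched as a weighted sum. The weights, bias, and house values below are made up for illustration and do not come from any real data set.

```python
def predict_multi(features, weights, bias):
    """Return y' = w1*x1 + w2*x2 + ... + b for one example."""
    return sum(w * x for w, x in zip(weights, features)) + bias

# Hypothetical weights for (living area, bedrooms, bathrooms).
weights = [100.0, 5000.0, 3000.0]
bias = 20000.0

house = [1500.0, 3.0, 2.0]   # assumed example: 1500 area units, 3 bedrooms, 2 bathrooms
price = predict_multi(house, weights, bias)
print(price)  # 191000.0
```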

Training and Loss

Training a Model
  • Training involves learning (determining) good values for all the weights and the bias from labeled examples.

  • The goal is to represent the function/hypothesis in a computer such that input x is well mapped to output y.

  • Machine learning represents the equation as: $h_\theta(x) = w_1 x$

Empirical Risk Minimization
  • In supervised learning, the algorithm builds a model (hypothesis/function) by examining many examples (x inputs and corresponding y outputs).

  • It attempts to find a model that minimizes loss; this process is called empirical risk minimization.

Loss Function
  • Measures the difference between actual values and predicted values.

  • Loss is the penalty for a bad prediction.

  • If the model's prediction is perfect, the loss is zero; otherwise, the loss is greater.

Cost Function
  • The process of minimizing the loss involves choosing the optimal parameters W and b.

  • The goal is to find a set of weights and biases that have low loss, on average, across all examples.

  • The function that measures the average loss over all training data points is defined as the Cost Function.

  • (Figure: arrows represent loss and blue lines represent predictions; a high-loss model has longer arrows, i.e. larger errors.)

Loss Function and Normal Distribution
  • Linear regression assumes the actual output $y$ is normally distributed around the predicted value $h_\theta(x)$ (or $y'$).

  • The loss function can be obtained by using Maximum Likelihood Estimation (MLE).

  • Conceptually, maximizing the likelihood is equivalent to minimizing the cost function defined by the mean squared error (MSE).
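The link between the normality assumption and MSE can be sketched as follows. This assumes independent examples and a fixed noise variance $\sigma^2$; the symbols $\sigma^2$ and $L(\theta)$ are introduced here for the sketch and are not part of the notes.

```latex
\begin{aligned}
p(y \mid x; \theta) &= \frac{1}{\sqrt{2\pi\sigma^2}}
  \exp\!\left(-\frac{(y - h_\theta(x))^2}{2\sigma^2}\right) \\
\log L(\theta) &= \sum_{(x,y) \in D} \log p(y \mid x; \theta)
  = -\frac{n}{2}\log(2\pi\sigma^2)
    - \frac{1}{2\sigma^2} \sum_{(x,y) \in D} (y - h_\theta(x))^2
\end{aligned}
```

The first term does not depend on $\theta$, so maximizing $\log L(\theta)$ amounts to minimizing the sum of squared errors, and therefore the MSE.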

Squared Loss (L2 Loss)
  • The squared loss for a single example is: $(y - y')^2$

  • Why squared?

    • Emphasizes larger errors (big mistakes are penalized more).

    • Ensures errors do not cancel out (negative errors won’t offset positive ones).
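Both reasons can be seen in a small sketch. The error values are made up for illustration.

```python
# Hypothetical prediction errors (y - y') for four examples.
errors = [2.0, -2.0, 1.0, -1.0]

signed_total = sum(errors)                   # positive and negative errors cancel
squared_total = sum(e ** 2 for e in errors)  # every error adds to the total

print(signed_total)   # 0.0
print(squared_total)  # 10.0

# Doubling an error quadruples its squared loss.
small, large = 1.0 ** 2, 2.0 ** 2
print(small, large)   # 1.0 4.0
```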

Mean Square Error (MSE)
  • MSE is the average squared loss per example over the whole dataset.

  • To calculate MSE, sum up all the squared losses for individual examples and then divide by the number of examples: $MSE = \frac{1}{n} \sum_{(x,y) \in D} (y - h_\theta(x))^2$

    • (x, y) is an example in the dataset D.

    • x is the set of features.

    • y is the example's label.

    • $h_\theta(x)$ is a function of the weights and bias in combination with the set of features x.

    • D is a data set containing many labeled examples (x, y) pairs.

    • n is the number of examples in D.
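The definition translates directly into code. The data set `D` and hypothesis `h` below are hypothetical, chosen so the result is easy to check by hand.

```python
def mse(examples, h):
    """Average squared loss of hypothesis h over a list of (x, y) examples."""
    return sum((y - h(x)) ** 2 for x, y in examples) / len(examples)

def h(x):
    """Hypothetical hypothesis h(x) = 2x + 1 (assumed, not fitted)."""
    return 2.0 * x + 1.0

# Hypothetical data set D: the first two points lie on the line, the third is off by 1.
D = [(1.0, 3.0), (2.0, 5.0), (3.0, 8.0)]

print(mse(D, h))  # squared losses are 0, 0, 1, so the mean is 1/3
```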

Factor in Cost Function
  • Many formulations of the cost function introduce a 1/2 factor in optimization:
    $MSE = \frac{1}{2n} \sum_{(x,y) \in D} (y - h_\theta(x))^2$

  • Reasons:

    • The factor 1/2 cancels out the factor 2 in gradient derivation (simplifies gradient descent updates).

    • 1/2 is a constant multiplier; it does not change the optimal value but just scales the cost function.
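The cancellation can be seen by differentiating the halved cost for the single-feature model $y' = w_1 x_1 + b$. The halved cost is written as $J$ here, a naming choice for this sketch.

```latex
\begin{aligned}
J(w_1, b) &= \frac{1}{2n} \sum_{(x,y) \in D} \bigl(y - (w_1 x_1 + b)\bigr)^2 \\
\frac{\partial J}{\partial w_1}
  &= \frac{1}{2n} \sum_{(x,y) \in D} 2\,\bigl(y - (w_1 x_1 + b)\bigr)(-x_1)
   = -\frac{1}{n} \sum_{(x,y) \in D} \bigl(y - (w_1 x_1 + b)\bigr)\, x_1
\end{aligned}
```

The 2 produced by the power rule cancels the 1/2, leaving a gradient with no extra constant.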

Reducing Loss (Normal Equation)

  • To train a model, we need a good way to reduce the model’s loss.

  • Optimize the weights using matrix algebra, i.e., the ordinary least squares (OLS) solution.

  • We want the weight values that make the model best fit the data, i.e., the $W$ that minimizes the MSE.

  • The linear regression model is $h(x) = XW + b$, and we assume $b$ is absorbed into $W$ by adding a column of ones to the feature matrix $X$.

  • Setting the derivative of the MSE with respect to $W$ to zero gives $W = (X^T X)^{-1} X^T Y$

  • With OLS there is no need to iterate, but it is computationally expensive due to the matrix inversion.

  • If the dataset is huge, forming and inverting $X^T X$ becomes costly; its size grows with the number of features.
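A short sketch of where the closed-form solution comes from, writing the MSE in matrix form with the bias folded into $W$ and assuming $X^T X$ is invertible:

```latex
\begin{aligned}
MSE(W) &= \frac{1}{n} (Y - XW)^T (Y - XW) \\
\nabla_W \, MSE(W) &= -\frac{2}{n} X^T (Y - XW) = 0 \\
X^T X \, W &= X^T Y \\
W &= (X^T X)^{-1} X^T Y
\end{aligned}
```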

Reducing Loss - Iterative Approach

  • An iterative approach is one widely used method for reducing loss and is as easy and efficient as walking down a hill.

  • You'll start with a wild guess ("The value of $w_1$ is 0.") and wait for the system to tell you what the loss is.

  • Then, you'll try another guess ("The value of $w_1$ is 0.5.") and see what the loss is.

  • The real trick is trying to find the best possible model as efficiently as possible.

  • A Machine Learning model is trained by starting with an initial guess for the weights and bias and iteratively adjusting those guesses until learning the weights and bias with the lowest possible loss.

  • Usually, you iterate until overall loss stops changing, or at least changes extremely slowly.

  • When that happens, we say that the model has converged.

How to Reduce Loss

  • The derivative of $(y - y')^2$ with respect to the weights and biases tells us how loss changes for a given example.

  • It is simple to compute, and the squared loss is convex.

  • So we repeatedly take small steps in the direction that minimizes loss.

  • We call these gradient steps (they are really negative gradient steps, since we move against the gradient).

  • This strategy is called Gradient Descent.

  • Iteratively update $W$ until it reaches the $W$ that minimizes the cost function.

  • The learning rate controls how much $W$ is updated in each step of gradient descent.

    • This value can be tuned.

    • Too small: slow to converge.

    • Too big: may overshoot (bypass) the minimum.
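A minimal sketch of batch gradient descent for the single-feature model, using the cost with the 1/2 factor. It uses the five (x, y) points from the worked example in these notes. The learning rate, the fixed step count, and the function name are assumptions made for this sketch; a fixed number of steps stands in for a proper convergence check.

```python
def gradient_descent(xs, ys, learning_rate=0.05, steps=5000):
    """Fit y' = w*x + b by batch gradient descent on the 1/(2n) squared-error cost."""
    w, b = 0.0, 0.0                      # initial guess
    n = len(xs)
    for _ in range(steps):
        # Prediction error (y' - y) for every example under the current w, b.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        # Step against the gradient, scaled by the learning rate.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

xs = [1, 2, 3, 4, 5]
ys = [2, 2.8, 3.6, 4.5, 5.1]
w, b = gradient_descent(xs, ys)
print(round(w, 4), round(b, 4))  # close to the least-squares values 0.79 and 1.23
```

With a much larger learning rate the updates on this data grow instead of shrinking, which is the overshooting case described above.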

Linear Regression Summarized
  • Hypothesis: $y' = w_1 x_1 + w_2 x_2 + \dots + b$ or $y' = w_0 x_0 + w_1 x_1 + w_2 x_2 + \dots + w_n x_n$ (with $x_0 = 1$, so $w_0$ plays the role of $b$)

  • Parameters: $w_0, w_1, w_2, w_3, \dots, w_n$

  • Cost Function: $J(w_0, w_1, \dots) = \frac{1}{2} \sum_{(x,y) \in D} (y - y')^2$

  • Where $J(W)$ equals the MSE up to a constant factor, so both have the same minimizer

  • Goal: minimize $J(w_0, w_1, \dots, w_n)$

Polynomial Regression (Extension of LR)

  • Polynomial regression (PR): Extension of LR

  • Use when the dataset is too complex to fit with a straight line (the original linear regression model would obviously underfit).

  • The hypothesis takes the form $h(x) = w_0 + w_1 x + w_2 x^2 + \dots + w_n x^n$. Here, the nth power indicates the degree of the polynomial.

  • PR is a type of linear regression. Although its features are non-linear in $x$, the model is still linear in its weight parameters.
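Because the model stays linear in the weights, polynomial regression can be fitted with ordinary least squares on expanded features. The sketch below assumes NumPy is available and uses noise-free data generated from a known quadratic, so the recovered weights can be checked; the helper name `polynomial_features` is hypothetical.

```python
import numpy as np

def polynomial_features(x, degree):
    """Return the matrix with columns [1, x, x^2, ..., x^degree]."""
    return np.vander(x, degree + 1, increasing=True)

# Hypothetical data generated from y = 1 + 2x + 3x^2 (no noise).
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x + 3.0 * x**2

X = polynomial_features(x, degree=2)

# Ordinary least squares on the expanded features.
weights, *_ = np.linalg.lstsq(X, y, rcond=None)
print(weights)  # approximately [1. 2. 3.]
```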

Flexibility and Overfitting
  • Flexibility: high-order polynomials are flexible enough to follow very complex data.

  • Overfitting: often associated with very large estimated weight parameters.

LR: Example in calculation

    x    y
    1    2
    2    2.8
    3    3.6
    4    4.5
    5    5.1

Question:

  1. Find the values of $W$ and $b$ in the hypothesis $h(x) = Wx + b$ using the least squares method

    Compute $W = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$

    Compute $b = \bar{y} - W\bar{x}$

    Write the final hypothesis equation

  2. Make predictions using the hypothesis, i.e., use the calculated $W$ and $b$ to predict $y$ values for given $x$

  3. Compute the MSE: $MSE = \frac{1}{n} \sum (y_i - h(x_i))^2$

Example Calculation
  1. Compute the means:

    • Means for x and y variables
      $\bar{x} = \frac{1 + 2 + 3 + 4 + 5}{5} = 3$
      $\bar{y} = \frac{2 + 2.8 + 3.6 + 4.5 + 5.1}{5} = 3.6$

  2. Compute W (Slope)

    • Plugging the means calculated above into the slope equation gives $W$:

    $W = \frac{7.9}{10} = 0.79$

  3. Compute b (intercept)

    • Compute the intercept b
      $b = \bar{y} - W\bar{x} = 3.6 - (0.79 \times 3) = 1.23$

  4. Write the final hypothesis

    • Complete Hypothesis Function
      $h(x) = 0.79x + 1.23$

  5. Make Prediction

    • Given the linear regression equation we can plug in values of x and determine the predicted values of y

  6. Use MSE (without 1/2 factor) for evaluation

    • Calculate the mean squared error of the hypothesis
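The six steps can be checked with a short script that uses the five data points from the table and the least squares formulas from the question. The variable names are choices made for this sketch.

```python
xs = [1, 2, 3, 4, 5]
ys = [2, 2.8, 3.6, 4.5, 5.1]
n = len(xs)

# Step 1: means.
x_mean = sum(xs) / n
y_mean = sum(ys) / n

# Steps 2-3: slope and intercept from the least squares formulas.
numerator = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
denominator = sum((x - x_mean) ** 2 for x in xs)
W = numerator / denominator
b = y_mean - W * x_mean

# Steps 4-5: predictions from h(x) = W*x + b.
predictions = [W * x + b for x in xs]

# Step 6: MSE without the 1/2 factor.
mse = sum((y - p) ** 2 for y, p in zip(ys, predictions)) / n

print(round(W, 2), round(b, 2))             # 0.79 1.23
print([round(p, 2) for p in predictions])   # [2.02, 2.81, 3.6, 4.39, 5.18]
print(round(mse, 4))                        # 0.0038
```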

What if you have two or more x features?

  • How to fit and predict with a model that has multiple features

Example in calculation

  • Calculating the weight ($W = (X^T X)^{-1} X^T Y$) with multiple features requires multiple steps that are laid out within the example. Most steps are best done with computation, not by hand.
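A minimal sketch of that computation, assuming NumPy is available. The data is hypothetical and generated from a known linear rule so the result can be checked. The code solves the linear system rather than forming the inverse explicitly, which gives the same $W$ when $X^T X$ is invertible.

```python
import numpy as np

# Hypothetical data: two features per example, generated from
# y = 1 + 2*x1 + 3*x2 so the expected bias and weights are known.
features = np.array([[1.0, 2.0],
                     [2.0, 1.0],
                     [3.0, 4.0],
                     [4.0, 3.0],
                     [5.0, 5.0]])
Y = 1.0 + 2.0 * features[:, 0] + 3.0 * features[:, 1]

# Prepend a column of ones so the bias b is learned as the first weight.
X = np.column_stack([np.ones(len(features)), features])

# Solve (X^T X) W = X^T Y, equivalent to W = (X^T X)^{-1} X^T Y.
W = np.linalg.solve(X.T @ X, X.T @ Y)
print(W)  # approximately [1. 2. 3.] (bias, w1, w2)

# Predict for a new, made-up example with x1 = 2 and x2 = 2.
new_example = np.array([1.0, 2.0, 2.0])
print(new_example @ W)  # approximately 11.0
```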

Ridge & Lasso Regression

  • Another extension of LR that aims to prevent overfitting

  • Useful when the dataset has many potential explanatory variables

  • E.g., a dataset used by a financial institution to predict which candidates are likely to make their loan payments

  • Some variables are collinear

  • It is not clear which variables are significant predictors of the target variable

Ridge Regression

  • Prevents overfitting by adding an L2 penalty (the squared L2 norm, i.e. Euclidean norm, of the coefficients) to the cost function: $J(w) = MSE + \lambda \sum w^2$

    • $\lambda$ = penalty strength, a pre-chosen constant

    • If the coefficients take on large values, the optimization function is penalized

    • Equivalently, for some constant $c > 0$, $\sum w^2 < c$, i.e. constraining the sum of squared coefficients

Shrinking Coefficients
  • The added penalty term shrinks the coefficients but does not force them to be zero (it retains all features)

  • Best used when there are many correlated independent variables (multicollinearity)

  • Cross-validation is often used to choose $\lambda$: we select the $\lambda$ that yields the smallest cross-validation prediction error
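The shrinking effect can be illustrated with a closed-form ridge fit, assuming NumPy is available. For the cost $MSE + \lambda \sum w^2$ with $MSE = \frac{1}{n}\|Y - XW\|^2$, setting the gradient to zero gives $(X^T X + n\lambda I) W = X^T Y$. The sketch omits the bias term for simplicity, and both the data and the function name `ridge_fit` are hypothetical.

```python
import numpy as np

def ridge_fit(X, Y, lam):
    """Minimize (1/n)*||Y - XW||^2 + lam*sum(W^2), with no bias term."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ Y)

# Hypothetical data with two strongly correlated features.
X = np.array([[1.0, 1.1],
              [2.0, 1.9],
              [3.0, 3.2],
              [4.0, 3.9]])
Y = np.array([2.0, 4.1, 6.2, 7.9])

for lam in [0.0, 0.1, 1.0, 10.0]:
    W = ridge_fit(X, Y, lam)
    # The sum of squared weights shrinks as lam grows, but no weight becomes exactly 0.
    print(lam, W, np.sum(W ** 2))
```

Choosing among such candidate values of $\lambda$ is where cross-validation comes in; that step is not shown here.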

Lasso Regression

  • Lasso regression: LR with an absolute-value (L1) penalty on the weights

  • Improves feature selection by adding an L1 penalty (L1 norm):
    $J(w) = MSE + \lambda \sum |w|$

  • Forces some coefficients to be exactly zero

  • When a feature is weakly related to the target variable, Lasso pulls its coefficient to zero faster than ridge regression

  • Best used when you want to automate feature selection and obtain a sparse model (most of the weights are zero, meaning only a small subset of features contributes significantly to the model's prediction)
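One common way to minimize the Lasso cost is coordinate descent with soft-thresholding; the notes do not prescribe a solver, so this choice is an assumption. The sketch assumes NumPy, omits the bias term, and uses hypothetical data in which the second feature is only weakly related to the target. The function names are also hypothetical.

```python
import numpy as np

def soft_threshold(rho, t):
    """Shrink rho toward zero by t, returning exactly 0 when |rho| <= t."""
    return np.sign(rho) * max(abs(rho) - t, 0.0)

def lasso_fit(X, Y, lam, sweeps=200):
    """Coordinate descent for (1/n)*||Y - XW||^2 + lam*sum(|W|), with no bias term."""
    n, d = X.shape
    W = np.zeros(d)
    for _ in range(sweeps):
        for j in range(d):
            # Residual with feature j's current contribution removed.
            r = Y - X @ W + X[:, j] * W[j]
            rho = X[:, j] @ r / n
            z = X[:, j] @ X[:, j] / n
            # Exact one-dimensional minimizer for coordinate j.
            W[j] = soft_threshold(rho, lam / 2.0) / z
    return W

# Hypothetical data: y = 3*x1 + 0.1*x2, so x2 is a weak feature.
X = np.array([[1.0, 1.0],
              [1.0, -1.0],
              [-1.0, 1.0],
              [-1.0, -1.0]])
Y = 3.0 * X[:, 0] + 0.1 * X[:, 1]

W = lasso_fit(X, Y, lam=1.0)
print(W)  # first weight shrinks to about 2.5, second weight is exactly 0
```

The weak feature is dropped entirely, while the strong feature's weight is shrunk but kept, which is the sparse behaviour described above.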