Regression Notes

Regression

Regression Overview

Regression is an important machine learning problem that serves as a good starting point for understanding the field.
Regression models are used to predict a continuous value.
It is a statistical method to model the relationship between a dependent variable (target) and one or more independent variables (predictors).
Example: Predicting house prices based on features like size.

Types of Regression

Simple Linear Regression
Multiple Linear Regression
Polynomial Regression
Logistic Regression
Ridge Regression
Lasso Regression
Elastic Net Regression
Support Vector Regression
Decision Tree Regression
And many more…

Linear Regression

Useful for finding the relationship between two continuous variables: a predictor (independent) and a response (dependent) variable.
It looks for a statistical, not deterministic, relationship.

Deterministic vs. Statistical Relationship

Deterministic Relationship: One variable can be expressed exactly in terms of the other. For example, converting temperature from Celsius to Fahrenheit.
Statistical Relationship: The relationship is not exact, so one variable cannot be determined precisely from the other. For example, the relationship between height and weight.

Core Idea

Obtain a straight line or hyperplane that best fits a set of data points.
The best-fit line minimizes the total prediction error. The error for each point (its residual error) is the vertical distance between the point and the regression line.

Example: Cricket Chirps and Temperature

Crickets chirp more frequently on hotter days.
Goal: Learn a model to predict this relationship using collected data on chirps-per-minute and temperature.

Visualizing the Data

Plot the data to examine the relationship between chirps and temperature.
Determine if the relationship appears linear.

Regression Line/Hyperplane

Represents the linear relationship between the variables.
Data points are scattered around this line.
Residual error is the difference between actual and predicted values.

Equation of a Linear Model

In machine learning, the equation is written as: $y' = w_1 x_1 + b$
- $y'$ is the predicted label (output).
- $w_1$ is the weight of feature 1 (same as the slope).
- $x_1$ is feature 1 (known input).
- $b$ is the bias (y-intercept), sometimes referred to as $w_0$

Using the Model for Prediction

To predict the temperature ( $y'$ ) for a new chirps-per-minute value ( $x_1$ ), substitute the value into the model.

Models with Multiple Features

More sophisticated models can use multiple features, each with a separate weight ($w_1, w_2$, etc.).
Example: Predicting house price based on living area, number of bedrooms, and number of bathrooms:
$y' = w_1 x_1 + w_2 x_2 + w_3 x_3 + b$
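As a minimal sketch of the equation above, the prediction is a weighted sum of the features plus the bias. The function name `predict` and all of the numbers below are hypothetical, chosen only to illustrate the arithmetic.

```python
def predict(features, weights, bias):
    """Return y' = w1*x1 + w2*x2 + ... + b for a single example."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    return sum(w * x for w, x in zip(weights, features)) + bias


# Hypothetical house: living area (square metres), bedrooms, bathrooms.
house = [120.0, 3.0, 2.0]
# Hypothetical weights and bias; a trained model would learn these from data.
weights = [1500.0, 10000.0, 5000.0]
bias = 20000.0

price = predict(house, weights, bias)
print(price)
```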

Training and Loss

Training a Model

Training involves learning (determining) good values for all the weights and the bias from labeled examples.
The goal is to represent the function/hypothesis in a computer such that input x is well mapped to output y.
In machine learning notation, the hypothesis is written as $h_\theta(x) = w_1 x_1 + b$, where $\theta$ denotes the parameters (the weights and the bias).

Empirical Risk Minimization

In supervised learning, the algorithm builds a model (hypothesis/function) by examining many examples (x inputs and corresponding y outputs).
It attempts to find a model that minimizes loss; this process is called empirical risk minimization.

Loss Function

Measures the difference between actual values and predicted values.
Loss is the penalty for a bad prediction.
If the model's prediction is perfect, the loss is zero; otherwise, the loss is greater.

Cost Function

The process of minimizing the loss involves choosing the optimal parameters W and b.
The goal is to find a set of weights and biases that have low loss, on average, across all examples.
The function that measures the average loss over all training data points is defined as the Cost Function.
Figure: arrows represent loss and blue lines represent predictions; models with high loss have longer arrows (larger errors).

Loss Function and Normal Distribution

Linear regression assumes the actual output y is normally distributed around the predicted value $h_\theta(x)$ (or y').
The loss function can be obtained by using Maximum Likelihood Estimation (MLE).
Conceptually, maximizing the likelihood is equivalent to minimizing the cost function defined by the mean square error (MSE).

Squared Loss (L2 Loss)

The squared loss for a single example is: $(y - y')^2$
Why squared?
- Emphasizes larger errors (big mistakes are penalized more).
- Ensures errors do not cancel out (negative errors won’t offset positive ones).

Mean Square Error (MSE)

MSE is the average squared loss per example over the whole dataset.
To calculate MSE, sum up all the squared losses for individual examples and then divide by the number of examples: $MSE = \frac{1}{n} \sum_{(x,y) \in D} (y - h_\theta(x))^2$
- (x, y) is an example in the dataset D.
- x is the set of features.
- y is the example's label.
- $h_\theta(x)$ is a function of the weights and bias in combination with the set of features x.
- D is a data set containing many labeled examples (x, y) pairs.
- n is the number of examples in D.
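The MSE definition above can be sketched in a few lines of Python. The hypothesis `h` and the three data points are hypothetical and stand in for a trained model and a real dataset D.

```python
def squared_loss(y, y_pred):
    """Squared (L2) loss for a single example."""
    return (y - y_pred) ** 2


def mean_squared_error(examples, h):
    """Average squared loss over a list of (x, y) examples for hypothesis h."""
    if not examples:
        raise ValueError("examples must not be empty")
    return sum(squared_loss(y, h(x)) for x, y in examples) / len(examples)


def h(x):
    # Hypothetical hypothesis with weight 2.0 and bias 1.0.
    return 2.0 * x + 1.0


# Hypothetical dataset D of (x, y) pairs.
data = [(0.0, 1.0), (1.0, 3.5), (2.0, 4.0)]
mse = mean_squared_error(data, h)
print(mse)
```

The errors here are 0, 0.5 and -1, so the larger error contributes four times as much to the sum as the smaller one.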

Factor in Cost Function

Many formulations of the cost function include a factor of 1/2 to simplify optimization:
$MSE = \frac{1}{2n} \sum_{(x,y) \in D} (y - h_\theta(x))^2$
Reasons:
- The factor 1/2 cancels out the factor 2 in gradient derivation (simplifies gradient descent updates).
- 1/2 is a constant multiplier; it does not change the optimal value but just scales the cost function.
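The cancellation can be seen in a short derivation, sketched here for the single-feature model $h_\theta(x) = w_1 x_1 + b$ (an assumption made only to keep the algebra short):
$J(w_1, b) = \frac{1}{2n} \sum_{(x,y) \in D} (y - (w_1 x_1 + b))^2$
$\frac{\partial J}{\partial w_1} = \frac{1}{2n} \sum_{(x,y) \in D} 2\,(y - h_\theta(x))\,(-x_1) = -\frac{1}{n} \sum_{(x,y) \in D} (y - h_\theta(x))\,x_1$
$\frac{\partial J}{\partial b} = \frac{1}{2n} \sum_{(x,y) \in D} 2\,(y - h_\theta(x))\,(-1) = -\frac{1}{n} \sum_{(x,y) \in D} (y - h_\theta(x))$
The 2 produced by differentiating the square cancels the 1/2, leaving gradients with no stray constant.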

Reducing Loss (Normal Equation)

To train a model, we need a good way to reduce the model’s loss.
One option is to solve for the weights directly with matrix algebra, i.e., the ordinary least squares (OLS) solution.
We want the weight values that best fit the data, i.e., the W that minimizes the MSE.
The linear regression model is $h(x) = XW + b$; we assume b is absorbed into W by adding a column of ones to the feature matrix X, so that $h(x) = XW$.
Setting the derivative of the MSE with respect to W to zero gives $W = (X^TX)^{-1} X^T Y$
OLS needs no iteration, but it is computationally expensive because of the matrix inversion.
$X^TX$ has one row and one column per feature, so with many features it becomes large and costly to invert; with many examples, forming it is also costly.
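A minimal NumPy sketch of the OLS solution follows. The function name `fit_normal_equation` and the small dataset are hypothetical; the data were generated from y = 1 + 2*x1 + 3*x2 so the result is easy to check. The sketch solves the linear system $(X^TX)W = X^TY$ with `np.linalg.solve` instead of forming the inverse explicitly, which is a common numerical choice and not something the notes prescribe.

```python
import numpy as np


def fit_normal_equation(features, targets):
    """Return [b, w1, ..., wd] that minimizes the MSE via the normal equation."""
    X = np.asarray(features, dtype=float)
    if X.ndim == 1:
        X = X.reshape(-1, 1)  # treat a flat list as a single feature column
    y = np.asarray(targets, dtype=float)
    # Prepend a column of ones so the bias is learned as the first weight.
    X = np.column_stack([np.ones(len(X)), X])
    # Solve (X^T X) W = X^T y; assumes X^T X is invertible.
    return np.linalg.solve(X.T @ X, X.T @ y)


# Hypothetical data generated from y = 1 + 2*x1 + 3*x2 (no noise).
features = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]]
targets = [1, 3, 4, 6, 8]

W = fit_normal_equation(features, targets)
print(W)  # approximately [1, 2, 3]: bias first, then one weight per feature
```

The same function handles one feature or several, since only the number of columns in X changes.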

Reducing Loss - Iterative Approach

An iterative approach is one widely used method for reducing loss and is as easy and efficient as walking down a hill.
You'll start with a wild guess ("The value of $w_1$ is 0.") and wait for the system to tell you what the loss is.
Then, you'll try another guess ("The value of $w_1$ is 0.5.") and see what the loss is.
The real trick is trying to find the best possible model as efficiently as possible.
A Machine Learning model is trained by starting with an initial guess for the weights and bias and iteratively adjusting those guesses until learning the weights and bias with the lowest possible loss.
Usually, you iterate until overall loss stops changing, or at least changes extremely slowly.
When that happens, we say that the model has converged.

How to Reduce Loss

The derivative of $(y - y')^2$ with respect to the weights and bias tells us how the loss changes for a given example.
The squared loss is simple to differentiate and convex.
So we repeatedly take small steps in the direction that reduces the loss.
We call these gradient steps (strictly, they are negative gradient steps).
This strategy is called gradient descent.
W is updated iteratively until it reaches the value that minimizes the cost function.
The learning rate controls how much W changes in each step of gradient descent.
- This value can be tuned.
- If it is too small, convergence is slow.
- If it is too large, the update may overshoot (step past) the minimum.
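The update loop described above can be sketched for a single-feature model as follows. The function name, the learning rate of 0.05, the step count and the data are all hypothetical choices for illustration. The gradients are those of the cost with the 1/2 factor, $\frac{1}{2n} \sum (y - y')^2$.

```python
def gradient_descent(xs, ys, learning_rate=0.05, steps=5000):
    """Fit y' = w*x + b by batch gradient descent on the halved MSE."""
    w, b = 0.0, 0.0  # initial guess
    n = len(xs)
    for _ in range(steps):
        # Prediction error (y' - y) for every example under the current w, b.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        # Step against the gradient, scaled by the learning rate.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Hypothetical data generated from y = 2x + 1 (no noise).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

w, b = gradient_descent(xs, ys)
print(round(w, 4), round(b, 4))
```

This sketch runs a fixed number of steps for simplicity; a practical loop would more likely stop once the loss changes very little between iterations.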

Linear Regression Summarized

Hypothesis: $y' = w_1 x_1 + w_2 x_2 + \dots + b$ or, with $x_0 = 1$ and $w_0 = b$, $y' = w_0 x_0 + w_1 x_1 + w_2 x_2 + \dots + w_n x_n$
Parameters: $w_0, w_1, w_2, \dots, w_n$
Cost Function: $J(w_0, w_1, \dots, w_n) = \frac{1}{2n} \sum_{(x,y) \in D}(y - y')^2$
Where J(W) is the MSE with the 1/2 factor
Goal: minimize $J(w_0, w_1, \dots, w_n)$

Polynomial Regression (Extension of LR)

Polynomial regression (PR): Extension of LR
Use it when the data are too complex to fit with a straight line (the original linear regression model would clearly underfit).
The model adds powers of the feature: $y' = w_0 + w_1 x + w_2 x^2 + \dots + w_n x^n$. Here, the nth power indicates the degree of the polynomial.
PR is still a type of linear regression: although its features are non-linear in x, the model is linear in its weight parameters.
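Because the model stays linear in the weights, polynomial regression can be fitted with ordinary least squares on expanded features. The sketch below assumes NumPy; the function name `fit_polynomial` and the data (generated from y = 1 - 2x + x^2) are hypothetical.

```python
import numpy as np


def fit_polynomial(x, y, degree):
    """Return weights [w0, w1, ..., w_degree] of a least-squares polynomial fit."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Expand x into columns x^0, x^1, ..., x^degree (x^0 acts as the bias column).
    X = np.vander(x, degree + 1, increasing=True)
    # Linear least squares on the expanded features.
    weights, *_ = np.linalg.lstsq(X, y, rcond=None)
    return weights


# Hypothetical data generated from y = 1 - 2x + x^2 (no noise).
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [9.0, 4.0, 1.0, 0.0, 1.0]

weights = fit_polynomial(x, y, degree=2)
print(weights)  # approximately [1, -2, 1]
```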

Flexibility and Overfitting

High-order polynomials are very flexible and can follow the training data closely.
Overfitting is typically associated with very large estimated weights.

LR: Example in calculation

x    y
1    2
2    2.8
3    3.6
4    4.5
5    5.1

Question:

Find the values of W and b in the hypothesis $h(x) = Wx + b$ using the least squares method
Compute $W = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$
Compute $b = \bar{y} - W\bar{x}$
Write the final hypothesis equation
Make predictions using the hypothesis, i.e., use the calculated W and b to predict y values for given x
Compute the MSE: $MSE = \frac{1}{n} \sum (y_i - h(x_i))^2$

Example Calculation

Compute the means:
- Means for x and y variables
  $\bar{x} = \frac{1 + 2 + 3 + 4 + 5}{5} = 3$
  $\bar{y} = \frac{2 + 2.8 + 3.6 + 4.5 + 5.1}{5} = 3.6$
Compute W (Slope)
- Plug the means into the slope equation: the numerator is $\sum (x_i - \bar{x})(y_i - \bar{y}) = 3.2 + 0.8 + 0 + 0.9 + 3.0 = 7.9$ and the denominator is $\sum (x_i - \bar{x})^2 = 4 + 1 + 0 + 1 + 4 = 10$
$W = \frac{7.9}{10} = 0.79$
Compute b (intercept)
- Compute the Intercept b
  $b = \bar{y} - W \bar{x} = 3.6 - (0.79 \times 3) = 1.23$
Write the final hypothesis
- Complete Hypothesis Function
  $h(x) = 0.79x + 1.23$
Make Prediction
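- Plug values of x into the linear regression equation to obtain the predicted values of y: $h(1) = 2.02$, $h(2) = 2.81$, $h(3) = 3.60$, $h(4) = 4.39$, $h(5) = 5.18$
Use MSE (without 1/2 factor) for evaluation
- Calculate the mean square error of the hypothesis: $MSE = \frac{0.0004 + 0.0001 + 0 + 0.0121 + 0.0064}{5} = 0.0038$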
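The hand calculation can be checked with a short script that follows the same steps on the five data points from the table. Variable names are illustrative; rounded to the precision used in the notes, the script gives W = 0.79, b = 1.23 and an MSE of about 0.0038.

```python
# Data from the worked example.
xs = [1, 2, 3, 4, 5]
ys = [2, 2.8, 3.6, 4.5, 5.1]
n = len(xs)

# Step 1: means of x and y.
x_mean = sum(xs) / n
y_mean = sum(ys) / n

# Step 2: slope W = sum((x - x_mean) * (y - y_mean)) / sum((x - x_mean)^2).
numerator = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
denominator = sum((x - x_mean) ** 2 for x in xs)
W = numerator / denominator

# Step 3: intercept b = y_mean - W * x_mean.
b = y_mean - W * x_mean

# Step 4: predictions from h(x) = W*x + b.
predictions = [W * x + b for x in xs]

# Step 5: MSE without the 1/2 factor.
mse = sum((y - p) ** 2 for y, p in zip(ys, predictions)) / n

print(round(W, 2), round(b, 2))
print([round(p, 2) for p in predictions])
print(round(mse, 4))
```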

What if you have two or more x features?

How to fit a model with multiple features and use it for prediction

Example in calculation

Calculating the weight $(W = (X^TX)^{-1} X^T Y)$ with multiple features requires multiple steps that are laid out within the example. Most steps are best done with computation, not by hand.

Ridge & Lasso Regression

Ridge and Lasso are further extensions of LR that aim to prevent overfitting.
They are useful when a dataset has many potential explanatory variables.
Example: a dataset used by a financial institution to predict which candidates are likely to make their loan payments.
Some variables may be collinear.
It may not be clear which variables are significant predictors of the target variable.

Ridge Regression

Ridge regression prevents overfitting by adding an L2 penalty (the squared L2, or Euclidean, norm of the weights) to the cost function: $J(w) = MSE + \lambda \sum w^2$
- $\lambda$ is the penalty parameter, a constant chosen in advance.
- If the weights take on large values, the cost function is penalized.
- Equivalently, for some constant $c > 0$, the weights are constrained so that $\sum w^2 < c$, i.e., the sum of squared coefficients is bounded.

Shrinking Coefficients

The added penalty shrinks the coefficients but does not force them to zero (all features are retained).
Best used when there are many correlated independent variables (multicollinearity).
Cross-validation is often used to choose $\lambda$: we select the $\lambda$ that yields the smallest cross-validation prediction error.
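The shrinking effect can be sketched with the closed-form ridge solution. This sketch rests on several assumptions that are not in the notes: the features and target are centered so no bias is needed, the MSE is taken without the 1/2 factor, and the function name `fit_ridge` and the data are hypothetical. Under those assumptions, setting the gradient of $MSE + \lambda \sum w^2$ to zero gives $(X^TX + n\lambda I)W = X^TY$, where the n comes from the 1/n in the MSE.

```python
import numpy as np


def fit_ridge(X, y, lam):
    """Minimize (1/n) * ||y - Xw||^2 + lam * sum(w^2); assumes centered data, no bias."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, d = X.shape
    # Closed form: (X^T X + n * lam * I) w = X^T y.
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)


# Hypothetical centered data with two nearly collinear features; y = 2 * x1.
X = [[-2.0, -2.1], [-1.0, -0.9], [0.0, 0.0], [1.0, 0.9], [2.0, 2.1]]
y = [-4.0, -2.0, 0.0, 2.0, 4.0]

w_ols = fit_ridge(X, y, lam=0.0)    # no penalty: ordinary least squares
w_ridge = fit_ridge(X, y, lam=1.0)  # penalized weights

print(w_ols)
print(w_ridge)
```

With the penalty, both weights are smaller in magnitude than the unpenalized weight of 2, and neither is exactly zero, which matches the description above.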

Lasso Regression

Lasso regression: LR with an absolute-value (L1) penalty on the weights.
It performs feature selection by adding an L1 penalty (L1 norm):
$J(w) = MSE + \lambda \sum |w|$
It can shrink some coefficients to exactly zero.
When a feature is only weakly related to the target variable, Lasso pulls its weight to zero faster than ridge regression does.
Best used when you want to automate feature selection and obtain a sparse model, i.e., one in which most weights are zero and only a small subset of features contributes significantly to the prediction.
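The difference between the two penalties can be illustrated in a deliberately simplified setting. The assumptions here are not from the notes: a single feature, no bias, the MSE without the 1/2 factor, and the feature scaled so that $\frac{1}{n}\sum x^2 = 1$. Writing $\rho = \frac{1}{n}\sum x y$ for the strength of the feature's relationship with the target, the minimizers have closed forms: Lasso gives $\mathrm{sign}(\rho)\max(|\rho| - \lambda/2, 0)$ and ridge gives $\rho / (1 + \lambda)$. The function names are hypothetical.

```python
def lasso_weight(rho, lam):
    """One-feature Lasso solution (soft-thresholding) under the stated assumptions."""
    shrunk = max(abs(rho) - lam / 2.0, 0.0)
    return shrunk if rho >= 0 else -shrunk


def ridge_weight(rho, lam):
    """One-feature ridge solution under the same assumptions."""
    return rho / (1.0 + lam)


lam = 0.5
# rho values from a weak to a strong relationship (hypothetical).
for rho in (0.1, 0.2, 1.0):
    print(rho, lasso_weight(rho, lam), ridge_weight(rho, lam))
```

For the weak features (rho of 0.1 and 0.2) the Lasso weight is exactly zero while the ridge weight is only scaled down; for the strong feature both remain non-zero.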