ML02-Linear regression
Page 1: Introduction to Linear Regression
Oslo Metropolitan University
Linear Regression
Instructor: Raju Shrestha
Page 2: Overview of Linear Regression
Definition: Linear regression is a supervised machine learning method used for predicting a continuous output value by establishing a linear relationship between input features and the output.
Example: Predicting house prices based on house size (in square meters) and price.
Page 3: Linear Regression Model
Input Feature:
𝑥 = house size (sqm)
Model Equation:
[ y' = \hat{h}(x) = wx + b ]
Where:
𝑤 = parameter (weight)
𝑏 = bias
Hypothesis Function:
[ \hat{h}(x) = w_0 + w_1x ]
Output:
𝑦 = real/observed house price
𝑦' = predicted house price
Visualization: Linear regression model represents the relationship between house size and prices.
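As a sketch, the single-feature model above is just a plain function; the weight and bias values here (w = 3000, b = 50000) are made-up illustration numbers, not from the lecture:

```python
# Linear model y' = w*x + b for a single feature (house size in sqm).
# w and b are illustrative values, not fitted parameters.
def predict_price(x, w=3000.0, b=50000.0):
    """Predicted house price for size x (sqm)."""
    return w * x + b

print(predict_price(50))  # 3000*50 + 50000 = 200000.0
```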
Page 4: Cost Function
Objective: Train the model using a set of labeled training data (m examples).
Goal: Determine model parameters 𝑤 (or 𝑤 and 𝑏) such that the estimated output 𝑦' is close to the real value 𝑦 for training samples (good fit).
Loss Function: A commonly used loss function is the least-squares error, given by:
[ J(w) = \frac{1}{2m} \sum_{i=1}^{m} (\hat{h}(x^{(i)}) - y^{(i)})^2 ]
Page 5: Optimization in Linear Regression
Goal: Minimize the loss function J(w):
[ \min_{w} J(w) ]
Types of Optimization Methods:
Iterative optimization (e.g. gradient descent)
Non-iterative method (using normal equations)
Page 6: Gradient Descent (GD) Optimization
Initialization: Assume initial parameter values, say 𝑤 = 0 for simplicity.
Gradient Calculation:
[ \nabla J = 0 ] at a minimum; ( \nabla J > 0 ) indicates the function is increasing; ( \nabla J < 0 ) indicates it is decreasing.
Page 7: Gradient Descent Algorithm
Process:
Start with initial weights (random or zeros).
Update weights iteratively to reduce 𝐽(𝑤₀, 𝑤₁).
Update Rule:
[ w^j := w^j - \alpha \nabla_{w^j} J(w) ]
Where 𝛼 is the learning rate.
An iteration over the full training set is referred to as an epoch.
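The update rule above can be sketched for the two-parameter model y' = w₀ + w₁x; the data and hyperparameters below are illustrative:

```python
import numpy as np

def gradient_descent(x, y, alpha=0.05, epochs=2000):
    """Batch gradient descent for y' = w0 + w1*x (toy sketch)."""
    m = len(y)
    w0, w1 = 0.0, 0.0                 # start from zeros
    for _ in range(epochs):
        err = (w0 + w1 * x) - y       # prediction error on all m samples
        grad_w0 = np.sum(err) / m     # dJ/dw0
        grad_w1 = np.sum(err * x) / m # dJ/dw1
        w0 -= alpha * grad_w0         # update rule: w := w - alpha * grad
        w1 -= alpha * grad_w1
    return w0, w1

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0                     # ground truth: w1 = 2, w0 = 1
w0, w1 = gradient_descent(x, y)
```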
Page 8: Vectorization and Notation
Efficiency: Use linear algebra tools (Python, Matlab) to compute equations efficiently by vectorizing.
Matrix Equation Formulation:
The predicted outputs can be expressed as:
[ \mathbf{y'} = \mathbf{Xw} ]
Where
𝐗 is the feature matrix (one row per training example, with a leading column of 1s for the bias term),
𝒘 is the weight vector,
𝒚' is the vector of predicted outputs.
Page 9: Loss Function in Vectorized Form
Loss Function:
[ J(w) = \frac{1}{2m} | \mathbf{Xw} - \mathbf{y} |^2 ]
Gradient/Derivative:
[ \frac{\partial}{\partial w} J(w) = \frac{1}{m} \mathbf{X}^{T}(\mathbf{Xw} - \mathbf{y}) ]
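Both vectorized formulas translate directly to NumPy; the toy design matrix below (with a leading column of 1s for the bias) is made up for illustration:

```python
import numpy as np

def loss(w, X, y):
    """J(w) = (1/2m) * ||Xw - y||^2 in vectorized form."""
    m = len(y)
    r = X @ w - y
    return (r @ r) / (2 * m)

def gradient(w, X, y):
    """dJ/dw = (1/m) * X^T (Xw - y)."""
    m = len(y)
    return X.T @ (X @ w - y) / m

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])  # first column: bias
y = np.array([3.0, 5.0, 7.0])                       # y = 1 + 2x
w = np.array([1.0, 2.0])                            # exact-fit weights
print(loss(w, X, y))      # exact fit -> 0.0
print(gradient(w, X, y))  # zero gradient at the minimum
```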
Page 10: Gradient Descent in Plots
Visualization:
Contour plots represent the loss function conveniently, illustrating current weight values and local optima.
Page 11: Variants of Gradient Descent
Types:
Batch Gradient Descent: Uses the entire training set in each iteration.
Mini-batch Gradient Descent: Updates parameters using a small random subset (mini-batch) of training examples in each iteration.
Page 12: Further Variants of GD
Stochastic Gradient Descent (SGD):
Mini-batch with batch_size = 1, updates parameters on each example.
Trade-off: Between accuracy of gradient and computational complexity for each parameter update.
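A minimal sketch of mini-batch gradient descent, where batch_size = len(y) recovers batch GD and batch_size = 1 recovers SGD; the data and hyperparameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def minibatch_gd(X, y, alpha=0.05, epochs=1000, batch_size=2):
    """Mini-batch GD on the vectorized linear model y' = Xw."""
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(epochs):
        idx = rng.permutation(m)                  # shuffle each epoch
        for start in range(0, m, batch_size):
            b = idx[start:start + batch_size]     # indices of one mini-batch
            grad = X[b].T @ (X[b] @ w - y[b]) / len(b)
            w -= alpha * grad                     # update per mini-batch
    return w

X = np.column_stack([np.ones(4), np.array([1.0, 2.0, 3.0, 4.0])])
y = np.array([3.0, 5.0, 7.0, 9.0])                # y = 1 + 2x
w = minibatch_gd(X, y)
```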
Page 13: Implementing GD with Scikit Learn
Code Snippet:
from sklearn.linear_model import SGDRegressor
learning_rate = 0.01
max_iters = 10000
model = SGDRegressor(eta0=learning_rate, learning_rate='constant', max_iter=max_iters)
model.fit(X, y)
Page 14: Advanced Optimization
Sophisticated Algorithms:
Examples include CG (Conjugate Gradient), BFGS (Broyden-Fletcher-Goldfarb-Shanno), L-BFGS (Limited-memory BFGS).
Advantages:
No manual selection of learning rate required, often faster convergence.
Disadvantages:
More complex implementations.
Page 15: Advanced Optimization Implementation
Code Example:
import numpy as np
from scipy.optimize import minimize
X_with_1 = lambda X: np.column_stack([np.ones(X.shape[0]), X])
predict = lambda W, X: X.dot(W)
loss_fn = lambda W, X, y: np.sum((predict(W, X)-y)**2)/(2*len(y))
gradient_fn = lambda W, X, y: X.T.dot(predict(W, X)-y)/len(y)
W = np.zeros(2)
result = minimize(loss_fn, W, method='BFGS', jac=gradient_fn, args=(X_with_1(X), y))
y_test = predict(X_with_1(X_test), result.x)
Page 16: Choosing the Learning Rate (𝜶)
Testing GD Effectiveness:
Plot Observations: Examine loss vs. number of iterations to ensure convergence.
Convergence Behavior:
Too high a learning rate may prevent the loss from decreasing at all and can make it diverge.
Sufficiently small rates ensure the loss decreases at each step, but convergence may be slow.
Page 17: Learning Rate Recommendations
Appropriate Values: Can try a range of values like 0.001, 0.01, 0.1, 1, etc.
Impact Analysis:
A small 𝛼 slows convergence.
A large 𝛼 can cause the loss to diverge.
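One way to see this trade-off is to run batch GD with several learning rates on a toy dataset and compare the final losses within a fixed iteration budget (data and rates are illustrative):

```python
import numpy as np

X = np.column_stack([np.ones(4), np.array([1.0, 2.0, 3.0, 4.0])])
y = np.array([3.0, 5.0, 7.0, 9.0])  # y = 1 + 2x
m = len(y)

def final_loss(alpha, iters=100):
    """Run batch GD with learning rate alpha and return the final loss."""
    w = np.zeros(2)
    for _ in range(iters):
        w -= alpha * X.T @ (X @ w - y) / m
    r = X @ w - y
    return (r @ r) / (2 * m)

# Larger (but still stable) rates converge further in the same budget;
# rates well above the stable range would make the loss blow up instead.
for alpha in [0.001, 0.01, 0.1]:
    print(f"alpha={alpha}: J={final_loss(alpha):.4f}")
```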
Page 18: Training Workflow
Data Sampler: Prepare training data (features X; labels y).
Train Model (learner): Update parameters.
Reporting: Iterate and evaluate based on output.
Decision Point: Determine if more training is required.
Page 19: Linear Regression with Multiple Variables
Model Structure:
[ \hat{h}(x) = w_0 + w_1 x_1 + w_2 x_2 + ... + w_n x_n ]
Each 𝑤ᵢ corresponds to an input feature; 𝑤₀ is the bias.
Example Context:
Variables such as size (sqm), number of floors, number of bedrooms, and age, affect price prediction.
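For illustration, prediction with multiple features is a single matrix-vector product; all numbers below (sizes, bedroom counts, weights) are made up:

```python
import numpy as np

# Rows: [1 (bias), size in sqm, bedrooms]; values are illustrative.
X = np.array([[1.0,  50.0, 1.0],
              [1.0, 100.0, 2.0],
              [1.0, 150.0, 3.0]])
w = np.array([20000.0, 2500.0, 10000.0])  # w0, w_size, w_bedrooms (made up)

prices = X @ w  # one predicted price per row
print(prices)   # e.g. first row: 20000 + 2500*50 + 10000*1 = 155000.0
```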
Page 20: Feature Scaling
Importance of Scaling: Features can vary widely in range (e.g., age 0-120, income 50,000-1,000,000).
Normalization: Techniques to scale features (0-1 range or -1 to +1 range).
Page 21: Common Normalization Methods
Min-Max Normalization: Scale values to a [0, 1] range.
Standardization/Z-score: Normalize features to zero mean and unit variance (the resulting range is not bounded).
Formulated as:
[ x' = \frac{x - \mu}{\sigma} ]
Where 𝜇 = mean and 𝜎 = standard deviation.
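Both methods are one-liners in NumPy; the age values are illustrative:

```python
import numpy as np

ages = np.array([20.0, 40.0, 60.0, 80.0])  # made-up feature values

# Min-max normalization: scale into [0, 1]
min_max = (ages - ages.min()) / (ages.max() - ages.min())

# Z-score standardization: x' = (x - mu) / sigma
z = (ages - ages.mean()) / ages.std()

print(min_max)            # endpoints map to 0 and 1
print(z.mean(), z.std())  # ~0.0 and ~1.0 after standardization
```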
Page 22: Polynomial Regression
Feature Generation: New features can be derived from existing ones.
Polynomial Regression Models:
Example formulation:
[ \hat{h}(x) = w_0 + w_1x + w_2x^2 + w_3x^3 ]
Considerations: When and why to employ polynomial regression.
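A quick sketch of polynomial regression: build powers of x as new features, then fit them linearly (here with np.linalg.lstsq rather than gradient descent); the cubic ground truth is made up:

```python
import numpy as np

def poly_features(x, degree=3):
    """Build columns [1, x, x^2, ..., x^degree] from a single feature."""
    return np.column_stack([x ** d for d in range(degree + 1)])

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x - 0.5 * x ** 3        # illustrative cubic target

X = poly_features(x, degree=3)
w, *_ = np.linalg.lstsq(X, y, rcond=None)  # ordinary least squares
print(np.round(w, 3))                      # recovers [1, 2, 0, -0.5]
```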
Page 23: Non-iterative Optimization: Normal Equation
Analytical Approach: Solve for parameters using linear algebra.
Objective:
[ J(w) = \frac{1}{2m} | \mathbf{Xw} - \mathbf{y} |^2 ]
Derive Parameters: Solve normal equations for optimal weights.
Page 24: Solving Normal Equations
Solution from Derivation:
[ w = (X^TX)^{-1}X^Ty ]
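This closed-form solution is a one-liner in NumPy; solving the linear system with np.linalg.solve avoids explicitly forming the inverse (the toy data is illustrative):

```python
import numpy as np

X = np.column_stack([np.ones(4), np.array([1.0, 2.0, 3.0, 4.0])])
y = np.array([3.0, 5.0, 7.0, 9.0])  # y = 1 + 2x

# Normal equation: solve (X^T X) w = X^T y instead of inverting X^T X
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)  # recovers [1, 2]
```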
Page 25: Gradient Descent vs. Normal Equation
Comparison:
Gradient Descent: Requires choice of learning rate 𝛼, involves iterations.
Normal Equation: Direct calculation with no learning rate or iterations, but computing (XᵀX)⁻¹ is roughly O(n³) in the number of features n, which is costly for large n.
Note: XᵀX may be non-invertible (e.g., linearly dependent features, or more features than training samples).
Page 26: Implementation Strategies
Recommendation: Writing code for learning clarity vs. utilizing ML libraries like Scikit-learn.
Page 27: Example Code with Scikit-learn
Build Model: Using the normal equation and gradient descent methods with Scikit-learn.
from sklearn.linear_model import LinearRegression
model = LinearRegression()
from sklearn.linear_model import SGDRegressor
lr = 0.01
max_iter = 10000
model = SGDRegressor(eta0=lr, learning_rate='constant', max_iter=max_iter)
Page 28: Model Training and Testing
Training the Model:
model.fit(X, y)
Testing Prediction:
Example query:
X_test = [[50, 1], [150, 3]]
y_test = model.predict(X_test)
Output Display: Formatted predictions from test data.
Page 29: Demo Resources
Demo Files:
RSLinearRegression-GD.py: Training using GD and exploring different learning rates.
RSLinearRegression-AdvancedOptm.py: Using BFGS optimized methods.
Access: Shared Google Drive link available in Canvas course page.