ML02-Linear regression
Page 1: Introduction to Linear Regression
Oslo Metropolitan University
Linear Regression
Instructor: Raju Shrestha
Page 2: Overview of Linear Regression
Definition: Linear regression is a supervised machine learning method used for predicting a continuous output value by establishing a linear relationship between input features and the output.
Example: Predicting house prices based on house size (in square meters) and price.
Page 3: Linear Regression Model
Input Feature:
𝑥 = house size (sqm)
Model Equation:
[ y' = \hat{h}(x) = wx + b ]
Where:
𝑤 = parameter (weight)
𝑏 = bias
Hypothesis Function:
[ \hat{h}(x) = w_0 + w_1x ]
Output:
𝑦 = real/observed house price
𝑦' = predicted house price
Visualization: Linear regression model represents the relationship between house size and prices.
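As a sketch, the single-feature model above is just a plain function; the weight and bias values here (w = 3000, b = 50000) are made-up illustration numbers, not from the lecture:

```python
# Linear model y' = w*x + b for a single feature (house size in sqm).
# w and b are illustrative values, not fitted parameters.
def predict_price(x, w=3000.0, b=50000.0):
    """Predicted house price for size x (sqm)."""
    return w * x + b

print(predict_price(50))  # 3000*50 + 50000 = 200000.0
```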
Page 4: Cost Function
Objective: Train the model using a set of labeled training data (m examples).
Goal: Determine model parameters 𝑤 (or 𝑤 and 𝑏) such that the estimated output 𝑦' is close to the real value 𝑦 for training samples (good fit).
Loss Function: A commonly used loss function is the least-squares error, given by:
[ J(w) = \frac{1}{2m} \sum_{i=1}^{m} (\hat{h}(x^{(i)}) - y^{(i)})^2 ]
Page 5: Optimization in Linear Regression
Goal: Minimize the loss function J(w):
[ \min_{w} J(w) ]
Types of Optimization Methods:
Iterative optimization (e.g. gradient descent)
Non-iterative method (using normal equations)
Page 6: Gradient Descent (GD) Optimization
Initialization: Assume initial parameter values, say 𝑤 = 0 for simplicity.
Gradient Calculation:
[ \nabla J = 0 ] at a minimum; ( \nabla J > 0 ) indicates the function is increasing; ( \nabla J < 0 ) indicates it is decreasing.
Page 7: Gradient Descent Algorithm
Process:
Start with initial weights (random or zeros).
Update weights iteratively to reduce 𝐽(𝑤₀, 𝑤₁).
Update Rule:
[ w^j := w^j - \alpha \nabla_{w^j} J(w) ]
Where 𝛼 is the learning rate.
An iteration over the full training set is referred to as an epoch.
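The update rule above can be sketched for the two-parameter model y' = w₀ + w₁x; the data and hyperparameters below are illustrative:

```python
import numpy as np

def gradient_descent(x, y, alpha=0.05, epochs=2000):
    """Batch gradient descent for y' = w0 + w1*x (toy sketch)."""
    m = len(y)
    w0, w1 = 0.0, 0.0                 # start from zeros
    for _ in range(epochs):
        err = (w0 + w1 * x) - y       # prediction error on all m samples
        grad_w0 = np.sum(err) / m     # dJ/dw0
        grad_w1 = np.sum(err * x) / m # dJ/dw1
        w0 -= alpha * grad_w0         # update rule: w := w - alpha * grad
        w1 -= alpha * grad_w1
    return w0, w1

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0                     # ground truth: w1 = 2, w0 = 1
w0, w1 = gradient_descent(x, y)
```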
Page 8: Vectorization and Notation
Efficiency: Use linear algebra tools (Python, Matlab) to compute equations efficiently by vectorizing.
Matrix Equation Formulation:
The predicted outputs can be expressed as:
[ \mathbf{y'} = \mathbf{Xw} ]
Where
𝐗 is the feature matrix (one row per training example, with a leading column of 1s for the bias term),
𝒘 is the weight vector,
𝒚' is the vector of predicted outputs.
Page 9: Loss Function in Vectorized Form
Loss Function:
[ J(w) = \frac{1}{2m} | \mathbf{Xw} - \mathbf{y} |^2 ]
Gradient/Derivative:
[ \frac{\partial}{\partial w} J(w) = \frac{1}{m} \mathbf{X}^{T}(\mathbf{Xw} - \mathbf{y}) ]
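Both vectorized formulas translate directly to NumPy; the toy design matrix below (with a leading column of 1s for the bias) is made up for illustration:

```python
import numpy as np

def loss(w, X, y):
    """J(w) = (1/2m) * ||Xw - y||^2 in vectorized form."""
    m = len(y)
    r = X @ w - y
    return (r @ r) / (2 * m)

def gradient(w, X, y):
    """dJ/dw = (1/m) * X^T (Xw - y)."""
    m = len(y)
    return X.T @ (X @ w - y) / m

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])  # first column: bias
y = np.array([3.0, 5.0, 7.0])                       # y = 1 + 2x
w = np.array([1.0, 2.0])                            # exact-fit weights
print(loss(w, X, y))      # exact fit -> 0.0
print(gradient(w, X, y))  # zero gradient at the minimum
```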
Page 10: Gradient Descent in Plots
Visualization:
Contour plots represent the loss function conveniently, illustrating current weight values and local optima.
Page 11: Variants of Gradient Descent
Types:
Batch Gradient Descent: Uses the entire training set in each iteration.
Mini-batch Gradient Descent: Updates parameters using a small random subset (mini-batch) of training examples in each iteration.
Page 12: Further Variants of GD
Stochastic Gradient Descent (SGD):
Mini-batch with batch_size = 1, updates parameters on each example.
Trade-off: Between accuracy of gradient and computational complexity for each parameter update.
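A minimal sketch of mini-batch gradient descent, where batch_size = len(y) recovers batch GD and batch_size = 1 recovers SGD; the data and hyperparameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def minibatch_gd(X, y, alpha=0.05, epochs=1000, batch_size=2):
    """Mini-batch GD on the vectorized linear model y' = Xw."""
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(epochs):
        idx = rng.permutation(m)                  # shuffle each epoch
        for start in range(0, m, batch_size):
            b = idx[start:start + batch_size]     # indices of one mini-batch
            grad = X[b].T @ (X[b] @ w - y[b]) / len(b)
            w -= alpha * grad                     # update per mini-batch
    return w

X = np.column_stack([np.ones(4), np.array([1.0, 2.0, 3.0, 4.0])])
y = np.array([3.0, 5.0, 7.0, 9.0])                # y = 1 + 2x
w = minibatch_gd(X, y)
```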
Page 13: Implementing GD with Scikit Learn
Code Snippet:
from sklearn.linear_model import SGDRegressor
learning_rate = 0.01
max_iters = 10000
model = SGDRegressor(eta0=learning_rate, learning_rate='constant', max_iter=max_iters)
model.fit(X, y)
Page 14: Advanced Optimization
Sophisticated Algorithms:
Examples include CG (Conjugate Gradient), BFGS (Broyden-Fletcher-Goldfarb-Shanno), L-BFGS (Limited-memory BFGS).
Advantages:
No manual selection of learning rate required, often faster convergence.
Disadvantages:
More complex implementations.
Page 15: Advanced Optimization Implementation
Code Example:
import numpy as np
from scipy.optimize import minimize
X_with_1 = lambda X: np.column_stack([np.ones(X.shape[0]), X])
predict = lambda W, X: X.dot(W)
loss_fn = lambda W, X, y: np.sum((predict(W, X)-y)**2)/(2*len(y))
gradient_fn = lambda W, X, y: X.T.dot(predict(W, X)-y)/len(y)
W = np.zeros(2)
result = minimize(loss_fn, W, method='BFGS', jac=gradient_fn, args=(X_with_1(X), y))
y_test = predict(X_with_1(X_test), result.x)
Page 16: Choosing the Learning Rate (𝜶)
Testing GD Effectiveness:
Plot Observations: Examine loss vs. number of iterations to ensure convergence.
Convergence Behavior:
Too high a learning rate may prevent the loss from decreasing at all and can make it diverge.
Sufficiently small rates ensure the loss decreases at each step, but convergence may be slow.
Page 17: Learning Rate Recommendations
Appropriate Values: Can try a range of values like 0.001, 0.01, 0.1, 1, etc.
Impact Analysis:
A small 𝛼 slows convergence.
A large 𝛼 can cause the loss to diverge.
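One way to see this trade-off is to run batch GD with several learning rates on a toy dataset and compare the final losses within a fixed iteration budget (data and rates are illustrative):

```python
import numpy as np

X = np.column_stack([np.ones(4), np.array([1.0, 2.0, 3.0, 4.0])])
y = np.array([3.0, 5.0, 7.0, 9.0])  # y = 1 + 2x
m = len(y)

def final_loss(alpha, iters=100):
    """Run batch GD with learning rate alpha and return the final loss."""
    w = np.zeros(2)
    for _ in range(iters):
        w -= alpha * X.T @ (X @ w - y) / m
    r = X @ w - y
    return (r @ r) / (2 * m)

# Larger (but still stable) rates converge further in the same budget;
# rates well above the stable range would make the loss blow up instead.
for alpha in [0.001, 0.01, 0.1]:
    print(f"alpha={alpha}: J={final_loss(alpha):.4f}")
```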
Page 18: Training Workflow
Data Sampler: Prepare training data (features X; labels y).
Train Model (learner): Update parameters.
Reporting: Iterate and evaluate based on output.
Decision Point: Determine if more training is required.
Page 19: Linear Regression with Multiple Variables
Model Structure:
[ \hat{h}(x) = w_0 + w_1 x_1 + w_2 x_2 + ... + w_n x_n ]
Each 𝑤ᵢ corresponds to an input feature; 𝑤₀ is the bias.
Example Context:
Variables such as size (sqm), number of floors, number of bedrooms, and age, affect price prediction.
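For illustration, prediction with multiple features is a single matrix-vector product; all numbers below (sizes, bedroom counts, weights) are made up:

```python
import numpy as np

# Rows: [1 (bias), size in sqm, bedrooms]; values are illustrative.
X = np.array([[1.0,  50.0, 1.0],
              [1.0, 100.0, 2.0],
              [1.0, 150.0, 3.0]])
w = np.array([20000.0, 2500.0, 10000.0])  # w0, w_size, w_bedrooms (made up)

prices = X @ w  # one predicted price per row
print(prices)   # e.g. first row: 20000 + 2500*50 + 10000*1 = 155000.0
```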
Page 20: Feature Scaling
Importance of Scaling: Features can vary widely in range (e.g., age 0-120, income 50,000-1,000,000).
Normalization: Techniques to scale features (0-1 range or -1 to +1 range).
Page 21: Common Normalization Methods
Min-Max Normalization: Scale values to a [0, 1] range.
Standardization/Z-score: Normalize features to zero mean and unit variance (the resulting range is not bounded).
Formulated as:
[ x' = \frac{x - \mu}{\sigma} ]
Where 𝜇 = mean and 𝜎 = standard deviation.
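Both methods are one-liners in NumPy; the age values are illustrative:

```python
import numpy as np

ages = np.array([20.0, 40.0, 60.0, 80.0])  # made-up feature values

# Min-max normalization: scale into [0, 1]
min_max = (ages - ages.min()) / (ages.max() - ages.min())

# Z-score standardization: x' = (x - mu) / sigma
z = (ages - ages.mean()) / ages.std()

print(min_max)            # endpoints map to 0 and 1
print(z.mean(), z.std())  # ~0.0 and ~1.0 after standardization
```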
Page 22: Polynomial Regression
Feature Generation: New features can be derived from existing ones.
Polynomial Regression Models:
Example formulation:
[ \hat{h}(x) = w_0 + w_1x + w_2x^2 + w_3x^3 ]
Considerations: When and why to employ polynomial regression.
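A quick sketch of polynomial regression: build powers of x as new features, then fit them linearly (here with np.linalg.lstsq rather than gradient descent); the cubic ground truth is made up:

```python
import numpy as np

def poly_features(x, degree=3):
    """Build columns [1, x, x^2, ..., x^degree] from a single feature."""
    return np.column_stack([x ** d for d in range(degree + 1)])

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x - 0.5 * x ** 3        # illustrative cubic target

X = poly_features(x, degree=3)
w, *_ = np.linalg.lstsq(X, y, rcond=None)  # ordinary least squares
print(np.round(w, 3))                      # recovers [1, 2, 0, -0.5]
```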
Page 23: Non-iterative Optimization: Normal Equation
Analytical Approach: Solve for parameters using linear algebra.
Objective:
[ J(w) = \frac{1}{2m} | \mathbf{Xw} - \mathbf{y} |^2 ]
Derive Parameters: Solve normal equations for optimal weights.
Page 24: Solving Normal Equations
Solution from Derivation:
[ w = (X^TX)^{-1}X^Ty ]
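This closed-form solution is a one-liner in NumPy; solving the linear system with np.linalg.solve avoids explicitly forming the inverse (the toy data is illustrative):

```python
import numpy as np

X = np.column_stack([np.ones(4), np.array([1.0, 2.0, 3.0, 4.0])])
y = np.array([3.0, 5.0, 7.0, 9.0])  # y = 1 + 2x

# Normal equation: solve (X^T X) w = X^T y instead of inverting X^T X
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)  # recovers [1, 2]
```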
Page 25: Gradient Descent vs. Normal Equation
Comparison:
Gradient Descent: Requires choice of learning rate 𝛼, involves iterations.
Normal Equation: Direct calculation with no learning rate or iterations, but computing (XᵀX)⁻¹ is roughly O(n³) in the number of features n, which is costly for large n.
Note: XᵀX may be non-invertible (e.g., linearly dependent features, or more features than training samples).
Page 26: Implementation Strategies
Recommendation: Writing code for learning clarity vs. utilizing ML libraries like Scikit-learn.
Page 27: Example Code with Scikit-learn
Build Model: Using the normal equation and gradient descent methods with Scikit-learn.
from sklearn.linear_model import LinearRegression
model = LinearRegression()
from sklearn.linear_model import SGDRegressor
lr = 0.01
max_iter = 10000
model = SGDRegressor(eta0=lr, learning_rate='constant', max_iter=max_iter)
Page 28: Model Training and Testing
Training the Model:
model.fit(X, y)
Testing Prediction:
Example query:
X_test = [[50, 1], [150, 3]]
y_test = model.predict(X_test)
Output Display: Formatted predictions from test data.
Page 29: Demo Resources
Demo Files:
RSLinearRegression-GD.py: Training using GD and exploring different learning rates.
RSLinearRegression-AdvancedOptm.py: Using BFGS optimized methods.
Access: Shared Google Drive link available in Canvas course page.