ML02-Linear regression

Page 1: Introduction to Linear Regression

Oslo Metropolitan University
Linear Regression
Instructor: Raju Shrestha


Page 2: Overview of Linear Regression

  • Definition: Linear regression is a supervised machine learning method used for predicting a continuous output value by establishing a linear relationship between input features and the output.

  • Example: Predicting house prices based on house size (in square meters).


Page 3: Linear Regression Model

  • Input Feature:

    • 𝑥 = house size (sqm)

  • Model Equation:

    [ y' = \hat{h}(x) = wx + b ]

    • Where:

      • 𝑤 = parameter (weight)

      • 𝑏 = bias

  • Hypothesis Function:

    [ \hat{h}(x) = w_0 + w_1x ]

  • Output:

    • 𝑦 = real/observed house price

    • 𝑦' = predicted house price

  • Visualization: Linear regression model represents the relationship between house size and prices.
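As a minimal sketch, the single-feature model above can be written in a few lines of Python; the weight and bias values below are hypothetical, chosen only for illustration.

```python
# Hypothesis h(x) = w0 + w1*x for predicting price from house size.
# The default weights w0, w1 are made-up illustration values, not learned ones.
def predict_price(x_sqm, w0=50.0, w1=2.5):
    """Predicted house price for a house of x_sqm square meters."""
    return w0 + w1 * x_sqm

print(predict_price(100))  # 50 + 2.5 * 100 = 300.0
```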


Page 4: Cost Function

  • Objective: Train the model using a set of labeled training data (m examples).

  • Goal: Determine model parameters 𝑤 (or 𝑤 and 𝑏) such that the estimated output 𝑦' is close to the real value 𝑦 for training samples (good fit).

  • Loss Function: A commonly used loss function is the least-squares error, given by:

    [ J(w) = \frac{1}{2m} \sum_{i=1}^{m} (\hat{h}(x^{(i)}) - y^{(i)})^2 ]
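The least-squares loss above can be computed directly, as sketched below; the house sizes and prices are made-up toy values.

```python
import numpy as np

def loss(w0, w1, x, y):
    """Least-squares loss J(w) = 1/(2m) * sum_i (h(x_i) - y_i)^2."""
    m = len(y)
    y_hat = w0 + w1 * x          # predictions of the linear model
    return np.sum((y_hat - y) ** 2) / (2 * m)

x = np.array([50.0, 100.0, 150.0])   # hypothetical house sizes (sqm)
y = np.array([150.0, 300.0, 450.0])  # hypothetical prices (exactly y = 3x)
print(loss(0.0, 3.0, x, y))  # perfect fit, loss = 0.0
print(loss(0.0, 2.0, x, y))  # worse fit, loss = 35000/6
```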


Page 5: Optimization in Linear Regression

  • Goal: Minimize the loss function J(w):

    [ \min_{w} J(w) ]

  • Types of Optimization Methods:

    1. Iterative optimization (e.g. gradient descent)

    2. Non-iterative method (using normal equations)


Page 6: Gradient Descent (GD) Optimization

  • Initialization: Assume initial parameter values, say 𝑤 = 0 for simplicity.

  • Gradient Calculation:

    [ \nabla J = 0 ] if at a minimum,

    • ( \nabla J > 0 ) indicates an increasing function,

    • ( \nabla J < 0 ) indicates a decreasing function.


Page 7: Gradient Descent Algorithm

  • Process:

    • Start with initial weights (random or zeros).

    • Update weights iteratively to reduce 𝐽(𝑤₀, 𝑤₁).

  • Update Rule:

    [ w_j \leftarrow w_j - \alpha \nabla_{w_j} J(w) ]

    • Where 𝛼 is the learning rate.

    • A full pass over the training set is referred to as an epoch (for batch GD, one iteration corresponds to one epoch).
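The update rule above can be sketched as a plain batch gradient-descent loop; the toy data and hyperparameters here are illustrative assumptions.

```python
import numpy as np

def gradient_descent(x, y, alpha=0.05, epochs=5000):
    """Batch GD for h(x) = w0 + w1*x using w_j <- w_j - alpha * dJ/dw_j."""
    m = len(y)
    w0 = w1 = 0.0
    for _ in range(epochs):          # one iteration = one epoch for batch GD
        err = (w0 + w1 * x) - y      # prediction errors on all m samples
        w0 -= alpha * np.sum(err) / m
        w1 -= alpha * np.sum(err * x) / m
    return w0, w1

x = np.array([1.0, 2.0, 3.0])   # toy feature values
y = np.array([3.0, 6.0, 9.0])   # toy targets (exactly y = 3x)
w0, w1 = gradient_descent(x, y)
print(w0, w1)  # converges close to (0, 3)
```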


Page 8: Vectorization and Notation

  • Efficiency: Use linear algebra tools (Python, Matlab) to compute equations efficiently by vectorizing.

  • Matrix Equation Formulation:

    • The predictions can be expressed in matrix form as: [ \mathbf{y}' = \mathbf{Xw} ]

    • Where

      • 𝐗 is the feature matrix (one row per training example),

      • 𝒘 is the weight vector,

      • 𝒚' is the vector of predicted outputs.


Page 9: Loss Function in Vectorized Form

  • Loss Function:

    [ J(w) = \frac{1}{2m} \| \mathbf{Xw} - \mathbf{y} \|^2 ]

  • Gradient/Derivative:

    [ \frac{\partial}{\partial w} J(w) = \frac{1}{m} \mathbf{X}^{T}(\mathbf{Xw} - \mathbf{y}) ]
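A minimal NumPy sketch of the vectorized loss and gradient, assuming X already carries a leading column of ones for the bias:

```python
import numpy as np

def loss(w, X, y):
    """J(w) = 1/(2m) * ||Xw - y||^2 in vectorized form."""
    return np.sum((X @ w - y) ** 2) / (2 * len(y))

def grad(w, X, y):
    """dJ/dw = 1/m * X^T (Xw - y) in vectorized form."""
    return X.T @ (X @ w - y) / len(y)

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])  # toy data, bias column first
y = np.array([3.0, 6.0, 9.0])
w = np.array([0.0, 3.0])  # exact solution for this toy data
print(loss(w, X, y))      # 0.0
print(grad(w, X, y))      # [0. 0.]
```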


Page 10: Gradient Descent in Plots

  • Visualization:

    • Contour plots represent the loss function conveniently, illustrating current weight values and local optima.


Page 11: Variants of Gradient Descent

  • Types:

    1. Batch Gradient Descent: Uses entire training samples in each iteration.

    2. Mini-batch Gradient Descent: Uses a small random subset (mini-batch) of training samples in each iteration.


Page 12: Further Variants of GD

  • Stochastic Gradient Descent (SGD):

    • Mini-batch gradient descent with batch_size = 1; parameters are updated on each training example.

  • Trade-off: Between accuracy of gradient and computational complexity for each parameter update.
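The batch/mini-batch/SGD spectrum can be sketched with one function whose batch_size selects the variant; the toy data and hyperparameters below are illustrative assumptions (X again carries a bias column of ones).

```python
import numpy as np

def minibatch_gd(X, y, alpha=0.05, epochs=200, batch_size=1, seed=0):
    """Mini-batch GD: batch_size=1 is SGD, batch_size=len(y) is batch GD."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(epochs):
        idx = rng.permutation(m)                 # reshuffle each epoch
        for start in range(0, m, batch_size):
            b = idx[start:start + batch_size]    # indices of this mini-batch
            grad = X[b].T @ (X[b] @ w - y[b]) / len(b)
            w -= alpha * grad
    return w

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])  # bias column + feature
y = np.array([3.0, 6.0, 9.0])
print(minibatch_gd(X, y, batch_size=1))  # SGD; approaches [0, 3]
```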


Page 13: Implementing GD with Scikit Learn

  • Code Snippet:

from sklearn.linear_model import SGDRegressor  
learning_rate = 0.01  
max_iters = 10000  
model = SGDRegressor(eta0=learning_rate, learning_rate='constant', max_iter=max_iters)  
model.fit(X, y)  

Page 14: Advanced Optimization

  • Sophisticated Algorithms:

    • Examples include CG (Conjugate Gradient), BFGS (Broyden-Fletcher-Goldfarb-Shanno), L-BFGS (Limited-memory BFGS).

  • Advantages:

    • No manual selection of learning rate required, often faster convergence.

  • Disadvantages:

    • More complex implementations.


Page 15: Advanced Optimization Implementation

  • Code Example:

import numpy as np  
from scipy.optimize import minimize  
X_with_1 = lambda X: np.column_stack([np.ones(X.shape[0]), X])  # prepend a bias column of ones  
predict = lambda W, X: X.dot(W)  
loss_fn = lambda W, X, y: np.sum((predict(W, X) - y)**2) / (2 * len(y))  
gradient_fn = lambda W, X, y: X.T.dot(predict(W, X) - y) / len(y)  
W = np.zeros(2)  # initial weights [w0, w1]  
result = minimize(loss_fn, W, method='BFGS', jac=gradient_fn, args=(X_with_1(X), y))  
y_test = predict(result.x, X_with_1(X_test))  

Page 16: Choosing the Learning Rate (𝜶)

  • Testing GD Effectiveness:

  • Plot Observations: Examine loss vs. number of iterations to ensure convergence.

  • Convergence Behavior:

    • Too high a learning rate may prevent the loss from decreasing and can lead to divergence.

    • Small rates ensure loss decreases at each step but may slow convergence.


Page 17: Learning Rate Recommendations

  • Appropriate Values: Can try a range of values like 0.001, 0.01, 0.1, 1, etc.

  • Impact Analysis:

    • Small 𝛼 slows convergence.

    • Large 𝛼 can cause the loss function to diverge.
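One way to compare learning rates, sketched on made-up toy data: run a fixed number of GD iterations per rate and inspect the final loss (a rate that is too large makes the loss blow up on this data).

```python
import numpy as np

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])  # toy data, bias column first
y = np.array([3.0, 6.0, 9.0])

def final_loss(alpha, iters=100):
    """Loss after a fixed number of batch GD iterations with learning rate alpha."""
    w = np.zeros(2)
    for _ in range(iters):
        w -= alpha * X.T @ (X @ w - y) / len(y)
    return np.sum((X @ w - y) ** 2) / (2 * len(y))

for alpha in [0.001, 0.01, 0.1, 1.0]:
    print(alpha, final_loss(alpha))  # loss diverges for alpha = 1.0 on this data
```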


Page 18: Training Workflow

  1. Data Sampler: Prepare training data (features X; labels y).

  2. Train Model (learner): Update parameters.

  3. Reporting: Iterate and evaluate based on output.

  4. Decision Point: Determine if more training is required.


Page 19: Linear Regression with Multiple Variables

  • Model Structure:

    [ \hat{h}(x) = w_1 x_1 + w_2 x_2 + ... + w_n x_n ]

    • Each 𝑤 corresponds to a feature of the input.

  • Example Context:

    • Variables such as size (sqm), number of floors, number of bedrooms, and age affect price prediction.
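A sketch of the multi-variable model on hypothetical data; the weights and bias are made-up illustration values, not fitted ones.

```python
import numpy as np

# Hypothetical houses, one row each: [size_sqm, floors, bedrooms, age]
X = np.array([[80.0, 1.0, 2.0, 10.0],
              [120.0, 2.0, 3.0, 5.0],
              [150.0, 2.0, 4.0, 2.0]])
w = np.array([2.0, 10.0, 5.0, -0.5])  # one (made-up) weight per feature
b = 20.0                              # made-up bias

y_hat = X @ w + b   # h(x) = w1*x1 + ... + wn*xn + b for all houses at once
print(y_hat)        # values: 195.0, 292.5, 359.0
```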


Page 20: Feature Scaling

  • Importance of Scaling: Features can vary widely in range (e.g., age 0-120, income 50,000-1,000,000).

  • Normalization: Techniques to scale features (0-1 range or -1 to +1 range).


Page 21: Common Normalization Methods

  1. Min-Max Normalization: Scale values to a [0, 1] range.

  2. Standardization/Z-score: Normalize features to zero mean and unit variance (unbounded range).

    • Formulated as:

  • [ x' = \frac{x - \mu}{\sigma} ]

  • Where 𝜇 = mean and 𝜎 = standard deviation.
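Both normalization methods can be sketched on a hypothetical age feature:

```python
import numpy as np

x = np.array([18.0, 35.0, 62.0, 90.0])  # hypothetical ages

# Min-max normalization: rescales values into [0, 1]
x_minmax = (x - x.min()) / (x.max() - x.min())

# Z-score standardization: zero mean, unit standard deviation (unbounded range)
x_z = (x - x.mean()) / x.std()

print(x_minmax)                 # smallest value maps to 0.0, largest to 1.0
print(x_z.mean(), x_z.std())    # approximately 0.0 and 1.0
```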


Page 22: Polynomial Regression

  • Feature Generation: New features can derive from existing ones.

  • Polynomial Regression Models:

    • Example formulation:

      [ \hat{h}(x) = w_0 + w_1x + w_2x^2 + w_3x^3 ]

    • Considerations: When and why to employ polynomial regression.
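Polynomial regression is still linear in the weights, so ordinary least squares applies once the powers of x are generated as features. A sketch on noiseless made-up data:

```python
import numpy as np

x = np.linspace(-2.0, 2.0, 20)
y = 1 + 2 * x - 0.5 * x**2 + 0.1 * x**3  # hypothetical cubic relationship

# Generate polynomial features [1, x, x^2, x^3] from the single input x
X_poly = np.column_stack([np.ones_like(x), x, x**2, x**3])

# The model is linear in w, so ordinary least squares recovers the weights
w, *_ = np.linalg.lstsq(X_poly, y, rcond=None)
print(w)  # close to [1, 2, -0.5, 0.1] on this noiseless data
```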


Page 23: Non-iterative Optimization: Normal Equation

  • Analytical Approach: Solve for parameters using linear algebra.

  • Loss Function (to minimize):

    [ J(w) = \frac{1}{2m} \| \mathbf{Xw} - \mathbf{y} \|^2 ]

  • Derive Parameters: Solve normal equations for optimal weights.


Page 24: Solving Normal Equations

  • Solution from Derivation:

    [ w = (X^TX)^{-1}X^Ty ]
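The closed-form solution above is a one-liner in NumPy, sketched here on toy data; if X^T X is singular, the pseudo-inverse can be used instead.

```python
import numpy as np

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])  # toy data, bias column first
y = np.array([3.0, 6.0, 9.0])

# Normal equation: w = (X^T X)^{-1} X^T y
w = np.linalg.inv(X.T @ X) @ X.T @ y
print(w)  # close to [0, 3] for this toy data

# If X^T X is singular (e.g., redundant features), use the pseudo-inverse:
w_pinv = np.linalg.pinv(X) @ y
```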


Page 25: Gradient Descent vs. Normal Equation

  • Comparison:

    • Gradient Descent: Requires choice of learning rate 𝛼, involves iterations.

    • Normal Equation: Direct calculation, no need for a learning rate or iterations, but computing (X^TX)^{-1} is expensive (roughly O(n^3)) for a large number of features n.

    • Note: X^TX may be non-invertible (e.g., redundant features or more features than samples); the pseudo-inverse can be used instead.


Page 26: Implementation Strategies

  • Recommendation: Write your own code for learning and clarity, or use ML libraries like Scikit-learn in practice.


Page 27: Example Code with Scikit-learn

  • Build Model: Using the normal equation and gradient descent methods with Scikit-learn.

from sklearn.linear_model import LinearRegression  
model = LinearRegression()  
from sklearn.linear_model import SGDRegressor  
lr = 0.01  
max_iter = 10000  
model = SGDRegressor(eta0=lr, learning_rate='constant', max_iter=max_iter)  

Page 28: Model Training and Testing

  • Training the Model:

model.fit(X, y)  

  • Testing Prediction:

    • Example query:

X_test = [[50, 1], [150, 3]]  
y_test = model.predict(X_test)  

  • Output Display: Formatted predictions from test data.


Page 29: Demo Resources

  • Demo Files:

    • RSLinearRegression-GD.py: Training using GD and exploring different learning rates.

    • RSLinearRegression-AdvancedOptm.py: Using BFGS optimized methods.

  • Access: Shared Google Drive link available in Canvas course page.