Machine Learning: Linear Models

Linear Models Overview

  • Linear models are foundational techniques in machine learning.
  • They can be divided into two major categories: Regression and Classification.

I. Linear Regression

A. History
  • A widely used statistical technique rooted in early 19th-century work:
  • Legendre (1805): first published the method of least squares.
  • Gauss (1809): published his own account, claiming to have used the method since 1795.
  • Initially applied in astronomy, e.g., to fit planetary orbits from noisy observations.
B. Definition
  • Linear regression is used to predict continuous outcomes (real values).
  • Formally: a regression function (f) maps an input feature vector (x \in R^d) to an output value (y \in R).
C. Matrix Representation & Normal Equation Method
  1. Given training data: ((x_1, y_1), (x_2, y_2), …, (x_n, y_n))
  • Where (x_i \in R^d) and (y_i \in R).
  2. The goal: minimize the squared-error loss:
    [ R(\beta) = \frac{1}{2n} \sum_{i=1}^{n} (y_i - f(x_i))^2 ]
  3. Using the Normal Equation:
  • The coefficients are found using:
    [ \hat{\beta} = (X^TX)^{-1}X^Ty ]
  • Where (X) is the (n \times d) feature matrix (with a leading column of ones if an intercept is included); a sketch in NumPy follows.
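
A minimal sketch of the normal-equation fit in NumPy; the toy data and intercept handling are illustrative assumptions, not part of the notes above:

```python
import numpy as np

def fit_normal_equation(X, y):
    """Solve (X^T X) beta = X^T y for the least-squares coefficients."""
    # Solving the linear system is preferred over forming the explicit
    # inverse of X^T X: it is faster and numerically more stable.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Toy data: y = 1 + 2x plus Gaussian noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, size=50)

X = np.column_stack([np.ones_like(x), x])  # prepend an intercept column
print(fit_normal_equation(X, y))           # approximately [1.0, 2.0]
```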
D. Iterative Method: Gradient Descent
  • An optimization algorithm that updates the model coefficients iteratively by stepping against the gradient of the loss.
  1. Update rules (shown for a univariate model (f(x) = \beta_0 + \beta_1 x)):
  • (\beta_0 := \beta_0 - \alpha \frac{\partial R}{\partial \beta_0})
  • (\beta_1 := \beta_1 - \alpha \frac{\partial R}{\partial \beta_1})
  • (\alpha) is the learning rate, controlling the update step size. A sketch of the loop follows.
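
A minimal sketch of batch gradient descent for the univariate case, using the partial derivatives of (R) defined above; the learning rate and iteration count are illustrative choices:

```python
import numpy as np

def gradient_descent(x, y, alpha=0.05, n_iters=2000):
    """Fit y ~ beta0 + beta1 * x by minimizing R with gradient descent."""
    n = len(x)
    beta0, beta1 = 0.0, 0.0
    for _ in range(n_iters):
        residual = y - (beta0 + beta1 * x)
        # Partial derivatives of R = (1/2n) * sum(residual^2)
        grad_b0 = -residual.sum() / n
        grad_b1 = -(residual * x).sum() / n
        beta0 -= alpha * grad_b0
        beta1 -= alpha * grad_b1
    return beta0, beta1

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, size=50)
print(gradient_descent(x, y))  # converges toward (1.0, 2.0)
```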

II. Linear Classification: Perceptron

A. Definition and History
  • The perceptron (Rosenblatt, 1958) is the simplest form of a neural network, used for binary classification.
  • It makes its predictions with a linear predictor function: a weighted combination of the feature vector, thresholded by sign.
B. Example
  • Classification tasks involve determining which of two categories an input belongs to.
  • Outcomes are commonly encoded as binary labels, e.g., ({+1, -1}).
C. Algorithm
  1. Initialize the weight vector (e.g., to zeros or small random values).
  2. For each training sample, predict the label; on a mistake, update the weights with (w := w + y_i x_i).
  3. Repeat passes over the data until every sample is classified correctly; this loop is guaranteed to terminate only if the data are linearly separable. A sketch follows this list.
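
A minimal sketch of the perceptron training loop, assuming labels in ({+1, -1}) and a bias term folded into the weight vector; the epoch cap is an illustrative safeguard for non-separable data:

```python
import numpy as np

def train_perceptron(X, y, max_epochs=100):
    """Train a perceptron on labels in {-1, +1}."""
    Xb = np.column_stack([X, np.ones(len(X))])  # fold the bias into the weights
    w = np.zeros(Xb.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(Xb, y):
            if yi * (w @ xi) <= 0:   # misclassified (or on the boundary)
                w += yi * xi         # perceptron update rule
                errors += 1
        if errors == 0:              # a full pass with no mistakes: converged
            break
    return w

# Linearly separable toy data.
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = train_perceptron(X, y)
print(np.sign(X @ w[:-1] + w[-1]))  # matches y
```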

III. Practical Considerations

A. Scaling Features
  • Ensure all input features are on a similar scale to improve optimization behavior; a sketch of both approaches follows:
  • Standardize using the mean and standard deviation, or rescale to a fixed range with min-max scaling.
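
A minimal sketch of both scalings in NumPy, computed column-wise; in practice, statistics estimated on the training set should be reused to transform test data:

```python
import numpy as np

def standardize(X):
    """Zero-mean, unit-variance scaling per feature column."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def min_max_scale(X):
    """Rescale each feature column to the [0, 1] range."""
    mins, maxs = X.min(axis=0), X.max(axis=0)
    return (X - mins) / (maxs - mins)

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
print(standardize(X))
print(min_max_scale(X))
```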
B. Learning Rate
  • Choose an appropriate learning rate to ensure convergence and stability of the model:
  1. Too small a rate leads to slow convergence.
  2. Too large a rate causes overshooting, or even divergence.
C. Monitor the Loss Function
  • The loss (R) should decrease over iterations, indicating that the model is learning effectively.
  • Declare convergence once the reduction between iterations falls below a defined threshold (\epsilon), as sketched below.
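
A minimal sketch of the stopping check grafted onto the gradient-descent loop from Section I.D; the threshold and learning rate are illustrative:

```python
import numpy as np

def gd_with_stopping(x, y, alpha=0.05, epsilon=1e-9, max_iters=100_000):
    """Run gradient descent until the loss reduction drops below epsilon."""
    n = len(x)
    beta0, beta1 = 0.0, 0.0
    prev_loss = np.inf
    for i in range(max_iters):
        residual = y - (beta0 + beta1 * x)
        loss = (residual ** 2).sum() / (2 * n)  # R(beta)
        if prev_loss - loss < epsilon:          # reduction below threshold
            return beta0, beta1, i              # declare convergence
        prev_loss = loss
        beta0 += alpha * residual.sum() / n
        beta1 += alpha * (residual * x).sum() / n
    return beta0, beta1, max_iters

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, size=50)
print(gd_with_stopping(x, y))  # (beta0, beta1, iterations used)
```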
D. If Invertibility Issues Occur
  1. Situations where (X^TX) is not invertible may arise:
  • More features than examples ((d > n), the high-dimensional setting).
  • Linear dependencies among features (e.g., duplicated or perfectly correlated variables).
  2. A common remedy is to remove redundant features, or to fall back to the Moore-Penrose pseudoinverse (a sketch follows).
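
A minimal sketch of the pseudoinverse fallback; np.linalg.pinv computes the Moore-Penrose pseudoinverse and yields the minimum-norm least-squares solution even when (X^TX) is singular:

```python
import numpy as np

# Feature 3 is an exact copy of 2 * feature 2, so X^T X is singular
# and inverting it fails (or is numerically meaningless).
X = np.array([[1.0, 2.0, 4.0],
              [2.0, 1.0, 2.0],
              [3.0, 5.0, 10.0],
              [4.0, 3.0, 6.0]])
y = np.array([1.0, 2.0, 3.0, 4.0])

beta_hat = np.linalg.pinv(X) @ y  # minimum-norm least-squares solution
print(beta_hat)
print(X @ beta_hat)  # least-squares reconstruction of y
```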

IV. Pros and Cons of Approaches

A. Normal Equation Method
  • Pros:
  • Direct computation avoids iteration.
  • Produces an exact closed-form solution, with no learning rate to tune.
  • Cons:
  • Computationally expensive for large (d): inverting (X^TX) costs (O(d^3)).
  • Only directly applicable if (X^TX) is invertible (see Section III.D).
B. Gradient Descent
  • Pros:
  • Scales well with large datasets and high dimensions.
  • Can handle non-invertible situations effectively.
  • Cons:
  • May require careful tuning of the learning rate.
  • Needs many iterations to reach convergence.
