Machine Learning: Linear Models

Linear Models Overview

  • Linear models are foundational techniques in machine learning.
  • They can be divided into two major categories: Regression and Classification.

I. Linear Regression

A. History
  • A widely used statistical technique rooted in early 19th-century work:
  • Legendre (1805): first published the method of least squares.
  • Gauss (1809): published his own account, claiming to have used the method since 1795.
  • Initially applied in astronomy, e.g., to fit planetary orbits from noisy observations.
B. Definition
  • Linear regression is used to predict continuous outcomes (real values).
  • Formally: a regression function (f) maps an input feature vector (x \in R^d) to an output value (y \in R).
C. Matrix Representation & Normal Equation Method
  1. Given training data: ((x_1, y_1), (x_2, y_2), …, (x_n, y_n))
  • Where (x_i \in R^d) and (y_i \in R).
  2. The goal: minimize the squared-error loss:
    [ R(\beta) = \frac{1}{2n} \sum_{i=1}^{n} (y_i - f(x_i))^2 ]
  3. Using the Normal Equation:
  • The coefficients are found using:
    [ \hat{\beta} = (X^TX)^{-1}X^Ty ]
  • Where (X) is the (n \times d) feature matrix (with a leading column of ones if an intercept is included); a sketch in NumPy follows.
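
A minimal sketch of the normal-equation fit in NumPy; the toy data and intercept handling are illustrative assumptions, not part of the notes above:

```python
import numpy as np

def fit_normal_equation(X, y):
    """Solve (X^T X) beta = X^T y for the least-squares coefficients."""
    # Solving the linear system is preferred over forming the explicit
    # inverse of X^T X: it is faster and numerically more stable.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Toy data: y = 1 + 2x plus Gaussian noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, size=50)

X = np.column_stack([np.ones_like(x), x])  # prepend an intercept column
print(fit_normal_equation(X, y))           # approximately [1.0, 2.0]
```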
D. Iterative Method: Gradient Descent
  • An optimization algorithm that updates the model coefficients iteratively by stepping against the gradient of the loss.
  1. Update rules (shown for a univariate model (f(x) = \beta_0 + \beta_1 x)):
  • (\beta_0 := \beta_0 - \alpha \frac{\partial R}{\partial \beta_0})
  • (\beta_1 := \beta_1 - \alpha \frac{\partial R}{\partial \beta_1})
  • (\alpha) is the learning rate, controlling the update step size. A sketch of the loop follows.
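
A minimal sketch of batch gradient descent for the univariate case, using the partial derivatives of (R) defined above; the learning rate and iteration count are illustrative choices:

```python
import numpy as np

def gradient_descent(x, y, alpha=0.05, n_iters=2000):
    """Fit y ~ beta0 + beta1 * x by minimizing R with gradient descent."""
    n = len(x)
    beta0, beta1 = 0.0, 0.0
    for _ in range(n_iters):
        residual = y - (beta0 + beta1 * x)
        # Partial derivatives of R = (1/2n) * sum(residual^2)
        grad_b0 = -residual.sum() / n
        grad_b1 = -(residual * x).sum() / n
        beta0 -= alpha * grad_b0
        beta1 -= alpha * grad_b1
    return beta0, beta1

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, size=50)
print(gradient_descent(x, y))  # converges toward (1.0, 2.0)
```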

II. Linear Classification: Perceptron

A. Definition and History
  • The perceptron (Rosenblatt, 1958) is the simplest form of a neural network, used for binary classification.
  • It makes its predictions with a linear predictor function: a weighted combination of the feature vector, thresholded by sign.
B. Example
  • Classification tasks involve determining which of two categories an input belongs to.
  • Outcomes are commonly encoded as binary labels, e.g., ({+1, -1}).
C. Algorithm
  1. Initialize the weight vector (e.g., to zeros or small random values).
  2. For each training sample, predict the label; on a mistake, update the weights with (w := w + y_i x_i).
  3. Repeat passes over the data until every sample is classified correctly; this loop is guaranteed to terminate only if the data are linearly separable. A sketch follows this list.
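
A minimal sketch of the perceptron training loop, assuming labels in ({+1, -1}) and a bias term folded into the weight vector; the epoch cap is an illustrative safeguard for non-separable data:

```python
import numpy as np

def train_perceptron(X, y, max_epochs=100):
    """Train a perceptron on labels in {-1, +1}."""
    Xb = np.column_stack([X, np.ones(len(X))])  # fold the bias into the weights
    w = np.zeros(Xb.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(Xb, y):
            if yi * (w @ xi) <= 0:   # misclassified (or on the boundary)
                w += yi * xi         # perceptron update rule
                errors += 1
        if errors == 0:              # a full pass with no mistakes: converged
            break
    return w

# Linearly separable toy data.
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = train_perceptron(X, y)
print(np.sign(X @ w[:-1] + w[-1]))  # matches y
```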

III. Practical Considerations

A. Scaling Features
  • Ensure all input features are on a similar scale to improve optimization behavior; a sketch of both approaches follows:
  • Standardize using the mean and standard deviation, or rescale to a fixed range with min-max scaling.
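
A minimal sketch of both scalings in NumPy, computed column-wise; in practice, statistics estimated on the training set should be reused to transform test data:

```python
import numpy as np

def standardize(X):
    """Zero-mean, unit-variance scaling per feature column."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def min_max_scale(X):
    """Rescale each feature column to the [0, 1] range."""
    mins, maxs = X.min(axis=0), X.max(axis=0)
    return (X - mins) / (maxs - mins)

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
print(standardize(X))
print(min_max_scale(X))
```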
B. Learning Rate
  • Choose an appropriate learning rate to ensure convergence and stability of the model:
  1. Too small a rate leads to slow convergence.
  2. Too large a rate causes overshooting, or even divergence.
C. Monitor the Loss Function
  • The loss (R) should decrease over iterations, indicating that the model is learning effectively.
  • Declare convergence once the reduction between iterations falls below a defined threshold (\epsilon), as sketched below.
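
A minimal sketch of the stopping check grafted onto the gradient-descent loop from Section I.D; the threshold and learning rate are illustrative:

```python
import numpy as np

def gd_with_stopping(x, y, alpha=0.05, epsilon=1e-9, max_iters=100_000):
    """Run gradient descent until the loss reduction drops below epsilon."""
    n = len(x)
    beta0, beta1 = 0.0, 0.0
    prev_loss = np.inf
    for i in range(max_iters):
        residual = y - (beta0 + beta1 * x)
        loss = (residual ** 2).sum() / (2 * n)  # R(beta)
        if prev_loss - loss < epsilon:          # reduction below threshold
            return beta0, beta1, i              # declare convergence
        prev_loss = loss
        beta0 += alpha * residual.sum() / n
        beta1 += alpha * (residual * x).sum() / n
    return beta0, beta1, max_iters

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, size=50)
print(gd_with_stopping(x, y))  # (beta0, beta1, iterations used)
```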
D. If Invertibility Issues Occur
  1. Situations where (X^TX) is not invertible may arise:
  • More features than examples ((d > n), the high-dimensional setting).
  • Linear dependencies among features (e.g., duplicated or perfectly correlated variables).
  2. A common remedy is to remove redundant features, or to fall back to the Moore-Penrose pseudoinverse (a sketch follows).
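
A minimal sketch of the pseudoinverse fallback; np.linalg.pinv computes the Moore-Penrose pseudoinverse and yields the minimum-norm least-squares solution even when (X^TX) is singular:

```python
import numpy as np

# Feature 3 is an exact copy of 2 * feature 2, so X^T X is singular
# and inverting it fails (or is numerically meaningless).
X = np.array([[1.0, 2.0, 4.0],
              [2.0, 1.0, 2.0],
              [3.0, 5.0, 10.0],
              [4.0, 3.0, 6.0]])
y = np.array([1.0, 2.0, 3.0, 4.0])

beta_hat = np.linalg.pinv(X) @ y  # minimum-norm least-squares solution
print(beta_hat)
print(X @ beta_hat)  # least-squares reconstruction of y
```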

IV. Pros and Cons of Approaches

A. Normal Equation Method
  • Pros:
  • Direct computation avoids iteration.
  • Produces an exact closed-form solution, with no learning rate to tune.
  • Cons:
  • Computationally expensive for large (d): inverting (X^TX) costs (O(d^3)).
  • Only directly applicable if (X^TX) is invertible (see Section III.D).
B. Gradient Descent
  • Pros:
  • Scales well with large datasets and high dimensions.
  • Can handle non-invertible situations effectively.
  • Cons:
  • May require careful tuning of the learning rate.
  • Needs many iterations to reach convergence.
