Machine Learning: Linear Models
Linear Models Overview
- Linear models are foundational techniques in machine learning.
- They can be divided into two major categories: Regression and Classification.
I. Linear Regression
A. History
- A highly popular statistical technique rooted in historical work:
- Gauss (c. 1795): developed the method of least squares (published in 1809).
- Legendre (1805): first to publish the method, developed independently.
- Initially applied in fields like astronomy to analyze large-scale data.
B. Definition
- Linear regression is used to predict continuous outcomes (real values).
- Formally: a regression function (f) maps an input feature vector (x \in R^d) to an output value (y \in R).
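- Concretely, the linear model takes the affine form below, with intercept (\beta_0) and weights (\beta_1, \ldots, \beta_d):
[ f(x) = \beta_0 + \sum_{j=1}^{d} \beta_j x_j ]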
C. Matrix Representation & Normal Equation Method
- Given training data: ((x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n))
- Where (x_i \in R^d) and (y_i \in R).
- The goal: minimize the mean squared error loss
[ R(\beta) = \frac{1}{2n} \sum_{i=1}^{n} (y_i - f(x_i))^2 ]
- Using the Normal Equation:
- The coefficients are found using:
[ \hat{\beta} = (X^T X)^{-1} X^T y ]
- Where (X) is the (n \times d) feature matrix whose rows are the training examples and (y) is the vector of targets.
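A minimal NumPy sketch of this closed-form fit (the helper name fit_normal_equation and the prepended column of ones for the intercept are illustrative choices, not from the notes):

```python
import numpy as np

def fit_normal_equation(X, y):
    """Compute beta_hat = (X^T X)^{-1} X^T y via a linear solve.

    A column of ones is prepended so beta_hat[0] is the intercept.
    np.linalg.solve is preferred over an explicit inverse for
    numerical stability; it raises an error if X^T X is singular
    (see section III.D).
    """
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])  # intercept column
    return np.linalg.solve(Xb.T @ Xb, Xb.T @ y)

# Usage on synthetic data generated as y = 1 + 2x + noise:
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 1.0 + 2.0 * X[:, 0] + 0.1 * rng.normal(size=100)
print(fit_normal_equation(X, y))  # roughly [1.0, 2.0]
```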
D. Iterative Method: Gradient Descent
- An optimization algorithm that updates model coefficients iteratively.
- Update rules:
- (\beta_0 := \beta_0 - \alpha \frac{\partial R}{\partial \beta_0})
- (\beta_1 := \beta_1 - \alpha \frac{\partial R}{\partial \beta_1})
- Analogous updates apply to every coefficient, all applied simultaneously within an iteration.
- (\alpha) is the learning rate, controlling the update step size.
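A vectorized sketch of this loop for the loss (R(\beta)) above (the defaults lr=0.1 and n_iters=1000 are illustrative assumptions):

```python
import numpy as np

def gradient_descent(X, y, lr=0.1, n_iters=1000):
    """Minimize R(beta) = (1/2n) * sum((y_i - f(x_i))^2) iteratively.

    The gradient of R with respect to beta is -(1/n) X^T (y - X beta),
    so each step moves beta against it; all coefficients are updated
    simultaneously.
    """
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])  # intercept column
    n = Xb.shape[0]
    beta = np.zeros(Xb.shape[1])
    for _ in range(n_iters):
        residual = y - Xb @ beta
        beta -= lr * (-(Xb.T @ residual) / n)  # beta := beta - alpha * dR/dbeta
    return beta
```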
II. Linear Classification: Perceptron
A. Definition and History
- The perceptron is the simplest form of a neural network used for binary classification.
- It makes its predictions based on a linear predictor function combining a set of weights with the feature vector.
B. Example
- Classification tasks involve determining which of two categories an input belongs to.
- Commonly simplified as predicting binary outcomes (e.g., {1, -1}).
C. Algorithm
- Initialize weights randomly.
- For each training sample, predict the label and adjust weights based on prediction error.
- Repeat passes over the training set until no samples are misclassified, or an epoch limit is reached (see the sketch below).
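A minimal sketch of this algorithm, assuming labels in {+1, -1} (the epoch cap max_epochs is an added safeguard, since the loop terminates on its own only for linearly separable data):

```python
import numpy as np

def perceptron_train(X, y, max_epochs=100, seed=0):
    """Perceptron learning rule for labels y in {+1, -1}.

    Prediction is sign(w . x + b); on each mistake the weights are
    nudged toward the misclassified sample: w += y_i * x_i, b += y_i.
    """
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.01, size=X.shape[1])  # small random init
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) <= 0:  # misclassified (or on the boundary)
                w += yi * xi
                b += yi
                mistakes += 1
        if mistakes == 0:  # converged: every sample classified correctly
            break
    return w, b
```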
III. Practical Considerations
A. Scaling Features
- Ensure all input features are on a similar scale to improve model performance:
- Standardize using mean and standard deviation or min-max scaling.
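Both scalings as short NumPy helpers (the eps guard against zero-variance or constant features is an added assumption):

```python
import numpy as np

def standardize(X, eps=1e-12):
    """Per-feature z-score: subtract the mean, divide by the std."""
    return (X - X.mean(axis=0)) / (X.std(axis=0) + eps)

def min_max_scale(X, eps=1e-12):
    """Rescale each feature to the [0, 1] range."""
    mins, maxs = X.min(axis=0), X.max(axis=0)
    return (X - mins) / (maxs - mins + eps)
```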
B. Learning Rate
- Choose an appropriate learning rate to ensure convergence and stability of the model:
- Too small a rate leads to slow convergence.
- Too large a rate can overshoot the minimum and diverge.
C. Monitor Loss Function
- The loss function (R) should decrease over iterations, indicating that the model is learning effectively.
- Declare convergence once the per-iteration reduction falls below a defined threshold (\epsilon).
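The stopping rule wired into the gradient-descent loop from section I.D (eps=1e-6 is an illustrative threshold):

```python
import numpy as np

def gradient_descent_with_stop(X, y, lr=0.1, eps=1e-6, max_iters=10_000):
    """Gradient descent that stops once the loss reduction drops below eps."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])  # intercept column
    n = Xb.shape[0]
    beta = np.zeros(Xb.shape[1])
    prev_loss = np.inf
    for _ in range(max_iters):
        residual = y - Xb @ beta
        loss = residual @ residual / (2 * n)  # R(beta)
        if prev_loss - loss < eps:            # reduction below threshold
            break                             # declare convergence
        prev_loss = loss
        beta -= lr * (-(Xb.T @ residual) / n)
    return beta
```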
D. If Invertibility Issues Occur
- Situations where (X^TX) is not invertible may arise:
- More features than examples (high-dimensional space).
- Linear dependencies among features (e.g., correlated variables).
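One common workaround in either case, sketched here with NumPy's pseudoinverse (a standard remedy, though not one these notes prescribe):

```python
import numpy as np

def fit_pinv(X, y):
    """Least-squares fit via the Moore-Penrose pseudoinverse.

    np.linalg.pinv is defined even when X^T X is singular (more
    features than examples, or linearly dependent columns) and
    returns the minimum-norm least-squares solution.
    """
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])  # intercept column
    return np.linalg.pinv(Xb) @ y
```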
IV. Pros and Cons of Approaches
A. Normal Equation Method
- Pros:
- Direct computation avoids iteration.
- Yields an exact closed-form solution (when (X^TX) is invertible).
- Cons:
- Computationally expensive for large (d): inverting (X^TX) costs (O(d^3)).
- Only applicable if (X^TX) is invertible.
B. Gradient Descent
- Pros:
- Scales well with large datasets and high dimensions.
- Works even when (X^TX) is not invertible.
- Cons:
- May require careful tuning of the learning rate.
- Requires many iterations to reach convergence.