Dimensionality Reduction in Machine Learning
Lecture Overview
The lecture on dimensionality reduction techniques has been recorded and is available for review; it covers the essential concepts and their practical applications.
Announcements
Upcoming Quiz 4 and Assignment 5 are scheduled. Students are advised to prepare by reviewing the materials covered in recent lectures, especially those pertaining to data preprocessing and dimensionality reduction.
Topic: Dimensionality Reduction
This lecture, part of SEIS 763, Fall 2025, focuses on techniques that reduce the number of features in a dataset while retaining crucial information, which is vital for efficient model training and interpretability.
Introduction to Dimensionality Reduction
Prior to this topic, key aspects of data preprocessing covered include:
Data manipulation techniques:
Imputing: The process of filling in missing values within the dataset. Strategies include mean, median, mode imputation, or more advanced methods like K-Nearest Neighbors (KNN) imputation, chosen based on the data distribution and the nature of the missingness.
Scaling: Normalizing or standardizing data to a standard scale to prevent features with larger values from dominating the learning process. Common methods include Min-Max scaling and Z-score standardization.
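As a brief recap of these two steps, here is a minimal sketch assuming scikit-learn and a small hypothetical NumPy array with one missing value; the specific data and strategy choices are illustrative only:

    import numpy as np
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import StandardScaler

    # Hypothetical feature matrix with one missing entry (np.nan)
    X = np.array([[1.0, 200.0],
                  [2.0, np.nan],
                  [3.0, 600.0]])

    # Impute: fill the missing entry with the column mean
    X_imputed = SimpleImputer(strategy="mean").fit_transform(X)

    # Scale: Z-score standardization (zero mean, unit variance per feature)
    X_scaled = StandardScaler().fit_transform(X_imputed)
    print(X_scaled.mean(axis=0), X_scaled.std(axis=0))  # approximately 0 and 1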
Thus far, the focus has been on manipulating individual features without comprehensively addressing the total count of features, which can become problematic in high-dimensional datasets.
Feature Selection vs. Feature Extraction
Feature Selection:
This is the process of selecting a subset of relevant features from the original set without altering them. The goal is to identify features that contribute most to the model's performance and eliminate irrelevant or redundant ones.
Techniques involved:
Forward selection: Begins with an empty set of features and iteratively adds the feature that provides the most significant improvement in model performance until no further meaningful improvement is observed (a code sketch follows this list).
Backward selection: Starts with all features and iteratively removes the least significant feature (the one whose removal least impacts model performance) until a subset of optimal features remains.
Stepwise selection: A combination of forward and backward selection, allowing features to be added or removed at each step to reach a strong subset efficiently.
Result: The final features are a direct subset of the original features; no new features are created. This means the interpretability of features is preserved.
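To illustrate forward selection concretely, the sketch below uses scikit-learn's SequentialFeatureSelector with a logistic regression estimator on the built-in Iris data; the estimator and the target of two selected features are assumptions made only for demonstration:

    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SequentialFeatureSelector
    from sklearn.linear_model import LogisticRegression

    X, y = load_iris(return_X_y=True)

    # Forward selection: start with no features and greedily add the one
    # that most improves the cross-validated score until two are selected.
    selector = SequentialFeatureSelector(
        LogisticRegression(max_iter=1000),
        n_features_to_select=2,
        direction="forward",
    )
    selector.fit(X, y)
    print(selector.get_support())  # boolean mask over the original features

Setting direction="backward" gives backward selection. Note that the selected columns are unchanged original features, in contrast to the transformed components produced by feature extraction, discussed next.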
Feature Extraction:
This involves transforming or projecting the data from a high-dimensional space into a new feature space with a significantly reduced dimensionality. The new features (components or latent variables) are typically linear or non-linear combinations of the original features.
The primary objective is to minimize information loss during this transformation, ensuring that the new, lower-dimensional representation still captures the essential patterns and variance of the original data.
Why Reduce the Number of Features?
Reducing the number of features offers several significant benefits, especially in machine learning and data analysis:
Reducing computational load: Fewer features mean faster training times for models and less memory usage, making algorithms more efficient, particularly for large datasets.
Removing irrelevant and redundant features: Eliminating features that do not contribute to the prediction target or are highly correlated with other features can simplify the model and improve its robustness.
Mitigating the curse of dimensionality: In high-dimensional spaces, data points become sparse, making it difficult for models to find meaningful patterns. Dimensionality reduction helps to combat this sparsity.
Potentially improving model performance: By focusing on the most informative features and reducing noise, models can generalize better to unseen data and achieve higher accuracy.
Enhanced data visualization: High-dimensional data is impossible to visualize directly. Reducing dimensions to two or three allows for meaningful plots and insights.
Approaches to Dimensionality Reduction
1. Principal Component Analysis (PCA)
Characteristics: An unsupervised technique, meaning it does not use class labels during the reduction process. PCA identifies new orthogonal dimensions (principal components) along which the data varies the most. It projects data in the direction of the largest variance, seeking to explain the maximum amount of variance in the data with the fewest features.
Example Applications: Widely used in image compression, noise reduction, and face recognition using eigenfaces, where the principal components represent the most significant variations in facial images (e.g., Turk et al.'s work).
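A hedged sketch of the eigenfaces idea using scikit-learn follows; the dataset, the number of components, and the whitening choice are assumptions for illustration rather than the lecture's exact setup:

    from sklearn.datasets import fetch_lfw_people
    from sklearn.decomposition import PCA

    # Labeled Faces in the Wild: each row of .data is a flattened face image
    faces = fetch_lfw_people(min_faces_per_person=70, resize=0.4)
    X = faces.data  # shape (n_samples, n_pixels)

    # The leading principal components ("eigenfaces") capture the dominant
    # variations across face images; 150 components is an arbitrary choice.
    pca = PCA(n_components=150, whiten=True).fit(X)
    eigenfaces = pca.components_.reshape((150, *faces.images.shape[1:]))
    print(pca.explained_variance_ratio_.sum())  # fraction of variance retained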
2. Linear Discriminant Analysis (LDA)
Characteristics: A supervised technique that explicitly uses class labels to find a projection that maximizes the separability between different classes while minimizing the spread within each class. It aims to find the optimal linear transformation that best discriminates between known classes.
3. Kernel PCA (KPCA)
Characteristics: An extension of PCA that enables non-linear dimensionality reduction. KPCA uses kernel functions (e.g., radial basis function, polynomial kernel) to implicitly map the data into a higher-dimensional feature space, where it can then apply standard PCA to find non-linear principal components. This makes it suitable for complex datasets where linear separation or variance maximization is insufficient.
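As a minimal sketch of non-linear reduction with Kernel PCA, the example below uses a concentric-circles toy dataset and an RBF kernel; the gamma value and dataset are assumptions chosen for illustration:

    from sklearn.datasets import make_circles
    from sklearn.decomposition import KernelPCA, PCA

    # Two concentric circles: no single linear direction separates them
    X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

    # Standard (linear) PCA cannot unfold this structure
    X_pca = PCA(n_components=2).fit_transform(X)

    # Kernel PCA with an RBF kernel implicitly maps the data into a
    # higher-dimensional space where the two circles become separable
    X_kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10).fit_transform(X)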
Detailed Explanation of PCA
PCA achieves dimensionality reduction by:
Rotating the dataset such that the new, rotated features (principal components) become statistically uncorrelated. This means that each principal component captures unique variance that is not explained by the others.
Selecting features based on their importance from the rotated subset. The importance is determined by the amount of variance each principal component explains, with the first principal component explaining the most variance, the second the second most, and so on.
2D Example Analysis
Consider a dataset involving age and height: if these two variables are highly correlated (e.g., taller individuals tend to be older in a specific group), projecting them onto their principal components can simplify the representation.
Raw age and height values may be hard to interpret or visualize as separate features when their combined effect is more informative. By projecting the data onto the principal components, we obtain a new axis (a linear combination of age and height) along which the data points are most spread out, making the underlying structure more apparent; a small sketch of this example follows.
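Here is a minimal sketch of that 2D example with synthetic, hypothetical age and height values; it shows that the projected components are uncorrelated and that the first component carries most of the variance:

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    age = rng.normal(40, 10, size=200)
    height = 100 + 1.5 * age + rng.normal(0, 5, size=200)  # correlated with age
    X = np.column_stack([age, height])

    pca = PCA(n_components=2)
    Z = pca.fit_transform(X)

    print(np.corrcoef(X.T)[0, 1])         # strong correlation on the original axes
    print(np.round(np.cov(Z.T), 3))       # near-diagonal: components are uncorrelated
    print(pca.explained_variance_ratio_)  # the first component dominates

(In practice the features would be standardized first, as described in the PCA steps below.)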
Determining the Number of Principal Components
Find the direction of maximum variance for the first principal component. This direction (eigenvector) captures the most spread in the data.
Identify the orthogonal direction (at 90 degrees to the first) representing the maximum remaining information (variance) in higher dimensions. This ensures that each principal component captures distinct information.
Repeat the process for subsequent components, always selecting an orthogonal direction that explains the most remaining variance, until all dimensions or a desired number of components are captured.
PCA Steps
Standardize the original d-dimensional dataset. This step is crucial to ensure that features with larger numerical ranges do not disproportionately influence the principal components, giving all features equal weighting.
Construct the covariance matrix from the standardized dataset. The covariance matrix quantifies how strongly each pair of features varies together. For a d-dimensional dataset, it is a d \times d symmetric matrix.
Decompose the covariance matrix into eigenvectors and eigenvalues. Eigenvectors represent the directions (principal components) along which the data has the most variance, and eigenvalues represent the magnitude of this variance along those directions.
Sort eigenvalues in decreasing order to rank their corresponding eigenvectors. Eigenvectors associated with larger eigenvalues are the principal components that capture more variance in the data and are thus more important.
Select k eigenvectors corresponding to the k-largest eigenvalues (where k \le d). The choice of k (the target dimensionality) can be determined by methods such as examining a scree plot or selecting components that explain a cumulative percentage of total variance (e.g., 95%).
Construct a projection matrix W from the top k eigenvectors. This matrix W (of shape d \times k) acts as a transformation matrix.
Transform the original dataset X (shape n \times d) into a new k-dimensional feature subspace using the projection matrix W. The new dataset X' (shape n \times k) is the lower-dimensional representation of the original data, where each column is a principal component (X' = XW).
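The steps above can be followed literally in NumPy. The sketch below assumes a random hypothetical dataset and k = 2; in practice a library implementation such as scikit-learn's PCA would typically be used:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))              # hypothetical n x d dataset (n=100, d=5)

    # 1. Standardize each feature to zero mean and unit variance
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)

    # 2. Covariance matrix (d x d)
    cov = np.cov(X_std.T)

    # 3. Eigen decomposition (eigh is appropriate for symmetric matrices)
    eigvals, eigvecs = np.linalg.eigh(cov)

    # 4. Sort eigenvalues and their eigenvectors in decreasing order
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # 5. Choose k, e.g. by inspecting the cumulative explained variance
    cum_var = np.cumsum(eigvals) / eigvals.sum()
    k = 2

    # 6. Projection matrix W (d x k) from the top-k eigenvectors
    W = eigvecs[:, :k]

    # 7. Transform: X' = XW is the n x k lower-dimensional representation
    X_proj = X_std @ W
    print(X_proj.shape, np.round(cum_var[:k], 3))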
Variance and Covariance Relations
Variance measures how much a single variable deviates from its mean:
V_x = \frac{\Sigma(x_i - \bar{x})^2}{n}
Covariance measures the extent to which two variables change together. A positive covariance indicates that the variables tend to increase or decrease simultaneously, while a negative covariance suggests an inverse relationship:
V_{xy} = \frac{\Sigma(x_i - \bar{x})(y_i - \bar{y})}{n}
Construction of Covariance Matrix
Covariance matrices characterize the variance within each feature and the covariance between every pair of features, providing a complete picture of their linear relationships.
For two variables X and Y, the covariance matrix is symmetrical and shows:
\begin{bmatrix} Var(X) & Cov(X,Y) \\ Cov(Y,X) & Var(Y) \end{bmatrix}
Where: Var(X) = 1.67, Var(Y) = 6.67, Cov(X,Y) = 3.33 (example values).
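These example values can be reproduced with NumPy under the assumption of the hypothetical samples x = [1, 2, 3, 4] and y = [2, 4, 6, 8]; note that np.cov divides by n - 1 by default, which is the convention that yields 1.67, 6.67, and 3.33 here:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0])
    y = np.array([2.0, 4.0, 6.0, 8.0])

    # 2 x 2 covariance matrix: [[Var(x), Cov(x, y)], [Cov(y, x), Var(y)]]
    C = np.cov(x, y)
    print(np.round(C, 2))
    # [[1.67 3.33]
    #  [3.33 6.67]]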
Eigen Decomposition
Eigenvalues and eigenvectors are fundamental to PCA computations as they represent the directions and magnitudes of variance in the data.
The relationship is expressed as: Ax = \lambda x
Where A is a square matrix (typically the covariance matrix), x is a non-zero eigenvector, and \lambda (lambda) is the eigenvalue associated with that eigenvector.
The process involves finding eigenvalues by solving the characteristic polynomial det(A - \lambda I) = 0, where I is the identity matrix. Each eigenvalue represents the amount of variance captured by its corresponding eigenvector (principal component).
Example of Eigenvalues
For a given matrix A = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}, the calculated eigenvalues are \lambda = 3 and \lambda = 1. These values indicate the magnitude of variance explained along the directions specified by their respective eigenvectors.
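These eigenvalues follow directly from the characteristic polynomial of the previous section:

det(A - \lambda I) = \begin{vmatrix} 2 - \lambda & 1 \\ 1 & 2 - \lambda \end{vmatrix} = (2 - \lambda)^2 - 1 = (\lambda - 3)(\lambda - 1) = 0

giving \lambda = 3 and \lambda = 1, with corresponding (unnormalized) eigenvectors [1, 1]^T and [1, -1]^T.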
Use of LDA (Linear Discriminant Analysis)
LDA focuses on supervised dimensionality reduction by maximizing the separability between classes. Unlike PCA, which seeks directions of maximum variance irrespective of class labels, LDA aims to find a projection that makes classes as distinct as possible.
It achieves this by constructing:
S_B (between-class scatter matrix): Measures the dispersion between the mean vectors of different classes. A larger S_B indicates greater separation between classes.
S_W (within-class scatter matrix): Measures the total scatter of samples within each class, relative to their respective class means. A smaller S_W indicates that samples within each class are tightly clustered.
LDA Steps
Standardize the dataset. Similar to PCA, this step ensures that features contribute equally to the distance metrics.
Compute mean vectors for each class. These mean vectors (centroids) are crucial for calculating the between-class scatter.
Construct scatter matrices:
S_B = \Sigma_c \, n_c (\bar{x}_c - \bar{x})(\bar{x}_c - \bar{x})^T
S_W = \Sigma_c \Sigma_{i \in c} (x_i - \bar{x}_c)(x_i - \bar{x}_c)^T
These matrices quantify the variance between classes and within classes, respectively, where \bar{x}_c and n_c are the mean vector and sample count of class c, and \bar{x} is the overall mean.
Compute the eigenvalues/eigenvectors of the matrix S_W^{-1} S_B. This matrix represents the ratio of between-class scatter to within-class scatter, and its eigenvectors define the linear discriminants that best separate the classes.
Sort the eigenvalues in decreasing order and select the eigenvectors corresponding to the top k eigenvalues. The number of components k for LDA is at most c - 1 (where c is the number of classes) or the number of features, whichever is smaller.
Construct transformation matrix W from the selected eigenvectors into the feature subspace.
Project samples using this transformation matrix to obtain the new, lower-dimensional dataset where class separability is maximized.
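The LDA steps above can be sketched in NumPy as follows; the Iris data, the use of a pseudo-inverse for S_W, and k = 2 are assumptions for illustration, and scikit-learn's LinearDiscriminantAnalysis wraps the same idea:

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.preprocessing import StandardScaler

    X, y = load_iris(return_X_y=True)
    X = StandardScaler().fit_transform(X)          # 1. standardize
    d, classes = X.shape[1], np.unique(y)
    overall_mean = X.mean(axis=0)

    S_W = np.zeros((d, d))                         # within-class scatter
    S_B = np.zeros((d, d))                         # between-class scatter
    for c in classes:
        X_c = X[y == c]
        mean_c = X_c.mean(axis=0)                  # 2. class mean vector
        S_W += (X_c - mean_c).T @ (X_c - mean_c)   # 3. scatter matrices
        diff = (mean_c - overall_mean).reshape(-1, 1)
        S_B += len(X_c) * diff @ diff.T

    # 4. Eigen decomposition of S_W^{-1} S_B (pinv guards against a singular S_W)
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(S_W) @ S_B)

    # 5./6. Sort by eigenvalue and build W from the top k = 2 discriminants
    order = np.argsort(eigvals.real)[::-1]
    W = eigvecs[:, order[:2]].real

    # 7. Project the samples onto the discriminant subspace
    X_lda = X @ W
    print(X_lda.shape)                             # (150, 2)

For everyday use, sklearn.discriminant_analysis.LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y) performs an equivalent reduction.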
Conclusion
The lecture covered PCA, LDA, and KPCA in depth, focusing on their distinctions, applications, and step-by-step methodologies for dimensionality reduction in machine learning. These techniques are indispensable for handling high-dimensional data, improving model efficiency, and enhancing generalization.