Dimensionality Reduction Study Notes

Dimensionality Reduction

Overview of Content
  • Dimensionality Reduction
  • Filtering-based feature selection
  • Information gain filtering
  • Wrapper-based feature selection
  • Forward selection
  • Backward elimination
  • Recursive feature elimination (RFE)
  • Principal Components Analysis (PCA)
Definition
  • Dimensionality Reduction: The process involves condensing dataset information while preserving essential characteristics. The primary objective is to reduce algorithmic complexity and enhance model performance by removing redundant or irrelevant features without significant information loss.
    • Relation to feature selection: feature selection is a special case, equivalent to projecting the feature space onto the axis-aligned subspace that drops the removed features, whereas dimensionality reduction more generally allows other projections, such as PCA, which represents data as linear combinations of the original features.
Conceptual Understanding of Dimensionality
  • Dimensionality: Refers to the number of features or attributes in the dataset.
    • Datasets can have a very large number of features.
    • Example in a text corpus: each distinct word can be a feature (bag-of-words model; see the sketch after this list).
    • Example in image datasets: each pixel of a 1024 × 768 image can be treated as a feature.
    • Goal: Reduce the number of features while minimizing information loss, so that clustering, classification, etc. still perform well.
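As an illustration of how quickly dimensionality grows, here is a minimal bag-of-words sketch using scikit-learn's CountVectorizer (assumed available; the toy corpus is invented for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus: in a bag-of-words model, every distinct word in the
# corpus becomes one feature, so dimensionality grows with vocabulary.
docs = [
    "the quick brown fox",
    "the lazy dog sleeps",
    "quick brown dogs run fast",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

print(X.shape)                             # (3 documents, 10 distinct words)
print(vectorizer.get_feature_names_out())  # the 10 word-features
```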
Importance of Dimensionality Reduction
  • The feature space may be sparsely populated. For example, in a text corpus, an individual word may appear in only a small subset of documents. Consequences:
    • ML models perform poorly on sparse feature spaces, because they rely on statistical counts of observations in the various regions of the feature space.
    • As dimensionality increases, there are fewer observations per region, producing the Curse of Dimensionality: the number of training examples needed grows exponentially with the number of features.
    • Other reasons include the presence of irrelevant or redundant features (highly correlated features) and the desire to visualize high-dimensional data.

Methods of Dimensionality Reduction

Overview of Methods
Two Broad Categories
  1. Feature Selection:
    • Select a subset of features from the original set.
    • Example: For spam email classification, keep an informative feature such as the number of spam words over a weaker one such as the time of day the email was sent.
  2. Feature Extraction:
    • Define a new set of features, smaller than the original set, derived from it.
    • Example: Given marks in 8 subjects, most of the variation may be captured by 3 new dimensions – Science, Social Science, Arts.
Supervised vs Unsupervised Feature Selection
  • Supervised Methods: Utilize both feature values and class labels of data points.
    • Techniques are focused on selecting a subset of original features based on scores assigned to each feature.
  • Unsupervised Methods: Utilize only feature values, not class labels.
    • These methods project the data points onto lower-dimensional space while retaining maximum information (e.g., PCA minimizes reconstruction error).

Feature Selection Approach

Filtering-based Feature Selection
  • Each feature is scored independently of any model, e.g., by information gain with respect to the class label, and the feature set is filtered down to the top-scoring subset (see the sketch below).
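Information gain filtering (listed in the overview) is often computed as mutual information between a feature and the class label. A minimal sketch using scikit-learn's SelectKBest with mutual_info_classif (the Iris data and k=2 are arbitrary choices for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_iris(return_X_y=True)

# Filter approach: score every feature once against the class label
# (mutual information here), then keep the k best -- no model is
# trained as part of the selection.
selector = SelectKBest(score_func=mutual_info_classif, k=2)
X_reduced = selector.fit_transform(X, y)

print(selector.scores_)   # one relevance score per original feature
print(X_reduced.shape)    # (150, 2): filtered down to 2 features
```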
Wrapper-based Feature Selection
  • Repeatedly calls a learning method, scoring candidate feature subsets by the performance of the predictive model built from them.
State Space for Feature Selection
  • State: Set of features.
  • Start State: Empty (in forward selection) or full (in backward elimination).
  • Operators: Add or remove a feature; candidate states are scored by training or validation accuracy.
Forward and Backward Selection
  • Forward Selection
    • Start with an empty set and iteratively add the feature that most improves the model score, stopping when no addition improves it (see the sketch after this list).
  • Backward Elimination
    • Start with all features and iteratively remove the least significant one, stopping when removal degrades the score.
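A minimal forward-selection sketch, assuming scikit-learn; the Wine data, 5-fold cross-validation, and the scaled logistic-regression scorer are illustrative choices, and any estimator could stand in:

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

selected, remaining = [], list(range(X.shape[1]))
best_score = 0.0

# Greedily add the single feature that most improves cross-validated
# accuracy; stop as soon as no candidate improves the score.
while remaining:
    scores = {f: cross_val_score(model, X[:, selected + [f]], y, cv=5).mean()
              for f in remaining}
    f_best, s_best = max(scores.items(), key=lambda kv: kv[1])
    if s_best <= best_score:
        break
    selected.append(f_best)
    remaining.remove(f_best)
    best_score = s_best

print(selected, round(best_score, 3))
```

Backward elimination is the mirror image: start with `remaining` full, and drop the feature whose removal hurts the score least until any removal degrades it.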
Recursive Feature Elimination (RFE)
  • Train a model on all features.
  • Rank features by coefficient magnitude (or another importance measure).
  • Iteratively remove the lowest-ranked features, retraining after each removal, until the desired number of features remains or performance starts to degrade (see the sketch below).
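scikit-learn provides this loop as sklearn.feature_selection.RFE; a short sketch (the estimator and n_features_to_select=5 are illustrative assumptions):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)  # scaling helps convergence

# Fit the model, rank features by |coefficient|, drop the weakest,
# refit, and repeat until n_features_to_select remain.
rfe = RFE(estimator=LogisticRegression(max_iter=1000),
          n_features_to_select=5, step=1)
rfe.fit(X, y)

print(rfe.support_)   # boolean mask of surviving features
print(rfe.ranking_)   # 1 = kept; larger ranks were eliminated earlier
```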

Principal Component Analysis (PCA)

Overview
  • PCA: A fundamental technique for extracting the most informative features and transforming high-dimensional data into a manageable low-dimensional form.
    • It identifies orthogonal axes called principal components (PCs) capturing maximum variance upon projection.
PCA Algorithm
  1. Compute the mean of the data points.
  2. Mean-center the data: compute $x_i - \bar{x}$ for each point.
  3. Compute the covariance matrix: $S = \frac{1}{n} (X - \bar{X})^T (X - \bar{X})$
  4. Perform eigenvalue decomposition: $S = V \,\text{diag}(\lambda_1, \dots, \lambda_M)\, V^T$
  5. Select the top $m$ eigenvectors corresponding to the largest eigenvalues (these are the principal components).
  6. Create the projection matrix: $U = [v_1, v_2, \dots, v_m]$
  7. Project the data onto the lower-dimensional space: $Y = U^T X$
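The seven steps map directly onto a few lines of NumPy. A sketch, assuming samples are stored as rows (so step 7's $Y = U^T X$ becomes `X @ U`), with the random data and m=2 invented for illustration:

```python
import numpy as np

def pca(X, m):
    """Project rows of X (n samples x d features) onto the top-m PCs."""
    # Steps 1-2: compute the mean and mean-center the data.
    X_centered = X - X.mean(axis=0)
    # Step 3: covariance matrix S (d x d).
    S = (X_centered.T @ X_centered) / X.shape[0]
    # Step 4: eigendecomposition; eigh returns eigenvalues ascending,
    # so reverse to put the largest first.
    eigvals, eigvecs = np.linalg.eigh(S)
    # Steps 5-6: the top-m eigenvectors form the projection matrix U.
    U = eigvecs[:, ::-1][:, :m]
    # Step 7: project onto the m-dimensional subspace.
    return X_centered @ U

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
print(pca(X, m=2).shape)   # (100, 2)
```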
Key Properties of PCA
  • Dimensionality reduction implies some degree of information loss; PCA minimizes this loss by choosing the projection that minimizes reconstruction error (equivalently, maximizes retained variance).
  • Each principal component accounts for as much of the remaining data variability as possible, and the components are mutually orthogonal, hence uncorrelated.
  • The number of principal components $m$ can be chosen based on desired variance retention (e.g., 90% or 95%); see the sketch below.
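A sketch of picking $m$ from a variance-retention target with scikit-learn's PCA (the digits data and the 95% threshold are illustrative):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)

# Fit all components, then keep the smallest m whose cumulative
# explained-variance ratio reaches the 95% target.
pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)
m = int(np.searchsorted(cumulative, 0.95) + 1)
print(f"{m} of {X.shape[1]} components retain 95% of the variance")
```

scikit-learn can also do this internally: passing a fraction, e.g., PCA(n_components=0.95), selects the smallest number of components whose cumulative explained-variance ratio reaches that threshold.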
Covariance and Variance
  • Covariance: A measure of how two variables change together; it can be positive or negative depending on the correlation direction.
  • Covariance Matrix: An $M \times M$ matrix where each entry $c_{ij}$ measures the covariance between features $i$ and $j$ (a quick numeric check follows below).
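A quick numeric check of this definition with NumPy (the random data is illustrative; note np.cov defaults to the unbiased 1/(n-1) normalization, so bias=True is passed to match the 1/n form used in the PCA section):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))   # 200 samples, M = 3 features

# Entry c_ij is the covariance between features i and j; the diagonal
# holds the per-feature variances. bias=True gives the 1/n form.
C = np.cov(X, rowvar=False, bias=True)
X_c = X - X.mean(axis=0)
print(C.shape)                                  # (3, 3)
print(np.allclose(C, X_c.T @ X_c / len(X)))     # True
```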
Geometric Interpretation and Eigenvectors
  • PCA involves rotating the coordinate system to align the axes with maximum data variation; this involves the eigenvalues and eigenvectors of the covariance matrix.
  • Eigenvectors correspond to directions of maximum variance; eigenvalues indicate the amount of variance along their associated eigenvector direction.