Notes on Principal Component Analysis and Upcoming Labs Discussion
Upcoming Labs
- Two or three additional labs planned:
- Linear Regression Lab
- Logistic Regression Lab
- Possible Time Series Lab (pending class interest)
Discussion on Fads and Gimmicks
- Examples of fads:
- Silly bands
- Fidget spinners
- Pokémon cards - viewed as gambling/scams
- Exploration of health fads:
- Thoughts on cold plunges and the placebo effect
Dimensionality Reduction Overview
- Concept of dimensionality and feature representation:
- High-dimensional data rarely fills out the whole space; it often lies near a lower-dimensional structure
- Fitting functions to describe data can simplify models
Principal Component Analysis (PCA)
- Purpose: To find a linear model to fit data, creating new features to reduce dimensionality with minimal loss of information.
- Key Components of PCA:
- Standardization:
- Mean-centering and scaling features so they are on a comparable scale.
- Standardization formula:
x_{\text{standardized}} = \frac{x - \bar{x}}{s}
where \bar{x} is the feature mean and s is its standard deviation.
- Variance and Covariance:
- Variance measures spread around the mean.
- Covariance indicates how two variables vary together.
- Formula for covariance between features:
Cov(X, Y) = E[(X - E[X])(Y - E[Y])]
- Related to correlation: correlation is covariance scaled by the two standard deviations, so it is bounded between -1 and 1.
- Covariance captures both the direction and the magnitude of the relationship between two variables.
- Feature Correlation:
- Highly correlated features are largely redundant: the second feature adds little new predictive power.
- Covariance Matrix:
- A matrix representation of covariances among features.
- Helps to understand how each feature correlates with others.
- Steps in PCA:
- Compute the covariance matrix of the standardized data, then find its eigenvalues and eigenvectors (a code sketch follows this list).
- Eigenvectors: directions of maximal variance (the principal axes).
- Eigenvalues: the amount of variance along each eigenvector's direction, indicating that component's importance.
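A minimal NumPy sketch of the steps above (standardize, build the covariance matrix, take eigenvalues and eigenvectors); the small two-feature array X is made up purely for illustration:

```python
import numpy as np

# Made-up two-feature dataset, just for illustration.
X = np.array([[2.5, 2.4],
              [0.5, 0.7],
              [2.2, 2.9],
              [1.9, 2.2],
              [3.1, 3.0],
              [2.3, 2.7]])

# 1. Standardize: subtract the mean and divide by the standard deviation.
X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# 2. Covariance matrix of the standardized features.
cov = np.cov(X_std, rowvar=False)

# 3. Eigendecomposition (eigh suits symmetric matrices like a covariance matrix).
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 4. Sort by eigenvalue, largest (most variance) first.
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

print(eigenvalues)         # variance captured along each principal axis
print(eigenvectors[:, 0])  # direction of the first principal component
```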
Mathematical Concepts:
- Matrix Representation:
- Matrices used extensively to represent data and transformations.
- PCA involves matrix decomposition techniques like Singular Value Decomposition (SVD) for practical computations.
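A rough sketch of how SVD stands in for the explicit covariance matrix in practice (same made-up X as above, only centered here); the singular values relate to the covariance eigenvalues by lambda_i = s_i^2 / (n - 1):

```python
import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])
X_centered = X - X.mean(axis=0)

# Thin SVD of the centered data; the rows of Vt are the principal directions.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Singular values map to covariance eigenvalues: lambda_i = s_i**2 / (n - 1).
n = X_centered.shape[0]
explained_variance = S**2 / (n - 1)

print(Vt)                  # principal directions, one per row
print(explained_variance)  # matches what the covariance eigendecomposition gives
```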
Resulting Features:
- Principal components (PCs) are linear combinations of the original features.
- Focus on the components that capture significant variation.
- PCA is an unsupervised method, meaning it doesn’t consider the output labels during transformation.
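To make "linear combinations of the original features" concrete, here is a small sketch that keeps only the first principal component and projects the centered data onto it; the variable names are illustrative only:

```python
import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])
X_centered = X - X.mean(axis=0)

eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X_centered, rowvar=False))
order = np.argsort(eigenvalues)[::-1]
W = eigenvectors[:, order[:1]]   # keep only the top principal component

# Each new feature is a weighted sum of the original (centered) features.
X_pca = X_centered @ W           # shape: (n_samples, 1)
print(X_pca)
```

Note that the output labels never appear anywhere in this computation, which is what makes PCA unsupervised.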
Dimensionality and Noise:
- PCA aims to retain features that capture significant variance while discarding less informative noise.
- PCA produces as many components as there are original features, but typically only the top few (those with the largest eigenvalues) are kept, greatly reducing dimensionality while preserving most of the information.
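One common way to decide how many components to keep is the cumulative share of explained variance; the eigenvalues and the 95% threshold below are made-up example choices:

```python
import numpy as np

eigenvalues = np.array([3.0, 1.5, 0.3, 0.15, 0.05])  # made-up, sorted descending

explained_ratio = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(explained_ratio)   # roughly [0.60, 0.90, 0.96, 0.99, 1.00]

# Smallest k whose components cover at least 95% of the variance.
k = int(np.searchsorted(cumulative, 0.95)) + 1
print(k)  # 3 for this example
```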
Practical Implications:
- PCA is used in machine learning to simplify a dataset while retaining the informative structure that supports prediction and analysis.
- Eigen concepts: Important for defining component axes in PCA; eigenvectors determine the direction of principal components, while eigenvalues indicate their significance in terms of explained variance.
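In practice the whole pipeline is usually a few library calls; a sketch with scikit-learn (the random data and the choice of three components are arbitrary):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))               # made-up dataset: 100 samples, 10 features

X_std = StandardScaler().fit_transform(X)    # mean-center and scale each feature

pca = PCA(n_components=3)                    # keep the top 3 principal components
X_reduced = pca.fit_transform(X_std)

print(X_reduced.shape)                       # (100, 3)
print(pca.explained_variance_ratio_)         # share of variance captured by each PC
```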
Conclusion:
- PCA simplifies high-dimensional data, aids visualization and model training, and improves algorithm efficiency by reducing redundancy and increasing interpretability.