DS

Notes on Principal Component Analysis and Upcoming Labs Discussion

  • Upcoming Labs

    • Two or three additional labs planned:
    • Linear Regression Lab
    • Logistic Regression Lab
    • Possible Time Series Lab (tentative, depending on class interest)
  • Discussion on Fads and Gimmicks

    • Examples of fads:
    • Silly bands
    • Fidget spinners
    • Pokémon cards - viewed as gambling/scams
    • Exploration of health fads:
    • Thoughts on cold plunges and the placebo effect
  • Dimensionality Reduction Overview

    • Concept of dimensionality and feature representation:
    • Real data rarely fills the full high-dimensional space; it often lies near a lower-dimensional structure
    • Fitting a simple function (e.g., a line or plane) to that structure can simplify models (see the small sketch after this list)
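
    A small illustration of the point above, assuming NumPy (the toy data and numbers are mine, not from the lecture): two features driven by one underlying factor do not fill the 2-D plane; they cluster along a line, so a single fitted direction describes most of the data.

        import numpy as np

        rng = np.random.default_rng(0)

        # One hidden factor drives both observed features (plus a little noise),
        # so the 2-D point cloud hugs a line instead of filling the plane.
        t = rng.normal(size=500)
        x1 = 2.0 * t + rng.normal(scale=0.1, size=500)
        x2 = -1.0 * t + rng.normal(scale=0.1, size=500)
        X = np.column_stack([x1, x2])

        # A correlation close to -1 confirms the points lie near a single direction.
        print(np.corrcoef(X, rowvar=False))
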
  • Principal Component Analysis (PCA)

    • Purpose: fit a linear model to the data and use it to create new features (components) that reduce dimensionality with minimal loss of information.
    • Key Components of PCA:
    • Standardization:
      • Mean-centering each feature and scaling it to unit variance so all features are on a comparable scale.
      • Standardization formula:
        x_{\text{standardized}} = \frac{x - \bar{x}}{s}
        where \bar{x} is the feature mean and s is the standard deviation.
    • Variance and Covariance:
      • Variance measures spread around the mean.
      • Covariance indicates how two variables vary together.
      • Formula for covariance between features:
        Cov(X, Y) = E[(X - E[X])(Y - E[Y])]
      • Related to correlation: dividing the covariance by the two standard deviations gives the correlation, which is bounded between -1 and 1.
      • Covariance captures both the direction and the strength of the relationship between two variables.
    • Feature Correlation:
      • Highly correlated features are largely redundant: a feature that closely tracks another adds little new predictive power.
    • Covariance Matrix:
      • A matrix representation of covariances among features.
      • Helps to understand how each feature correlates with others.
    • Steps in PCA (see the worked sketch after this list):
      • Standardize the data, compute the covariance matrix, and find its eigenvalues and eigenvectors.
      • Eigenvectors: Directions of greatest variance in the data.
      • Eigenvalues: Amount of variance along each eigenvector's direction, indicating that component's importance.
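
    A minimal sketch of the steps above using NumPy; the toy data matrix X (rows = samples, columns = features) and all variable names are illustrative, not from the lecture.

        import numpy as np

        rng = np.random.default_rng(1)
        # Toy data: 200 samples of 3 correlated features (illustrative only).
        X = rng.normal(size=(200, 3)) @ np.array([[2.0, 0.5, 0.1],
                                                  [0.0, 1.0, 0.3],
                                                  [0.0, 0.0, 0.2]])

        # 1. Standardization: mean-center each feature and scale it to unit variance.
        X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

        # 2. Covariance matrix of the standardized features (3 x 3 here).
        cov = np.cov(X_std, rowvar=False)

        # 3. Eigen-decomposition: eigenvectors give the directions of highest variance,
        #    eigenvalues give the amount of variance along each of those directions.
        eigvals, eigvecs = np.linalg.eigh(cov)

        # Sort components from most to least variance so the first one matters most.
        order = np.argsort(eigvals)[::-1]
        eigvals, eigvecs = eigvals[order], eigvecs[:, order]

        print("fraction of variance per component:", eigvals / eigvals.sum())
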
  • Mathematical Concepts:

    • Matrix Representation:
    • Matrices used extensively to represent data and transformations.
    • PCA involves matrix decomposition techniques like Singular Value Decomposition (SVD) for practical computations.
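
    A short continuation of the earlier sketch showing the SVD route: applied to the standardized matrix X_std, the rows of Vt are the principal directions, and S**2 / (n - 1) matches the covariance-matrix eigenvalues, so both computations agree.

        # SVD of the standardized data: X_std = U @ np.diag(S) @ Vt.
        U, S, Vt = np.linalg.svd(X_std, full_matrices=False)

        n = X_std.shape[0]
        print("eigenvalues via SVD:", S ** 2 / (n - 1))
        print("principal directions (one per row):")
        print(Vt)
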
  • Resulting Features:

    • Principal components (PCs) are linear combinations of the original features.
    • Focus on the components that capture significant variation.
    • PCA is an unsupervised method, meaning it doesn’t consider the output labels during transformation.
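
    Continuing the same sketch: each principal component is just a linear combination of the standardized original features (X_std and eigvecs come from the snippet above; keeping k = 2 components is an arbitrary choice for illustration).

        k = 2  # keep the two components with the largest eigenvalues

        # Each new feature (column of Z) is a weighted sum of the original standardized
        # features; the weights are the entries of the corresponding eigenvector.
        Z = X_std @ eigvecs[:, :k]
        print(Z.shape)  # (200, 2): same samples, fewer features

        # No target labels are used anywhere here: the transformation is unsupervised.
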
  • Dimensionality and Noise:

    • PCA aims to retain features that capture significant variance while discarding less informative noise.
    • PCA itself produces as many components as there are original features; dimensionality is reduced by keeping only the leading components (largest eigenvalues), which retain most of the variance (see the snippet after this list).
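
    One common rule of thumb for choosing how many components to keep, continuing from the eigenvalues computed earlier (the 95% threshold below is illustrative, not from the lecture): retain the smallest number of components whose cumulative explained variance passes the threshold.

        explained = eigvals / eigvals.sum()
        cumulative = np.cumsum(explained)

        # Keep the smallest number of components whose cumulative explained variance
        # reaches the threshold; what is left over is treated as noise.
        threshold = 0.95
        k = int(np.searchsorted(cumulative, threshold) + 1)
        print("cumulative explained variance:", cumulative, "-> keep", k, "components")
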
  • Practical Implications:

    • PCA is used in machine learning to simplify a dataset while retaining its informative aspects, which helps prediction and analysis (see the usage sketch after this list).
    • Eigen concepts: Important for defining component axes in PCA; eigenvectors determine the direction of principal components, while eigenvalues indicate their significance in terms of explained variance.
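
    For day-to-day work, scikit-learn's PCA packages these steps; a hedged usage sketch (the random toy data and the 0.95 variance threshold are illustrative choices, not from the lecture).

        import numpy as np
        from sklearn.decomposition import PCA
        from sklearn.pipeline import make_pipeline
        from sklearn.preprocessing import StandardScaler

        # Toy feature matrix; in a real lab this would be the lab's dataset.
        X = np.random.default_rng(2).normal(size=(200, 10))

        # Standardize, then keep however many components explain 95% of the variance.
        pipeline = make_pipeline(StandardScaler(), PCA(n_components=0.95))
        X_reduced = pipeline.fit_transform(X)

        print(X_reduced.shape)
        print(pipeline.named_steps["pca"].explained_variance_ratio_)
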
  • Conclusion:

    • PCA simplifies high-dimensional data, aids visualization and model training, and improves algorithm efficiency by reducing redundancy and increasing interpretability.