PCA (Principal Component Analysis) - Video Notes Vocabulary Flashcards
What is PCA?
- Principal Component Analysis (PCA) is a dimensionality reduction technique used in data mining, machine learning, and deep learning to reduce the size of a dataset by discarding the directions (components) that contribute little meaningful information to the problem.
- It summarizes data with a smaller set of uncorrelated variables called principal components (PCs) that collectively explain most of the variability in the data.
- PCA is an unsupervised data mining technique, meaning it does not require target labels. However, it can support supervised learning, either by preprocessing the predictors with PCA or, when labels are missing, by pairing PCA with other unsupervised steps (e.g., clustering on the PC scores) to create them.
- Practical purpose: reduce dimensionality to train models more efficiently while retaining the information that matters for the problem.
Key Concepts and Notation
- Let there be n observations (records) and p variables (features). Denote the variables as x1, x2, …, xp. The independent variables are commonly written as X and the dependent/target variable as y. Principal components are denoted PC1, PC2, …, PCm (m ≤ p).
- Aim: transform the original p variables into a smaller number m of principal components that retain most of the information (variance) in the data.
- PCs are computed from the eigenstructure of the covariance (or correlation) matrix of the data.
- Important property: PCs are uncorrelated (orthogonal) to each other: Cov(PCi, PCj) = 0 for i ≠ j.
How PCA Works (Conceptual View)
- Step 1: Standardize the data (essential whenever the variables are on different scales).
- Step 2: Compute the covariance matrix of the standardized data.
- Step 3: Find eigenvalues and eigenvectors of the covariance matrix.
- Step 4: Sort eigenvalues in descending order; corresponding eigenvectors form the loading vectors (weights) for the PCs.
- Step 5: Project the standardized data onto the eigenvectors to obtain principal component scores.
- Step 6: Decide how many PCs to keep based on explained variance (and a cumulative variance criterion).
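To make these six steps concrete, here is a minimal NumPy sketch that follows them end to end; the small data matrix is synthetic and exists only for illustration:

```python
import numpy as np

# Synthetic data: 6 observations, 3 variables (values invented for illustration)
X = np.array([[2.5, 60.0, 1.2],
              [3.0, 55.0, 1.0],
              [1.8, 72.0, 1.5],
              [2.2, 65.0, 1.3],
              [3.5, 50.0, 0.9],
              [2.9, 58.0, 1.1]])

# Step 1: standardize (zero mean, unit variance per column)
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix of the standardized data (n-1 denominator)
C = np.cov(Z, rowvar=False)

# Step 3: eigenvalues and eigenvectors (eigh handles symmetric matrices)
eigenvalues, eigenvectors = np.linalg.eigh(C)

# Step 4: sort in descending order of eigenvalue
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Step 5: project the standardized data onto the eigenvectors -> PC scores
T = Z @ eigenvectors

# Step 6: explained and cumulative variance guide how many PCs to keep
explained = eigenvalues / eigenvalues.sum()
print(explained, np.cumsum(explained))
```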
Mathematical Foundations (Key Equations)
- Data standardization (Z-score) for a variable X with mean μ and standard deviation σ:
Z = \frac{X - \mu}{\sigma}
- Covariance matrix for data matrix X (n observations, p variables):
\mathbf{C} = \frac{1}{n-1} (\mathbf{X} - \bar{\mathbf{X}})^T (\mathbf{X} - \bar{\mathbf{X}})
- Eigen decomposition of the covariance matrix:
\mathbf{C}\,\mathbf{v}_i = \lambda_i \mathbf{v}_i
where \lambda_i are the eigenvalues (variances explained by each PC) and \mathbf{v}_i are the corresponding eigenvectors (loadings).
- Principal component scores: if the standardized data for an observation is \mathbf{z} and the loading vector (eigenvector) is \mathbf{w}, then the score on that PC is
t = \mathbf{z}^T \mathbf{w}
or, for all observations at once, in matrix form
\mathbf{T} = \mathbf{Z} \mathbf{W}
where \mathbf{T} contains the PC scores and \mathbf{W} contains the selected eigenvectors as columns.
- Proportion of variance explained by PC_j (explained variance):
\text{ExplainedVar}(PC_j) = \frac{\lambda_j}{\sum_{i=1}^{p} \lambda_i}
- Cumulative explained variance for the first m PCs:
\text{CumulativeVar}(m) = \sum_{j=1}^{m} \text{ExplainedVar}(PC_j) = \sum_{j=1}^{m} \frac{\lambda_j}{\sum_{i=1}^{p} \lambda_i}
- Interpretation: the first PC accounts for the maximum variation in the data, the second PC accounts for the remaining variation orthogonal to the first, and so on.
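As a quick check that these formulas match what scikit-learn reports, the eigenvalues λ_i of the covariance matrix should equal PCA's explained_variance_, and λ_j / Σ λ_i should equal explained_variance_ratio_. A small sketch on random data (used purely for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
Z = rng.normal(size=(100, 5))   # stand-in for already-standardized data

# Manual route: eigenvalues of the covariance matrix (n-1 denominator)
eigvals = np.sort(np.linalg.eigvalsh(np.cov(Z, rowvar=False)))[::-1]

# scikit-learn route
pca = PCA().fit(Z)

print(np.allclose(eigvals, pca.explained_variance_))                        # True
print(np.allclose(eigvals / eigvals.sum(), pca.explained_variance_ratio_))  # True
```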
Two-Variable Example (Candy Bar Nutrition: fat x1 and protein x2)
- Setup: visualize data points with fat (x1) on the x-axis and protein (x2) on the y-axis.
- PCA goal in this view: identify the line (principal component axis) along which the data show maximum variability.
- PC1 (purple line in the example) captures the direction with the largest variance and thus carries most information about the data structure.
- PC2 (blue line) is orthogonal to PC1 and captures less variability/information.
- Intuition: by rotating the coordinate system to align with the directions of maximum variance, we summarize the data with a few uncorrelated axes (the PCs) that still explain most of the spread.
- After applying PCA, the resulting principal components are uncorrelated (orthogonal) and capture the majority of information.
- Manual intuition (two variables): compute PCs by examining how data vary along potential axes; PC1 aligns with the direction of maximum spread, PC2 with the remaining spread orthogonal to PC1.
- In practice, you compute PC weights (loadings) from eigenvectors and then compute PC scores by combining standardized variable values with those weights.
- For two variables (x1 = fat, x2 = protein), you would have at most PC1 and PC2. Depending on the explained variance, you may keep PC1 (and possibly PC2) for modeling.
- Example insight: if PC1 explains ~80% of the information and PC2 explains ~20%, you could train with PC1 alone for a simpler model with minor information loss, or keep PC1+PC2 if the slight gain in explained variance justifies the extra features.
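A minimal sketch of this two-variable case; the fat/protein values below are invented, so the exact explained-variance split (e.g., 80/20) will differ in practice:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Hypothetical candy-bar data: columns are fat (x1) and protein (x2)
X = np.array([[12.0, 2.0], [15.0, 2.5], [9.0, 1.5],
              [20.0, 3.5], [11.0, 2.2], [17.0, 3.0]])

Z = StandardScaler().fit_transform(X)      # standardize both variables
pca = PCA(n_components=2).fit(Z)

print(pca.explained_variance_ratio_)       # share of variance for PC1 and PC2
scores = pca.transform(Z)                  # PC1 and PC2 scores per candy bar
print(scores[:, 0])                        # PC1 alone may suffice for modeling
```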
Why Standardization Matters in PCA
- PCA is sensitive to the scale of variables. Variables with larger scales can dominate the principal components.
- Standardizing each variable to zero mean and unit variance (Z-scores) ensures that all variables contribute proportionally to the analysis.
- Standardization step used before PCA:
Z = \frac{X - \mu}{\sigma}
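A brief illustration of the scale problem, assuming synthetic data in which one column (think "total population") is on a vastly larger scale than the others: without standardization PC1 is dominated by that column, while standardizing first balances the contributions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
small = rng.normal(0, 1, size=(50, 2))          # two variables on a unit scale
big = rng.normal(0, 1_000_000, size=(50, 1))    # one variable on a huge scale
X = np.hstack([small, big])

# Without standardization, PC1 is essentially just the large-scale variable
print(PCA().fit(X).explained_variance_ratio_)

# With standardization, all three variables contribute proportionally
Z = StandardScaler().fit_transform(X)
print(PCA().fit(Z).explained_variance_ratio_)
```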
From Theory to Practice: Python (Colab) Workflow (High-Level)
- Data source example: World Bank health dataset with country-level health and population measures across 38 countries.
- Typical workflow:
- Load data (e.g., from Excel or CSV) and select relevant sheet/columns.
- Preprocessing: drop non-numeric columns (e.g., country name) or set them as index.
- Standardize data using StandardScaler from sklearn.preprocessing.
- Apply PCA via sklearn.decomposition.PCA and fit_transform the standardized data to obtain PC scores.
- Inspect explained_variance_ratio_ (and explained_variance_) to determine how much variance is captured by each PC and compute the cumulative variance (e.g., using numpy.cumsum).
- Build a table of PC scores with names PC1, PC2, …, PCm and, if desired, associate them with their original identifiers (e.g., country names as index).
- Decide how many PCs to retain (e.g., top 4 PCs capturing ~93% of the information in the World Bank example) and use them as inputs for downstream modeling (regression, classification, clustering, etc.).
- Save results to a file (Excel/CSV) for reporting or further analysis.
- Code notes (high-level, not exhaustive; see the sketch after this list):
- Read data with pandas (e.g., pd.read_excel or pd.read_csv).
- Drop unnecessary columns (e.g., country name) before scaling.
- Use StandardScaler().fit_transform(data) to get standardized data.
- Use PCA(n_components=…) to specify how many components to keep; call .fit_transform(standardized_data) to obtain PC scores.
- Access PC loadings via .components_ and explained variances via .explained_variance_ and .explained_variance_ratio_.
- Build a DataFrame of PC scores with column names PC1, PC2, …, and optionally set an index with the country names.
- To export results, use DataFrame.to_excel or DataFrame.to_csv as appropriate.
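A hedged end-to-end sketch of the workflow above; the file name, sheet, and "Country" column are hypothetical placeholders for the actual World Bank file used in the video:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load the data; file/sheet/column names are hypothetical placeholders
df = pd.read_excel("world_bank_health.xlsx", sheet_name=0)
df = df.set_index("Country")      # keep country names as the index, not a feature

# Standardize all numeric columns
Z = StandardScaler().fit_transform(df)

# Fit PCA (keeping all components for now) and get PC scores
pca = PCA()
scores = pca.fit_transform(Z)

# Variance captured by each PC, and cumulatively
print(pca.explained_variance_ratio_)
print(np.cumsum(pca.explained_variance_ratio_))

# Table of PC scores labeled PC1, PC2, ... with country names as the index
pc_cols = [f"PC{i+1}" for i in range(scores.shape[1])]
pc_df = pd.DataFrame(scores, columns=pc_cols, index=df.index)

# Export for reporting or downstream modeling
pc_df.to_excel("pc_scores.xlsx")
```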
World Bank Health Data: What the Dataset Looks Like
- Country-level health and population measures for 38 countries.
- Variables include:
- Death rate per 1000 people (mortality rate)
- Health expenditure per capita
- Life expectancy at birth
- Male adult mortality rate per 1000
- Female adult mortality rate per 1000
- Female population growth
- Male population growth
- Total population
- (Other health and demographic indicators)
- Goal: reduce dimensionality by identifying a smaller set of principal components that capture the majority of information across these variables.
Interpreting PCA Results (What to Look For)
- Explained variance per component:
- Example: PC1 explains a large portion (e.g., 46.62%), PC2 explains ~24.7%, PC3 ~15.32%, etc. These numbers indicate how much of the total data variance is captured by each PC.
- Cumulative variance: see how many PCs are needed to reach a desired information threshold (e.g., 90% or 95%).
- Example from the World Bank data: the top 4 PCs explain ~93.35% of the information; a 5th PC adds a little more variance, and the remaining PCs can be dropped with diminishing returns.
- PC loadings (weights): the components of the eigenvectors indicate how original variables contribute to each PC. Large absolute loading values indicate stronger contributions.
- PC score table: for each observation (e.g., each country), PC scores indicate where that observation lies along each PC axis. This is useful for downstream modeling and visualization (e.g., clustering countries by PC scores).
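A sketch of how these pieces can be inspected with scikit-learn; the DataFrame below is a synthetic stand-in for the standardized World Bank table, with invented values and only a subset of the column names:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Synthetic stand-in for the World Bank table (values invented)
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(38, 5)),
                  columns=["death_rate", "health_exp", "life_exp",
                           "male_mortality", "female_mortality"])

pca = PCA().fit(StandardScaler().fit_transform(df))

# Explained and cumulative variance per component
ratios = pca.explained_variance_ratio_
print(pd.DataFrame({"explained": ratios, "cumulative": np.cumsum(ratios)},
                   index=[f"PC{i+1}" for i in range(len(ratios))]))

# Loadings: rows are PCs, columns are the original variables;
# large absolute values show which variables drive each PC
loadings = pd.DataFrame(pca.components_, columns=df.columns,
                        index=[f"PC{i+1}" for i in range(len(ratios))])
print(loadings.round(2))
```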
Practical Usage Choices
- Dimensionality decisions:
- Keep enough PCs to retain the desired amount of variance (e.g., 90–95%). This reduces dimensionality while preserving information.
- If a simple model is preferred, use fewer PCs (e.g., PC1–PC4) that capture most variance.
- When to drop original variables:
- The original variables are transformed into PCs and are no longer used directly in the model, unless you map the results back to them for interpretation.
- When PCA is not ideal:
- PCA captures only linear structure (directions of maximum variance); it may miss nonlinear relationships in the data.
- Outliers can distort PCs; robust variants or preprocessing may be necessary.
- Practical implications:
- Reduces memory and computation time for large datasets.
- Can improve model performance by removing noisy or redundant features.
- Helps with visualization in 2D/3D by projecting data onto PC axes.
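One convenient way to apply the "retain ~90–95% of the variance" rule is scikit-learn's option of passing a fraction as n_components; a minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 10))     # synthetic stand-in for a real feature matrix
Z = StandardScaler().fit_transform(X)

# n_components=0.95 keeps the smallest number of PCs whose cumulative
# explained variance reaches at least 95%
pca = PCA(n_components=0.95)
Z_reduced = pca.fit_transform(Z)

print(Z_reduced.shape[1])                   # how many PCs were kept
print(pca.explained_variance_ratio_.sum())  # total variance retained
```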
Summary Takeaways
- PCA replaces many original correlated variables with a smaller set of uncorrelated principal components that capture most of the information in the data.
- The first principal component accounts for the maximum variance; subsequent components account for remaining variance, each orthogonal to the previous ones.
- Standardization is typically essential because PCA is sensitive to variable scales.
- The number of PCs to keep is chosen based on explained variance and the practical needs of the modeling task.
- In practice, PCA is implemented in tools like Python (scikit-learn) and R, and the results are used to train models with fewer predictors while retaining most informational content.
Key Equations (Quick Reference)
- Z-score standardization: Z = \frac{X - \mu}{\sigma}
- Covariance matrix: \mathbf{C} = \frac{1}{n-1} (\mathbf{X} - \bar{\mathbf{X}})^T (\mathbf{X} - \bar{\mathbf{X}})
- Eigen decomposition: \mathbf{C}\,\mathbf{v}_i = \lambda_i \mathbf{v}_i
- PC scores (matrix form): \mathbf{T} = \mathbf{Z} \mathbf{W}
- Individual PC score: t = \mathbf{z}^T \mathbf{w}
- Explained variance for PC_j: \text{ExplainedVar}(PC_j) = \frac{\lambda_j}{\sum_{i=1}^{p} \lambda_i}
- Cumulative explained variance: \text{CumulativeVar}(m) = \sum_{j=1}^{m} \frac{\lambda_j}{\sum_{i=1}^{p} \lambda_i}
- Uncorrelated components: \operatorname{Cov}(PC_i, PC_j) = 0 \quad (i \neq j)