11.4 Principal Component Analysis
Context and purpose
PCA is a dimension reduction technique for data sets with many variables (high-dimensional data).
It is useful when we want to reduce the number of variables but cannot simply drop variables without losing important information.
It is an unsupervised data mining technique that helps summarize data with a smaller set of representative variables (principal components) that explain most of the variability.
PCA can be used for visualization and pre-processing before applying other unsupervised methods (e.g., clustering).
PCA can also be used in supervised learning (e.g., linear regression) to substitute a large set of predictors with a smaller number of principal components.
Important caveat: PCA is sensitive to the scale of the variables; standardize them before applying PCA so that the components capture genuine structure rather than differences in scale.
Key concepts and terminology
Principal components (PCs): uncorrelated, weighted linear combinations of the original (standardized) variables that capture maximal variance in order (PC1 captures the most variance, PC2 the second most, etc.).
Loadings (weights): coefficients that define each PC as a linear combination of the original standardized variables. The loading matrix collects these weights across PCs and variables.
Scores: the coordinates of observations on the PC axes (i.e., the transformed data in the PC space).
Orthogonality: PCs are uncorrelated with one another.
Variance explained: each PC explains a portion of the total variance; typically the first few PCs explain most of the variance.
Rotation of axes: after PCA, the data can be viewed from a rotated coordinate system where the axes align with the directions of greatest variance; the data themselves do not change, only the axes do.
Mathematical foundations (formulas)
Let X be the data matrix with n observations and k original variables.
Standardization of variables:
z_{ik} = \frac{x_{ik} - \bar{x}_k}{s_k}, where \bar{x}_k is the mean of variable k and s_k is its standard deviation. Form the standardized data matrix Z of size n × k.
Compute the covariance matrix (or correlation matrix if variables are standardized):
S = \frac{1}{n-1} Z^T Z.
Eigen decomposition of the covariance matrix:
S v_m = \lambda_m v_m, \quad m = 1, \dots, k, with eigenvalues ordered \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_k. The principal component loadings are the eigenvectors; the weights of the m-th PC across variables are the elements of v_m.
Principal component scores (transformed coordinates) for observation i on component m:
PC_{im} = \sum_{j=1}^{k} z_{ij} \, v_m^{(j)}.
Equivalently, if W is the matrix of eigenvectors (loadings) with columns v_m, then the score matrix is T = Z W.
Variance explained by PC m:
\text{Var}(PC_m) = \lambda_m.
Proportion of total variance explained by PC m:
\text{PVE}_m = \dfrac{\lambda_m}{\sum_{j=1}^{k} \lambda_j}.
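A minimal NumPy sketch of these steps (standardization, covariance matrix, eigendecomposition, scores, and PVE); the function name and data layout are assumptions for illustration, not from the source:

```python
import numpy as np

def pca_from_scratch(X):
    """Hand-rolled PCA following the formulas above: standardize, build the
    covariance matrix of the standardized data, eigendecompose, and score."""
    # Standardize each variable: z_ik = (x_ik - mean_k) / s_k
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

    n = X.shape[0]
    S = (Z.T @ Z) / (n - 1)              # S = (1/(n-1)) Z^T Z

    # Eigen decomposition; eigh returns ascending eigenvalues, so reorder
    # to get lambda_1 >= lambda_2 >= ... >= lambda_k
    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]
    lambdas, W = eigvals[order], eigvecs[:, order]

    T = Z @ W                            # scores: T = Z W
    pve = lambdas / lambdas.sum()        # proportion of variance explained
    return W, T, lambdas, pve
```

As a sanity check, the columns of W should match library implementations (e.g., scikit-learn's components_) up to sign flips, since eigenvector signs are arbitrary.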
Two-variable example (Fat and Protein) – conceptual insights
In a simplified example with Fat and Protein, a single principal component can capture most of the variability between the two variables.
Loadings for the two-variable case (example values):
PC1 loadings: v_1^{(\text{Fat})} = -\tfrac{1}{\sqrt{2}} \approx -0.7071, \; v_1^{(\text{Protein})} = \tfrac{1}{\sqrt{2}} \approx 0.7071.
PC2 loadings: v_2^{(\text{Fat})} = \tfrac{1}{\sqrt{2}} \approx 0.7071, \; v_2^{(\text{Protein})} = \tfrac{1}{\sqrt{2}} \approx 0.7071.
The first principal component (PC1) captures the majority of the variation between Fat and Protein; the second (PC2) captures the remaining variation and is uncorrelated with PC1.
Data for the two variables can be replaced by PC1 with little loss of information if PC1 explains most of the variance. PC2 then accounts for the residual variance and is orthogonal to PC1.
Practical note: in two-variable cases, there are at most two principal components; standardization is still recommended because PCA is sensitive to the scale of the original variables. A small sketch of this two-variable case follows below.
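For concreteness, here is a hedged sketch of the two-variable case with scikit-learn; the Fat/Protein values are invented purely for illustration, and eigenvector signs may flip between runs or libraries:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Illustrative Fat/Protein values (not from the source data)
X = np.array([[10.0, 3.0],
              [ 5.0, 8.0],
              [ 8.0, 4.5],
              [ 3.0, 9.5],
              [ 6.0, 6.0]])

Z = StandardScaler().fit_transform(X)     # standardize before PCA
pca = PCA(n_components=2).fit(Z)

print(pca.components_)                # loadings (rows = PCs); roughly +/- 1/sqrt(2)
print(pca.explained_variance_ratio_)  # PC1 should carry most of the variance
print(pca.transform(Z))               # scores of each observation on PC1 and PC2
```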
Interpretation and visualization aspects
Rotation interpretation: the axes are rotated to a new coordinate system defined by PC1, PC2, etc.; the relative positions of observations do not change, only the axes used to describe them.
PC1 corresponds to the direction with the maximum total variance; PC2 corresponds to the direction of second-largest variance and is orthogonal to PC1, and so on.
The first few PCs often capture a large portion of the total variance, enabling dimensionality reduction by keeping only the first few components.
Uncorrelated PCs help mitigate multicollinearity in downstream analyses.
Standardization before PCA is recommended because PCA is sensitive to scale; without standardization, variables with larger scales may dominate the PCs (see the sketch below).
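A quick way to see the scale sensitivity is to compare the explained-variance split with and without standardization; the two variables below are made up solely to contrast very different scales:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two made-up variables on very different scales
income = rng.normal(50_000, 10_000, size=200)   # large scale
rate   = rng.normal(0.05, 0.02, size=200)       # tiny scale
X = np.column_stack([income, rate])

raw = PCA().fit(X).explained_variance_ratio_
std = PCA().fit(StandardScaler().fit_transform(X)).explained_variance_ratio_

print("unscaled:", raw)   # PC1 near 100%: the large-scale variable dominates
print("scaled:  ", std)   # variance split reflects the actual structure
```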
World Bank health data example (illustrative context)
Dataset includes country-level health and population measures for 38 countries, such as:
Death rate per 1,000 people (Death Rate, in %)
Health expenditure per capita (Health Expend, in US$)
Life expectancy at birth (Life Exp, in years)
Male adult mortality rate per 1,000 male adults (Male Mortality)
Female adult mortality rate per 1,000 female adults (Female Mortality)
Annual population growth (Population Growth, in %)
Female population (Female Pop, in %)
Male population (Male Pop, in %)
Total population (Total Pop)
Size of labor force (Labor Force)
Fertility rate (Fertility Rate, in births per woman) and birth rate per 1,000 people (Birth Rate)
The example illustrates applying PCA to real-world, multidimensional health data to identify principal components that summarize health and population indicators for many countries.
Practical implementation notes (tools and workflows mentioned)
Excel: used to compute and visualize the variance distribution, component weights, and scores for the example.
Python (Colab): steps include a) Variance Distribution, b) Principal Component Weights, c) Principal Component Scores.
R: summary results (importance of components) and PCA weights; viewing R’s principal components (PC1, PC2, PC3, etc.).
The emphasis is on understanding how PCs capture variance and how scores/weights are interpreted across tools.
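A hedged sketch of what the Python steps (variance distribution, weights, scores) might look like with pandas and scikit-learn; the file name and index column are placeholders, not taken from the source materials:

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Placeholder file/column names; substitute the actual World Bank extract
df = pd.read_csv("world_bank_health.csv", index_col="Country")

# a) Variance distribution
Z = StandardScaler().fit_transform(df)
pca = PCA().fit(Z)
print(pca.explained_variance_ratio_.cumsum())

# b) Principal component weights (loadings): one column per PC
weights = pd.DataFrame(pca.components_.T,
                       index=df.columns,
                       columns=[f"PC{i+1}" for i in range(pca.n_components_)])
print(weights.round(3))

# c) Principal component scores: one row per country
scores = pd.DataFrame(pca.transform(Z),
                      index=df.index,
                      columns=weights.columns)
print(scores.head())
```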
Example outputs and interpretations (conceptual descriptions)
PC loadings indicate how much each original variable contributes to a given PC.
PC scores indicate the projection of each observation onto the PC axes.
A rotation of axes aligns axes with directions of maximum variance; data values themselves remain the same, only their coordinate representation changes.
In larger datasets with many variables, selecting the first few PCs can retain most of the information while reducing dimensionality and avoiding multicollinearity in downstream analyses.
Summary takeaways
PCA is a powerful unsupervised method for reducing dimensionality by transforming correlated variables into a smaller set of uncorrelated principal components that capture most of the data’s variability.
Standardization is a crucial preprocessing step to ensure that all variables contribute appropriately to the PCs.
The key outputs are PC loadings (weights), PC scores (component values for observations), and the proportion of variance explained by each PC.
PCs are linear combinations of standardized variables; they are ordered by the amount of variance they explain; the first few PCs often suffice for practical purposes.
Connections to broader data mining principles
PCA reduces dimensionality while preserving as much information as possible under linear transformations.
It complements other unsupervised methods (e.g., clustering) and can simplify supervised learning pipelines by reducing feature space and mitigating multicollinearity.
While PCA helps summarize data, it does not imply causation; interpretation of PCs requires examining loadings and context.
Notation recap
Original data: X = [x_{ik}] \in \mathbb{R}^{n \times k}
Standardized data: Z = [z_{ik}] \in \mathbb{R}^{n \times k}, \quad z_{ik} = \frac{x_{ik} - \bar{x}_k}{s_k}
Covariance matrix: S = \frac{1}{n-1} Z^T Z
Eigen decomposition: S v_m = \lambda_m v_m
Principal component scores: PC_{im} = \sum_{j=1}^{k} z_{ij} v_m^{(j)}
Score matrix: T = Z W, where W = [v_1, v_2, \dots, v_k]; PC_m explains variance \lambda_m with proportion \lambda_m / \sum_{j=1}^{k} \lambda_j.
Quick recap of key conventions
PC1: direction of maximum variance; PC2: orthogonal direction of second-most variance; etc.
PCs are uncorrelated by construction.
The more PCs kept, the higher the explained variance but greater dimensionality; typically, a small number of PCs capture most of the information.
Practical implications and ethical/philosophical notes
Using PCA transforms the data into a new feature space; interpretability of PCs can be challenging because each PC is a mix of original variables.
The transformation is linear; nonlinear structures may not be captured well (kernel PCA is a nonlinear alternative; see the sketch at the end of this section).
Decisions based on PCA should consider whether the reduced dimensions retain the aspects of the data that are relevant to the task, especially in sensitive applications.
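As one illustration of the nonlinear alternative mentioned above, scikit-learn's KernelPCA can be swapped in where linear PCA falls short; this sketch uses synthetic concentric circles purely to show the API and is not tied to the examples in this section:

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

# Synthetic nonlinear structure (two concentric circles)
X, _ = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

linear = PCA(n_components=2).fit_transform(X)                      # linear projection
kernel = KernelPCA(n_components=2, kernel="rbf", gamma=10).fit_transform(X)

# The linear projection is essentially a rotation of the circles, while the
# RBF kernel map separates the inner and outer rings along the first component.
print(linear[:3])
print(kernel[:3])
```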