Comprehensive Notes on Fair Principal Component Analysis (Fair PCA)

Standard PCA Formulation: Principal Component Analysis is formulated as a reconstruction error minimization problem:
* Objective: $\min_{U} \|X - XUU^T\|_F^2$
* Constraint: $U^T U = I$
Interpretation and Process:
* Data is projected onto a lower-dimensional subspace using the projection matrix $U$.
* The data is then reconstructed as $XUU^T$.
* The goal is to minimize the squared Frobenius norm ($\|\cdot\|_F^2$) of the difference between the original data and its reconstruction.
Equivalent Mathematical Interpretation:
* The problem is equivalent to maximizing the variance captured in the subspace: $\max_{U} \operatorname{tr}(U^T X^T X U)$, subject to the same constraint.
* The optimal directions are the top eigenvectors of $X^T X$, which is proportional to the covariance matrix of $X$ when the columns of $X$ are centered.
* PCA identifies the directions (eigenvectors) that represent the maximum variance in the dataset.
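
The equivalence between minimizing reconstruction error and maximizing captured variance can be checked numerically. The following is a minimal NumPy sketch, not a reference implementation: the helper names `pca_top_r` and `reconstruction_error` and the random test data are assumptions made for illustration.

```python
import numpy as np

def pca_top_r(X, r):
    """Return the top-r eigenvectors of X^T X as the columns of U (d x r)."""
    # eigh is meant for symmetric matrices; eigenvalues come back in ascending order
    eigvals, eigvecs = np.linalg.eigh(X.T @ X)
    order = np.argsort(eigvals)[::-1][:r]
    return eigvecs[:, order]

def reconstruction_error(X, U):
    """Squared Frobenius norm of X - X U U^T."""
    return np.linalg.norm(X - X @ U @ U.T, "fro") ** 2

# Hypothetical data: 50 samples (rows), 4 features (columns), centered
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
X = X - X.mean(axis=0)

U = pca_top_r(X, r=2)
total = np.trace(X.T @ X)              # total (unnormalized) variance
captured = np.trace(U.T @ X.T @ X @ U)  # variance kept in the subspace

# Reconstruction error and captured variance sum to the total
print(np.isclose(reconstruction_error(X, U), total - captured))
```

Because the total is fixed by the data, lowering the reconstruction error and raising the captured variance are the same optimization.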

Standard Treatment vs. Reality: Classical PCA treats the entire dataset as a single, homogeneous population. However, real-world data is often composed of distinct demographic groups.
Common Groups and Labels:
* Gender: Male and Female.
* Race: Group A and Group B.
Data Splitting Notation: The dataset $X$ is split by group membership, with each row a sample:
* $X = \begin{pmatrix} X_A \\ X_B \end{pmatrix}$, where $X_A$ has $n_A$ rows, $X_B$ has $n_B$ rows, and $n = n_A + n_B$.
* $X_A$: Represents the privileged group (Group A).
* $X_B$: Represents the unprivileged or harmed group (Group B).

Average Reconstruction Error for Group A ($RE_{X_A}(U)$):
* $RE_{X_A}(U) = \frac{1}{n_A} \|X_A - X_A UU^T\|_F^2$
Average Reconstruction Error for Group B ($RE_{X_B}(U)$):
* $RE_{X_B}(U) = \frac{1}{n_B} \|X_B - X_B UU^T\|_F^2$
Definition of Unfairness: PCA is considered unfair if there is a significant discrepancy in how well groups are represented:
* $RE_{X_A}(U) < RE_{X_B}(U)$
* In this scenario, Group A is reconstructed better than Group B. Group B loses more information during the dimensionality reduction process, which constitutes the core fairness issue.

Disparity Calculation ($D_{X_B, X_A}(U)$):
* $D_{X_B, X_A}(U) = \frac{1}{n_B} \|X_B - X_B UU^T\|_F^2 - \frac{1}{n_A} \|X_A - X_A UU^T\|_F^2$
Interpreting Disparity Values:
* $D > 0$: Group B is worse off than Group A (Unfair).
* $D = 0$: Perfectly fair representation; both groups suffer equal information loss.
* $D < 0$: Roles are reversed; Group A is worse off than Group B.
Fairness Goal: The ultimate objective is to achieve zero disparity ( $D = 0$ ) and equal reconstruction error ( $RE_{X_A} \approx RE_{X_B}$ ), ensuring a fair subspace projection where no group loses significantly more information than another.
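
The per-group error and the signed disparity translate almost directly into code. This is a small sketch under stated assumptions: the function names `avg_reconstruction_error` and `disparity`, and the four illustrative points, are hypothetical and not part of the notes.

```python
import numpy as np

def avg_reconstruction_error(X_g, U):
    """(1 / n_g) * ||X_g - X_g U U^T||_F^2 for one group (rows are samples)."""
    n_g = X_g.shape[0]
    return np.linalg.norm(X_g - X_g @ U @ U.T, "fro") ** 2 / n_g

def disparity(X_B, X_A, U):
    """Signed disparity D = RE_B - RE_A; positive means Group B is worse off."""
    return avg_reconstruction_error(X_B, U) - avg_reconstruction_error(X_A, U)

# Hypothetical groups: A lies on the x-axis, B lies on the y-axis
X_A = np.array([[1.0, 0.0], [2.0, 0.0]])
X_B = np.array([[0.0, 1.0], [0.0, 1.0]])

U_x = np.array([[1.0], [0.0]])  # project onto the x-axis
U_y = np.array([[0.0], [1.0]])  # project onto the y-axis

print(disparity(X_B, X_A, U_x))  # positive: B loses more than A
print(disparity(X_B, X_A, U_y))  # negative: roles reversed
```

Keeping the sign, rather than taking an absolute value, matches the interpretation above in which $D < 0$ means the roles are reversed.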

Alignment Bias: PCA finds the direction maximizing overall variance across the combined dataset. If the distribution of Group A is highly spread out in a specific direction while Group B is spread in a different direction, the PCA direction will naturally align with the group contributing the most to the global variance (typically the privileged or larger group).
Resulting Disparity: Group A points remain "well spread" and maintain low error, while Group B points are "compressed" into the subspace, resulting in high reconstruction error.

Dataset Definition:
* Group A (Horizontal Spread): Points are $(2, 0)$, $(3, 0)$, and $(4, 0)$. Resulting matrix $X_A = \begin{pmatrix} 2 & 3 & 4 \\ 0 & 0 & 0 \end{pmatrix}^T$.
* Group B (Vertical Spread): Points are $(0, 1)$, $(0, 2)$, and $(0, 3)$. Resulting matrix $X_B = \begin{pmatrix} 0 & 0 & 0 \\ 1 & 2 & 3 \end{pmatrix}^T$.
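Combined Dataset Analysis (working with the uncentered matrix $X^T X$ that appears in the PCA objective):
* The x-coordinates are $2, 3, 4$ for Group A and $0, 0, 0$ for Group B, so the sum of squares along x is $4 + 9 + 16 = 29$.
* The y-coordinates are $0, 0, 0$ for Group A and $1, 2, 3$ for Group B, so the sum of squares along y is $1 + 4 + 9 = 14$.
* No point has both coordinates nonzero, so the cross term vanishes and $X^T X = \begin{pmatrix} 29 & 0 \\ 0 & 14 \end{pmatrix}$.
* Because the spread is larger in the horizontal direction ($29 > 14$), PCA chooses the first principal component as $u_1 = [1, 0]^T$.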
Reconstruction and Error Calculation:
* Projection: Projecting onto the x-axis means all y-components are discarded.
* Group A Reconstruction: Since all points are already on the x-axis, the projection is lossless. $RE_{X_A}(U) = 0$.
* Group B Reconstruction: Points $(0, 1), (0, 2), (0, 3)$ collapse to the origin $(0, 0)$. The total squared error is $1^2 + 2^2 + 3^2 = 1 + 4 + 9 = 14$.
* Average Error for B: $RE_{X_B}(U) = \frac{14}{3} \approx 4.67$.
Disparity Result:
* $D = \frac{14}{3} - 0 \approx 4.67$. This indicates strong unfairness: Group B bears all of the reconstruction error, whereas Group A bears none.
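
The worked example is small enough to reproduce in a few lines. The sketch below assumes rows are samples and uses the uncentered matrix $X^T X$, as in the objective; the helper name `avg_re` is hypothetical.

```python
import numpy as np

# Toy dataset from the notes: rows are samples, columns are (x, y)
X_A = np.array([[2.0, 0.0], [3.0, 0.0], [4.0, 0.0]])
X_B = np.array([[0.0, 1.0], [0.0, 2.0], [0.0, 3.0]])
X = np.vstack([X_A, X_B])

# Uncentered second-moment matrix; here it equals diag(29, 14)
G = X.T @ X

# eigh returns eigenvalues in ascending order, so the last column is the top direction
eigvals, eigvecs = np.linalg.eigh(G)
U = eigvecs[:, [-1]]  # shape (2, 1); the sign of the eigenvector is arbitrary

def avg_re(M, U):
    """Average reconstruction error (1 / rows) * ||M - M U U^T||_F^2."""
    return np.linalg.norm(M - M @ U @ U.T, "fro") ** 2 / M.shape[0]

re_A = avg_re(X_A, U)  # 0: Group A already lies on the x-axis
re_B = avg_re(X_B, U)  # 14 / 3: Group B collapses to the origin
D = re_B - re_A

print(G)
print(re_A, re_B, D)
```

The eigenvector may come back as $[1, 0]^T$ or $[-1, 0]^T$; the projection $UU^T$, and therefore every error above, is the same in both cases.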

The Optimization Problem: For a fixed weighting factor $\lambda \in [0, 1]$, the Fair PCA objective is:
* $\min_{U} J(U) = \lambda RE_X(U) + (1 - \lambda) D_{X_B, X_A}(U)$
* Subject to $U^T U = I$.
* Here $RE_X(U) = \frac{1}{n} \|X - XUU^T\|_F^2$ is the average reconstruction error over the full dataset.
The Main Result: The optimal projection $U$ is formed by the top $r$ eigenvectors of the fairness-adjusted matrix $\hat{C}$, defined as:
* $\hat{C} = \lambda \frac{1}{n} X^T X + (1 - \lambda) \left( \frac{1}{n_B} X_B^T X_B - \frac{1}{n_A} X_A^T X_A \right)$
Components of $\hat{C}$:
* Standard PCA term ($\lambda \cdot \frac{X^T X}{n}$): Preserves overall variance across the full dataset.
* Fairness correction term ($(1 - \lambda) \left( \frac{X_B^T X_B}{n_B} - \frac{X_A^T X_A}{n_A} \right)$): Favors directions that carry more of Group B's variance than Group A's, which lowers $RE_{X_B}$ relative to $RE_{X_A}$.
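
The main result suggests a very short implementation: build $\hat{C}$ and take its top $r$ eigenvectors. The following is a sketch under assumptions, not the authors' code: the function name `fair_pca`, the row-wise stacking of the two groups, and the random test data are all illustrative choices.

```python
import numpy as np

def fair_pca(X_A, X_B, lam, r):
    """Return (U, C_hat), where U holds the top-r eigenvectors of C_hat.

    Assumes rows are samples and that X is X_A stacked on top of X_B.
    """
    n_A, n_B = X_A.shape[0], X_B.shape[0]
    X = np.vstack([X_A, X_B])
    n = n_A + n_B

    pca_term = (X.T @ X) / n                                  # overall variance
    fairness_term = (X_B.T @ X_B) / n_B - (X_A.T @ X_A) / n_A  # group gap
    C_hat = lam * pca_term + (1.0 - lam) * fairness_term

    # C_hat is symmetric but can be indefinite; eigh still applies
    eigvals, eigvecs = np.linalg.eigh(C_hat)
    order = np.argsort(eigvals)[::-1][:r]
    return eigvecs[:, order], C_hat

# Hypothetical data: two groups with 5 features each
rng = np.random.default_rng(0)
X_A = rng.normal(size=(40, 5))
X_B = rng.normal(size=(25, 5))

U_fair, C_hat = fair_pca(X_A, X_B, lam=0.6, r=2)
U_std, _ = fair_pca(X_A, X_B, lam=1.0, r=2)  # lam = 1 reduces to standard PCA
print(U_fair.shape, np.allclose(U_fair.T @ U_fair, np.eye(2)))
```

Since the fairness term is a difference of two positive semidefinite matrices, $\hat{C}$ can have negative eigenvalues; sorting by signed eigenvalue (not by magnitude) is what "top $r$" means here.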

Step 1: Expand the Objective: Use the PCA reconstruction identity for any data matrix $M$ (valid when $U^T U = I$):
* $\|M - MUU^T\|_F^2 = \operatorname{tr}(MM^T) - \operatorname{tr}(U^T M^T M U)$
* Applying this yields errors for the full dataset ($RE_X$), Group A ($RE_A$), and Group B ($RE_B$).
Step 2: Substitute into $J(U)$: Substitute the expansions into the combined objective and distribute the signs:
* $J(U) = \lambda \left( \frac{1}{n} \operatorname{tr}(XX^T) - \frac{1}{n} \operatorname{tr}(U^T X^T X U) \right) + (1 - \lambda) \left( RE_B - RE_A \right)$
Step 3: Separate Variables: Collect all terms that do not involve $U$ into a single scalar constant $\mu$:
* $\mu = \lambda \frac{1}{n} \operatorname{tr}(XX^T) + (1 - \lambda) \left( \frac{1}{n_B} \operatorname{tr}(X_B X_B^T) - \frac{1}{n_A} \operatorname{tr}(X_A X_A^T) \right)$
* Objective becomes: $J(U) = \mu - \lambda \frac{1}{n} \operatorname{tr}(U^T X^T X U) - (1 - \lambda) \left( \frac{1}{n_B} \operatorname{tr}(U^T X_B^T X_B U) - \frac{1}{n_A} \operatorname{tr}(U^T X_A^T X_A U) \right)$
Step 4: Combine Trace Terms: By the linearity of the trace, group the $U$-dependent terms together:
* $J(U) = \mu - \operatorname{tr}(U^T \hat{C} U)$
* Minimizing $J(U)$ is equivalent to maximizing $\operatorname{tr}(U^T \hat{C} U)$.
Step 5: Solve for Eigenvectors: This problem now has the same form as classical PCA, with $\hat{C}$ in place of $X^T X$. Applying Lagrange multipliers:
* Maximize $u_1^T \hat{C} u_1$ subject to $u_1^T u_1 = 1$.
* This leads to the eigenvalue problem: $\hat{C} u_1 = \mu_1 u_1$, where $\mu_1$ is the largest eigenvalue of $\hat{C}$ (not to be confused with the constant $\mu$ from Step 3).
* To find $U$, we stack the top $r$ eigenvectors of $\hat{C}$ as its columns.
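
The derivation can be sanity-checked numerically: for any orthonormal $U$, the directly computed $J(U)$ should equal $\mu - \operatorname{tr}(U^T \hat{C} U)$, and the eigenvector solution should give a value of $J$ no larger than any other orthonormal $U$. The sketch below uses hypothetical random data and an assumed helper `avg_re`.

```python
import numpy as np

rng = np.random.default_rng(1)
X_A = rng.normal(size=(30, 5))   # hypothetical Group A
X_B = rng.normal(size=(20, 5))   # hypothetical Group B
X = np.vstack([X_A, X_B])
n_A, n_B, n = X_A.shape[0], X_B.shape[0], X.shape[0]
lam, r = 0.7, 2

def avg_re(M, U):
    """Average reconstruction error (1 / rows) * ||M - M U U^T||_F^2."""
    return np.linalg.norm(M - M @ U @ U.T, "fro") ** 2 / M.shape[0]

def J(U):
    """Fair PCA objective computed directly from its definition."""
    return lam * avg_re(X, U) + (1 - lam) * (avg_re(X_B, U) - avg_re(X_A, U))

# Normalized second-moment matrices; tr(M M^T) = tr(M^T M)
S, S_A, S_B = X.T @ X / n, X_A.T @ X_A / n_A, X_B.T @ X_B / n_B
C_hat = lam * S + (1 - lam) * (S_B - S_A)
mu = lam * np.trace(S) + (1 - lam) * (np.trace(S_B) - np.trace(S_A))

# A random orthonormal U (5 x 2) from a reduced QR factorization
U_rand, _ = np.linalg.qr(rng.normal(size=(5, r)))

# The eigenvector solution from Step 5
eigvals, eigvecs = np.linalg.eigh(C_hat)
U_star = eigvecs[:, np.argsort(eigvals)[::-1][:r]]

print(np.isclose(J(U_rand), mu - np.trace(U_rand.T @ C_hat @ U_rand)))
print(J(U_star) <= J(U_rand))
```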

Computational Efficiency: The problem remains an eigenvalue problem, so no iterative optimization over $U$ is required. A single eigendecomposition of the symmetric matrix $\hat{C}$ is sufficient.
Tunability: The parameter $\lambda$ allows for a smooth interpolation between standard PCA (when $\lambda = 1$ ) and pure disparity minimization (when $\lambda = 0$ ).
Interpretability: The fairness-adjusted matrix $\hat{C}$ has a clear dual structure of global variance preservation combined with a group disparity correction mechanism.
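
To illustrate the tunability property, the sketch below sweeps $\lambda$ on the toy dataset from the worked example with $r = 1$ and reports the resulting signed disparity. The helper names `avg_re` and `fair_direction` and the chosen $\lambda$ values are assumptions for illustration.

```python
import numpy as np

# Toy dataset from the worked example: rows are samples
X_A = np.array([[2.0, 0.0], [3.0, 0.0], [4.0, 0.0]])
X_B = np.array([[0.0, 1.0], [0.0, 2.0], [0.0, 3.0]])
X = np.vstack([X_A, X_B])
n_A, n_B, n = X_A.shape[0], X_B.shape[0], X.shape[0]

def avg_re(M, U):
    """Average reconstruction error (1 / rows) * ||M - M U U^T||_F^2."""
    return np.linalg.norm(M - M @ U @ U.T, "fro") ** 2 / M.shape[0]

def fair_direction(lam):
    """Top eigenvector of C_hat for the given lam, as a (2, 1) matrix."""
    C_hat = lam * (X.T @ X) / n + (1 - lam) * (
        (X_B.T @ X_B) / n_B - (X_A.T @ X_A) / n_A
    )
    _, vecs = np.linalg.eigh(C_hat)  # ascending order: last column is the top one
    return vecs[:, [-1]]

results = {}
for lam in (1.0, 0.5, 0.0):
    U = fair_direction(lam)
    results[lam] = avg_re(X_B, U) - avg_re(X_A, U)  # signed disparity D
    print(f"lambda={lam:.1f}  D={results[lam]:+.3f}")
```

On this dataset $\lambda = 1$ reproduces the standard PCA result $D = 14/3$, while $\lambda = 0$ selects the y-axis and gives $D = -29/3$. Because $D$ is signed, minimizing it can overshoot past $D = 0$ and favor Group B instead, so in practice $\lambda$ would need to be chosen with the resulting disparity in view.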