Multivariate Data Analysis - Principal Component Analysis

Variance and Covariance

  • Variance estimates how far a set of numbers is spread out from its mean value.
  • Covariance measures how two variables vary together, indicating whether they increase and decrease in tandem.
  • Covariance is denoted as Cov(X, Y). The formulas are as follows:

Covariance Formulas

  • Given two sets of data:
    • X = \{x_1, \ldots, x_i, \ldots, x_n\}
    • Y = \{y_1, \ldots, y_i, \ldots, y_n\}
  • Covariance can be calculated as:
    • Cov(X, Y) = \frac{\sum_{i=1}^{n} (x_i - \mu_x)(y_i - \mu_y)}{n}
    • Where \mu_x and \mu_y are the means of X and Y respectively.
  • Simplified formula:
    • Cov(X, Y) = \frac{\sum_{i=1}^{n} x_i y_i}{n} - \mu_x \mu_y
  • Covariance is symmetric:
    • Cov(X, Y) = Cov(Y, X)
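The two covariance formulas above give the same value; a minimal sketch with hypothetical sample data:

```python
import numpy as np

# Hypothetical sample data to check the two covariance formulas.
x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 5.0])
n = len(x)
mu_x, mu_y = x.mean(), y.mean()

# Definition: mean of the products of deviations from the means.
cov_def = np.sum((x - mu_x) * (y - mu_y)) / n

# Simplified formula: mean of the products minus the product of the means.
cov_simplified = np.sum(x * y) / n - mu_x * mu_y

print(cov_def, cov_simplified)  # both print 2.75
```

Note that `np.cov` divides by n-1 by default; pass `bias=True` to match the 1/n normalization used here.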

Correlation

  • Correlation is denoted as \rho(X, Y).
  • Formula:
    • \rho(X, Y) = \frac{Cov(X, Y)}{\sigma_x \sigma_y}
    • Where \sigma_x and \sigma_y are the standard deviations of X and Y respectively.
  • Correlation is symmetric:
    • \rho(X, Y) = \rho(Y, X)
  • This quantity is known as the Pearson product-moment correlation coefficient.
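A quick sketch, on hypothetical data, showing that the ratio Cov(X, Y) / (\sigma_x \sigma_y) matches NumPy's built-in Pearson correlation:

```python
import numpy as np

# Hypothetical data to check rho = Cov(X, Y) / (sigma_x * sigma_y).
x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 5.0])

cov = np.mean((x - x.mean()) * (y - y.mean()))  # population covariance (divide by n)
rho = cov / (x.std() * y.std())                 # np.std also divides by n by default

print(rho)
print(np.corrcoef(x, y)[0, 1])  # same value
```

The n vs n-1 normalization cancels in the ratio, so either convention gives the same correlation.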

Variance and Covariance in NumPy

  • A non-singular matrix Mat is created using NumPy:
    Mat = np.asmatrix(np.array([[8, -6, 2], [-6, 7, -4], [2, -4, 1]]))
    print(Mat)
    print(type(Mat))
  • The type of Mat is numpy.matrix.
  • The covariance between the rows of Mat (np.cov treats each row as a variable by default) is calculated using np.cov():
    print(np.cov(Mat))
  • The Pearson correlation coefficients between the rows are calculated using np.corrcoef():
    print(np.corrcoef(Mat))

Covariance Matrix

  • For a data set X = \{x_1, \ldots, x_i, \ldots, x_n\}, where each x_i \in R^2, and
    • X_1 = \{x_1[1], \ldots, x_i[1], \ldots, x_n[1]\}
    • X_2 = \{x_1[2], \ldots, x_i[2], \ldots, x_n[2]\}
  • The covariance matrix C is defined as:
    • C = \begin{bmatrix} Var(X_1) & Cov(X_1, X_2) \\ Cov(X_2, X_1) & Var(X_2) \end{bmatrix}
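The entry-by-entry construction above can be checked against np.cov; a sketch on hypothetical random 2-D data:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 2-D data set: n samples, each x_i in R^2 (one sample per row).
X = rng.normal(size=(100, 2))
X1, X2 = X[:, 0], X[:, 1]

def cov(a, b):
    # Population covariance (divide by n), matching the formula above.
    return np.mean((a - a.mean()) * (b - b.mean()))

# Assemble C entry by entry: variances on the diagonal, covariances off it.
C = np.array([[cov(X1, X1), cov(X1, X2)],
              [cov(X2, X1), cov(X2, X2)]])

# np.cov with bias=True uses the same 1/n normalization;
# rowvar=False says each column is a variable.
print(np.allclose(C, np.cov(X, rowvar=False, bias=True)))  # True
```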

Coordinate Transformations

Translation

  • Origins of frame {A} and {B} do not coincide.
  • Corresponding axes of {A} and {B} are parallel.
  • Relationship between a point P in frame {A} and frame {B}:
    • ^B P = ^A P - ^A O_B
    • Where ^A O_B is the origin of frame {B} expressed in frame {A}.
  • Example:
    • If ^A P = (10, 10) and ^A O_B = (3, 5), then ^B P = (10-3, 10-5) = (7, 5).
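The worked example above as a short NumPy sketch:

```python
import numpy as np

# Translation between frames with parallel axes: ^B P = ^A P - ^A O_B.
# Values taken from the example above.
P_A = np.array([10.0, 10.0])     # point P expressed in frame {A}
O_B_in_A = np.array([3.0, 5.0])  # origin of {B} expressed in {A}

P_B = P_A - O_B_in_A
print(P_B)  # [7. 5.]
```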

Rotation

  • A point P has coordinates (u, v) in frame {A} and (x, y) in frame {B}.
  • The relationship between (u, v) and (x, y) is given by:
    • u = x \cos(\theta) - y \sin(\theta)
    • v = x \sin(\theta) + y \cos(\theta)
  • This can be expressed in matrix form as:
    • \begin{bmatrix} u \\ v \end{bmatrix} = \begin{bmatrix} \cos(\theta) & -\sin(\theta) \\ \sin(\theta) & \cos(\theta) \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix}
  • The rotation matrix from frame {B} to frame {A} is:
    • ^A R_B = \begin{bmatrix} \cos(\theta) & -\sin(\theta) \\ \sin(\theta) & \cos(\theta) \end{bmatrix}
  • The rotation matrix from frame {A} to frame {B} is the transpose of ^A R_B:
    • ^B R_A = (^A R_B)^T = \begin{bmatrix} \cos(\theta) & \sin(\theta) \\ -\sin(\theta) & \cos(\theta) \end{bmatrix}
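A sketch, for an arbitrary assumed angle, verifying that the transpose of the rotation matrix inverts the rotation (rotation matrices are orthogonal, so R^T R = I):

```python
import numpy as np

theta = np.pi / 6  # 30 degrees, arbitrary choice for illustration
# Rotation matrix ^A R_B mapping frame {B} coordinates to frame {A}.
R_AB = np.array([[np.cos(theta), -np.sin(theta)],
                 [np.sin(theta),  np.cos(theta)]])

p_B = np.array([1.0, 0.0])  # coordinates (x, y) of P in frame {B}
p_A = R_AB @ p_B            # coordinates (u, v) of P in frame {A}

# The transpose undoes the rotation: ^B R_A = (^A R_B)^T.
print(np.allclose(R_AB.T @ p_A, p_B))  # True
```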

Principal Component Analysis (PCA)

  • PCA is a statistical procedure to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.
  • Each principal component is chosen to capture as much of the remaining variance as possible.
  • All principal components are orthogonal to each other.
  • The first principal component has the maximum variance.
  • PCA is an unsupervised method used for reducing the attribute space from a larger number of variables to a smaller number of factors.
  • It is a dimensionality reduction technique, but there is no assurance that the reduced dimensions will be interpretable.
  • In PCA, the main task is to select the subset of principal components with which the original variables retain the highest correlation, i.e., the components that capture most of the variance.

Mathematical Formulation of PCA

  • Let the data matrix X be of n×p size, where n is the number of samples and p is the number of variables.
  • Assume that X is centered (column means have been subtracted and are now equal to zero).
  • The p×p covariance matrix C is given by:
    • C = \frac{X^T X}{n-1}
  • C is a symmetric matrix and can be diagonalized:
    • C = E \Lambda E^T
    • Where E is a matrix of eigenvectors (each column is an eigenvector) and \Lambda is a diagonal matrix with eigenvalues \lambda_i in decreasing order on the diagonal.
  • The eigenvectors are called principal axes or principal directions of the data.
  • Projections of the data on the principal axes are called principal components, also known as PC scores; these can be seen as new, transformed, variables.
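The formulation above can be sketched directly in NumPy on hypothetical random data; the covariance of the PC scores comes out diagonal, confirming the components are uncorrelated:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical centered data matrix: n samples, p variables.
n, p = 200, 3
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)       # center: column means become zero

C = X.T @ X / (n - 1)        # p x p covariance matrix

# C is symmetric, so eigh returns real eigenvalues and orthonormal eigenvectors.
eigvals, E = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]  # put eigenvalues in decreasing order
eigvals, E = eigvals[order], E[:, order]

scores = X @ E               # principal components (PC scores)

# The scores are uncorrelated: their covariance matrix is diag(eigvals).
print(np.allclose(scores.T @ scores / (n - 1), np.diag(eigvals)))  # True
```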

Data Set used for PCA

  • The sklearn.datasets package includes a dataset on iris flowers.
  • The dataset has four features: sepal length, sepal width, petal length, and petal width.
  • y holds categorical data indicating whether a flower belongs to the setosa, versicolor, or virginica species.
  • Iris setosa, Iris versicolor, and Iris virginica are the three species.
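A minimal sketch of loading the iris data described above:

```python
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target  # X: (150, 4) feature matrix, y: labels 0/1/2

print(X.shape)                 # (150, 4)
print(iris.feature_names)      # sepal/petal length and width
print(iris.target_names)       # ['setosa' 'versicolor' 'virginica']
```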

Implementation using NumPy

  • Calculate the covariance matrix:
    • co = np.cov(X.T)
  • Calculate the correlation matrix:
    • cor = np.corrcoef(X.T)
  • Calculate the eigenvalues and eigenvectors of the covariance matrix:
    • eig_val, eig_vec = np.linalg.eig(co)
  • Sort the eigenvectors in decreasing order of their eigenvalues.
  • Project the data set onto the eigenvectors (the principal axes) to obtain the principal components.
  • Identify the principal components that retain up to 95% of the variance.
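The steps above can be sketched end to end on the iris data; the variable names (co, eig_val, eig_vec, Q_hat) follow the snippets in this section:

```python
import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data
Xc = X - X.mean(axis=0)            # center the data

co = np.cov(Xc.T)                  # 4 x 4 covariance matrix
eig_val, eig_vec = np.linalg.eig(co)

order = np.argsort(eig_val)[::-1]  # sort by decreasing eigenvalue
eig_val, eig_vec = eig_val[order], eig_vec[:, order]

# Cumulative fraction of variance explained by the leading components.
explained = np.cumsum(eig_val) / np.sum(eig_val)
k = np.searchsorted(explained, 0.95) + 1  # components retaining 95% of variance

Q_hat = Xc @ eig_vec[:, :k]        # project onto the leading principal axes
print(k, Q_hat.shape)              # 2 components suffice for iris
```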

Implementation using Pandas

  • Convert the principal component vectors to a Pandas DataFrame:
    • df = pd.DataFrame(Q_hat)
  • Plot a scatter plot of the components, colored by the class labels y.
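A sketch of these steps, assuming Q_hat holds the first two principal components computed as in the NumPy implementation:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target
Xc = X - X.mean(axis=0)
eig_val, eig_vec = np.linalg.eig(np.cov(Xc.T))
order = np.argsort(eig_val)[::-1]
Q_hat = Xc @ eig_vec[:, order[:2]]   # first two principal components

df = pd.DataFrame(Q_hat, columns=["PC1", "PC2"])
df["species"] = y

print(df.head())
# To visualize (requires matplotlib):
# df.plot.scatter(x="PC1", y="PC2", c="species", colormap="viridis")
```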

Implementation using Scikit-learn

  • Use the built-in PCA class from the sklearn.decomposition package.
  • Create the model to return 2 components after PCA.
  • Fit the model to the data set and transform the data.
  • Convert the resulting NumPy array to a Pandas DataFrame.
  • Plot a scatter plot of the transformed data with respect to y.
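The steps above, sketched with scikit-learn's built-in PCA on the iris data:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
X, y = iris.data, iris.target

pca = PCA(n_components=2)      # keep 2 principal components
X_pca = pca.fit_transform(X)   # fit the model and project the data

df = pd.DataFrame(X_pca, columns=["PC1", "PC2"])
df["species"] = y

print(pca.explained_variance_ratio_)  # fraction of variance per component
# To visualize (requires matplotlib):
# df.plot.scatter(x="PC1", y="PC2", c="species", colormap="viridis")
```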