Multivariate Data Analysis - Principal Component Analysis
Variance and Covariance
- Variance estimates how far a set of numbers is spread out from its mean value.
- Covariance measures how two variables vary together.
- Covariance indicates both whether two variables are related and whether they increase or decrease together.
- Covariance is denoted as $\mathrm{cov}(X, Y)$. The formulas are as follows:
Covariance Formulas
- Given two sets of data $X = \{x_1, x_2, \dots, x_n\}$ and $Y = \{y_1, y_2, \dots, y_n\}$:
- Covariance can be calculated as: $\mathrm{cov}(X, Y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$
- Where $\bar{x}$ and $\bar{y}$ are the means of X and Y respectively.
- Simplified formula: $\mathrm{cov}(X, Y) = \frac{1}{n-1}\left(\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}\right)$
- Covariance is symmetric: $\mathrm{cov}(X, Y) = \mathrm{cov}(Y, X)$
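The two covariance formulas above can be checked numerically. This sketch uses small hypothetical data values and compares both forms against NumPy's `np.cov`:

```python
import numpy as np

# Hypothetical example data (values chosen only for illustration)
x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 5.0])

n = len(x)
# Definition: cov(X, Y) = (1/(n-1)) * sum((x_i - x_bar) * (y_i - y_bar))
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

# Simplified form: (1/(n-1)) * (sum(x_i * y_i) - n * x_bar * y_bar)
cov_simplified = (np.sum(x * y) - n * x.mean() * y.mean()) / (n - 1)

# Both agree with NumPy's sample covariance (off-diagonal entry)
print(cov_xy, cov_simplified, np.cov(x, y)[0, 1])
```

Note that `np.cov` uses the same $n-1$ (sample) denominator by default.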
Correlation
- Correlation is denoted as $r$ (also written $\mathrm{corr}(X, Y)$).
- Formula: $r = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y}$
- Where $\sigma_X$ and $\sigma_Y$ are the standard deviations of X and Y respectively.
- Correlation is symmetric: $\mathrm{corr}(X, Y) = \mathrm{corr}(Y, X)$
- This quantity is known as the Pearson product-moment correlation coefficient; it always lies between -1 and 1.
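A minimal sketch of the correlation formula, again on hypothetical data, checked against NumPy's `np.corrcoef`:

```python
import numpy as np

# Hypothetical example data
x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 5.0])

n = len(x)
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

# ddof=1 gives the sample standard deviation, matching the n-1 covariance
r = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))

# Matches NumPy's Pearson correlation coefficient
print(r, np.corrcoef(x, y)[0, 1])
```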
Variance and Covariance in NumPy
- A non-singular matrix Mat is created using numpy:

  Mat = np.asmatrix(np.array([[8, -6, 2], [-6, 7, -4], [2, -4, 1]]))
  print(Mat)
  print(type(Mat))

- The type of Mat is numpy.matrix.
- The covariance between each row of Mat (np.cov treats rows as variables by default) is calculated using np.cov(): print(np.cov(Mat))
- The Pearson correlation coefficient is calculated using np.corrcoef(): print(np.corrcoef(Mat))
Covariance Matrix
- For a data set $X = \{x_1, x_2, \dots, x_n\}$, where $x_i \in \mathbb{R}^p$ and $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ is the sample mean,
- The covariance matrix C is defined as: $C = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(x_i - \bar{x})^T$
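The definition above can be verified directly against `np.cov`. This sketch uses a small hypothetical data matrix with one sample per row:

```python
import numpy as np

# Hypothetical data: n = 5 samples, p = 3 variables (one sample per row)
X = np.array([[2.5, 2.4, 1.0],
              [0.5, 0.7, 0.2],
              [2.2, 2.9, 1.1],
              [1.9, 2.2, 0.8],
              [3.1, 3.0, 1.5]])

n = X.shape[0]
Xc = X - X.mean(axis=0)      # subtract the column means (centering)
C = Xc.T @ Xc / (n - 1)      # p x p sample covariance matrix

# rowvar=False tells np.cov that columns (not rows) are the variables
print(np.allclose(C, np.cov(X, rowvar=False)))
```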
Coordinate Transformations
Translation
- Origins of frame {A} and {B} do not coincide.
- Corresponding axes of {A} and {B} are parallel.
- Relationship between a point P in frame {A} and frame {B}: ${}^A P = {}^B P + {}^A P_{B_{org}}$
- Where ${}^A P_{B_{org}}$ is the origin of frame {B} expressed in frame {A}.
- Example:
- If ${}^B P = (u, v)$ and ${}^A P_{B_{org}} = (x_0, y_0)$, then ${}^A P = (u + x_0, v + y_0)$.
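A short numeric sketch of the pure-translation case, with hypothetical coordinates:

```python
import numpy as np

# Point P expressed in frame {B}, and origin of {B} expressed in frame {A}
# (hypothetical values for illustration)
P_B = np.array([3.0, 2.0])       # (u, v) coordinates in {B}
O_B_in_A = np.array([5.0, 1.0])  # origin of {B} seen from {A}

# Since the axes of {A} and {B} are parallel, the coordinates simply add
P_A = P_B + O_B_in_A
print(P_A)  # [8. 3.]
```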
Rotation
- A point P has coordinates (u, v) in frame {A} and (x, y) in frame {B}.
- The relationship between (u, v) and (x, y) is given by: $u = x\cos\theta - y\sin\theta$, $v = x\sin\theta + y\cos\theta$, where $\theta$ is the rotation angle of {B} relative to {A}.
- This can be expressed in matrix form as: $\begin{pmatrix} u \\ v \end{pmatrix} = {}^A_B R \begin{pmatrix} x \\ y \end{pmatrix}$
- The rotation matrix from frame {B} to frame {A} is: ${}^A_B R = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}$
- The rotation matrix from frame {A} to frame {B} is the transpose of ${}^A_B R$: ${}^B_A R = {}^A_B R^T$
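The transpose property can be checked numerically. This sketch rotates a hypothetical point into frame {A} and recovers it with the transpose:

```python
import numpy as np

theta = np.pi / 6  # hypothetical rotation angle of {B} relative to {A}

# Rotation matrix from frame {B} to frame {A}
R_AB = np.array([[np.cos(theta), -np.sin(theta)],
                 [np.sin(theta),  np.cos(theta)]])

p_B = np.array([1.0, 2.0])   # (x, y) in frame {B}
p_A = R_AB @ p_B             # (u, v) in frame {A}

# The inverse rotation is just the transpose, since R is orthogonal
R_BA = R_AB.T
print(np.allclose(R_BA @ p_A, p_B))
```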
Principal Component Analysis (PCA)
- PCA is a statistical procedure to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.
- Each successive principal component is chosen to capture the largest possible share of the remaining variance.
- All principal components are orthogonal to each other.
- The first principal component has the maximum variance.
- PCA is a non-dependent procedure (it does not distinguish dependent from independent variables), used for reducing the attribute space from a larger number of variables to a smaller number of factors.
- It is a dimension-reduction technique, but there is no assurance that the reduced dimensions will be interpretable.
- In PCA, the main task is to select a smaller set of components derived from the larger set of variables; the original variables with the highest correlation with each principal component determine its interpretation.
Mathematical Formulation of PCA
- Let the data matrix X be of n×p size, where n is the number of samples and p is the number of variables.
- Assume that X is centered (column means have been subtracted and are now equal to zero).
- The p×p covariance matrix C is given by:
- $C = \frac{1}{n-1} X^T X$
- C is a symmetric matrix and can be diagonalized: $C = E \Lambda E^T$
- Where E is a matrix of eigenvectors (each column is an eigenvector) and Λ is a diagonal matrix with eigenvalues in decreasing order on the diagonal.
- The eigenvectors are called principal axes or principal directions of the data.
- Projections of the data on the principal axes are called principal components, also known as PC scores; these can be seen as new, transformed, variables.
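The diagonalization and the PC scores can be sketched on synthetic data. Note that `np.linalg.eigh` (appropriate for symmetric matrices) returns eigenvalues in ascending order, so they are flipped to match the decreasing-order convention above:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))        # hypothetical data: n=100 samples, p=3
X = X - X.mean(axis=0)               # center the columns

C = X.T @ X / (X.shape[0] - 1)       # p x p covariance matrix

# eigh returns eigenvalues in ascending order; flip to descending
eigvals, E = np.linalg.eigh(C)
eigvals, E = eigvals[::-1], E[:, ::-1]

# Diagonalization: C = E Lambda E^T
print(np.allclose(C, E @ np.diag(eigvals) @ E.T))

# Principal components (PC scores) = projections of the data on the axes
scores = X @ E
print(scores.shape)  # (100, 3)
```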
Data Set used for PCA
- The sklearn.datasets package includes a dataset on iris flowers.
- The dataset has 4 features: sepal length, sepal width, petal length, and petal width.
- y holds categorical labels indicating whether each flower belongs to the setosa, versicolor, or virginica species.
- Iris setosa, Iris versicolor, and Iris virginica are the three species.
Implementation using NumPy
Calculate covariance matrix:
co = np.cov(X.T)
Calculate correlation matrix:
cor = np.corrcoef(X.T)
Calculate eigenvalues and eigenvectors of the covariance matrix:
eig_val, eig_vec = np.linalg.eig(co)
Sort the eigenvectors in decreasing order of their eigenvalues.
Project the dataset onto the sorted eigenvectors (the principal axes) to obtain the principal components.
Identify the principal components that retain up to 95% of the information (variance).
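The steps above can be sketched end-to-end on the iris data. The 95% threshold and the variable names (`co`, `eig_val`, `eig_vec`, `Q_hat`) follow the notes; the sorting and cumulative-variance logic is one possible way to implement them:

```python
import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data              # 150 samples x 4 features
Xc = X - X.mean(axis=0)           # center the data

co = np.cov(Xc.T)                 # 4 x 4 covariance matrix
eig_val, eig_vec = np.linalg.eig(co)

# Sort eigenvectors in decreasing order of eigenvalue
order = np.argsort(eig_val)[::-1]
eig_val, eig_vec = eig_val[order], eig_vec[:, order]

# Keep enough components to retain 95% of the variance
explained = np.cumsum(eig_val) / np.sum(eig_val)
k = np.searchsorted(explained, 0.95) + 1

Q_hat = Xc @ eig_vec[:, :k]       # project the data on the principal axes
print(k, Q_hat.shape)
```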
Implementation using Pandas
Convert the principal component vectors to a Pandas DataFrame:
df = pd.DataFrame(Q_hat)
Plot a scatter plot of the principal components with respect to y.
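A minimal sketch of the pandas step, assuming Q_hat holds the first two principal components computed with NumPy as above and y the iris labels (the column names are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data - iris.data.mean(axis=0)   # centered data
y = iris.target

# Recompute Q_hat = first two principal components (as in the NumPy steps)
eig_val, eig_vec = np.linalg.eig(np.cov(X.T))
order = np.argsort(eig_val)[::-1]
Q_hat = X @ eig_vec[:, order[:2]]

df = pd.DataFrame(Q_hat, columns=["PC1", "PC2"])
df["species"] = y

# A scatter plot colored by species can then be drawn, e.g.:
# df.plot.scatter(x="PC1", y="PC2", c="species", colormap="viridis")
print(df.shape)
```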
Implementation using Scikit-learn
- Use the built-in PCA function from the sklearn.decomposition package.
- Generate a model that returns 2 components after PCA.
- Fit the model to our dataset.
- Convert the result from a NumPy array to a pandas DataFrame.
- Plot a scatter plot of the transformed data with respect to y.
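The scikit-learn steps above can be sketched as follows (the DataFrame column names are illustrative):

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
X, y = iris.data, iris.target

# Model generated to return 2 components after PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)      # fit the model and transform the data

# Convert the NumPy array to a pandas DataFrame
df = pd.DataFrame(X_pca, columns=["PC1", "PC2"])
df["species"] = y

print(X_pca.shape, pca.explained_variance_ratio_.sum())
# A scatter plot of PC1 vs PC2 colored by species can then be drawn, e.g.:
# df.plot.scatter(x="PC1", y="PC2", c="species", colormap="viridis")
```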