Multivariate Data Analysis - Principal Component Analysis

Variance and Covariance

  • Variance estimates how far a set of numbers is spread out from its mean value.
  • Covariance measures how two variables vary together, indicating whether they increase and decrease in tandem.
  • Covariance is denoted as Cov(X, Y). The formulas are as follows:

Covariance Formulas

  • Given two sets of data:
    • X = \{x_1, \ldots, x_i, \ldots, x_n\}
    • Y = \{y_1, \ldots, y_i, \ldots, y_n\}
  • Covariance can be calculated as:
    • Cov(X, Y) = \frac{\sum_{i=1}^{n} (x_i - \mu_x)(y_i - \mu_y)}{n}
    • Where \mu_x and \mu_y are the means of X and Y respectively.
  • Simplified formula:
    • Cov(X, Y) = \frac{\sum_{i=1}^{n} x_i y_i}{n} - \mu_x \mu_y
  • Covariance is symmetric:
    • Cov(X, Y) = Cov(Y, X)
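The two covariance formulas above give the same value; a minimal sketch with hypothetical sample data:

```python
import numpy as np

# Hypothetical sample data to check the two covariance formulas.
x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 5.0])
n = len(x)
mu_x, mu_y = x.mean(), y.mean()

# Definition: mean of the products of deviations from the means.
cov_def = np.sum((x - mu_x) * (y - mu_y)) / n

# Simplified formula: mean of the products minus the product of the means.
cov_simplified = np.sum(x * y) / n - mu_x * mu_y

print(cov_def, cov_simplified)  # both print 2.75
```

Note that `np.cov` divides by n-1 by default; pass `bias=True` to match the 1/n normalization used here.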

Correlation

  • Correlation is denoted as \rho(X, Y).
  • Formula:
    • \rho(X, Y) = \frac{Cov(X, Y)}{\sigma_x \sigma_y}
    • Where \sigma_x and \sigma_y are the standard deviations of X and Y respectively.
  • Correlation is symmetric:
    • \rho(X, Y) = \rho(Y, X)
  • This quantity is known as the Pearson product-moment correlation coefficient.
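A quick sketch, on hypothetical data, showing that the ratio Cov(X, Y) / (\sigma_x \sigma_y) matches NumPy's built-in Pearson correlation:

```python
import numpy as np

# Hypothetical data to check rho = Cov(X, Y) / (sigma_x * sigma_y).
x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 5.0])

cov = np.mean((x - x.mean()) * (y - y.mean()))  # population covariance (divide by n)
rho = cov / (x.std() * y.std())                 # np.std also divides by n by default

print(rho)
print(np.corrcoef(x, y)[0, 1])  # same value
```

The n vs n-1 normalization cancels in the ratio, so either convention gives the same correlation.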

Variance and Covariance in NumPy

  • A non-singular matrix Mat is created using NumPy:
    Mat = np.asmatrix(np.array([[8, -6, 2], [-6, 7, -4], [2, -4, 1]]))
    print(Mat)
    print(type(Mat))
  • The type of Mat is numpy.matrix.
  • The covariance between the rows of Mat (np.cov treats each row as a variable by default) is calculated using np.cov():
    print(np.cov(Mat))
  • The Pearson correlation coefficients between the rows are calculated using np.corrcoef():
    print(np.corrcoef(Mat))

Covariance Matrix

  • For a data set X = \{x_1, \ldots, x_i, \ldots, x_n\}, where each x_i \in R^2, and
    • X_1 = \{x_1[1], \ldots, x_i[1], \ldots, x_n[1]\}
    • X_2 = \{x_1[2], \ldots, x_i[2], \ldots, x_n[2]\}
  • The covariance matrix C is defined as:
    • C = \begin{bmatrix} Var(X_1) & Cov(X_1, X_2) \\ Cov(X_2, X_1) & Var(X_2) \end{bmatrix}
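The entry-by-entry construction above can be checked against np.cov; a sketch on hypothetical random 2-D data:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 2-D data set: n samples, each x_i in R^2 (one sample per row).
X = rng.normal(size=(100, 2))
X1, X2 = X[:, 0], X[:, 1]

def cov(a, b):
    # Population covariance (divide by n), matching the formula above.
    return np.mean((a - a.mean()) * (b - b.mean()))

# Assemble C entry by entry: variances on the diagonal, covariances off it.
C = np.array([[cov(X1, X1), cov(X1, X2)],
              [cov(X2, X1), cov(X2, X2)]])

# np.cov with bias=True uses the same 1/n normalization;
# rowvar=False says each column is a variable.
print(np.allclose(C, np.cov(X, rowvar=False, bias=True)))  # True
```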

Coordinate Transformations

Translation

  • Origins of frame {A} and {B} do not coincide.
  • Corresponding axes of {A} and {B} are parallel.
  • Relationship between a point P in frame {A} and frame {B}:
    • ^B P = ^A P - ^A O_B
    • Where ^A O_B is the origin of frame {B} expressed in frame {A}.
  • Example:
    • If ^A P = (10, 10) and ^A O_B = (3, 5), then ^B P = (10-3, 10-5) = (7, 5).
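The worked example above as a short NumPy sketch:

```python
import numpy as np

# Translation between frames with parallel axes: ^B P = ^A P - ^A O_B.
# Values taken from the example above.
P_A = np.array([10.0, 10.0])     # point P expressed in frame {A}
O_B_in_A = np.array([3.0, 5.0])  # origin of {B} expressed in {A}

P_B = P_A - O_B_in_A
print(P_B)  # [7. 5.]
```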

Rotation

  • A point P has coordinates (u, v) in frame {A} and (x, y) in frame {B}.
  • The relationship between (u, v) and (x, y) is given by:
    • u = x \cos(\theta) - y \sin(\theta)
    • v = x \sin(\theta) + y \cos(\theta)
  • This can be expressed in matrix form as:
    • \begin{bmatrix} u \\ v \end{bmatrix} = \begin{bmatrix} \cos(\theta) & -\sin(\theta) \\ \sin(\theta) & \cos(\theta) \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix}
  • The rotation matrix from frame {B} to frame {A} is:
    • ^A R_B = \begin{bmatrix} \cos(\theta) & -\sin(\theta) \\ \sin(\theta) & \cos(\theta) \end{bmatrix}
  • The rotation matrix from frame {A} to frame {B} is the transpose of ^A R_B:
    • ^B R_A = (^A R_B)^T = \begin{bmatrix} \cos(\theta) & \sin(\theta) \\ -\sin(\theta) & \cos(\theta) \end{bmatrix}
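A sketch, for an arbitrary assumed angle, verifying that the transpose of the rotation matrix inverts the rotation (rotation matrices are orthogonal, so R^T R = I):

```python
import numpy as np

theta = np.pi / 6  # 30 degrees, arbitrary choice for illustration
# Rotation matrix ^A R_B mapping frame {B} coordinates to frame {A}.
R_AB = np.array([[np.cos(theta), -np.sin(theta)],
                 [np.sin(theta),  np.cos(theta)]])

p_B = np.array([1.0, 0.0])  # coordinates (x, y) of P in frame {B}
p_A = R_AB @ p_B            # coordinates (u, v) of P in frame {A}

# The transpose undoes the rotation: ^B R_A = (^A R_B)^T.
print(np.allclose(R_AB.T @ p_A, p_B))  # True
```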

Principal Component Analysis (PCA)

  • PCA is a statistical procedure to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.
  • Each principal component is chosen to capture as much of the remaining variance as possible.
  • All principal components are orthogonal to each other.
  • The first principal component has the maximum variance.
  • PCA is an unsupervised method used for reducing the attribute space from a larger number of variables to a smaller number of factors.
  • It is a dimensionality reduction technique, but there is no assurance that the reduced dimensions will be interpretable.
  • In PCA, the main task is to select the subset of principal components with which the original variables retain the highest correlation, i.e., the components that capture most of the variance.

Mathematical Formulation of PCA

  • Let the data matrix X be of n×p size, where n is the number of samples and p is the number of variables.
  • Assume that X is centered (column means have been subtracted and are now equal to zero).
  • The p×p covariance matrix C is given by:
    • C = \frac{X^T X}{n-1}
  • C is a symmetric matrix and can be diagonalized:
    • C = E \Lambda E^T
    • Where E is a matrix of eigenvectors (each column is an eigenvector) and \Lambda is a diagonal matrix with eigenvalues \lambda_i in decreasing order on the diagonal.
  • The eigenvectors are called principal axes or principal directions of the data.
  • Projections of the data on the principal axes are called principal components, also known as PC scores; these can be seen as new, transformed, variables.
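The formulation above can be sketched directly in NumPy on hypothetical random data; the covariance of the PC scores comes out diagonal, confirming the components are uncorrelated:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical centered data matrix: n samples, p variables.
n, p = 200, 3
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)       # center: column means become zero

C = X.T @ X / (n - 1)        # p x p covariance matrix

# C is symmetric, so eigh returns real eigenvalues and orthonormal eigenvectors.
eigvals, E = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]  # put eigenvalues in decreasing order
eigvals, E = eigvals[order], E[:, order]

scores = X @ E               # principal components (PC scores)

# The scores are uncorrelated: their covariance matrix is diag(eigvals).
print(np.allclose(scores.T @ scores / (n - 1), np.diag(eigvals)))  # True
```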

Data Set used for PCA

  • The sklearn.datasets package includes a dataset on iris flowers.
  • The dataset has four features: sepal length, sepal width, petal length, and petal width.
  • y holds categorical data indicating whether a flower belongs to the setosa, versicolor, or virginica species.
  • Iris setosa, Iris versicolor, and Iris virginica are the three species.
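A minimal sketch of loading the iris data described above:

```python
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target  # X: (150, 4) feature matrix, y: labels 0/1/2

print(X.shape)                 # (150, 4)
print(iris.feature_names)      # sepal/petal length and width
print(iris.target_names)       # ['setosa' 'versicolor' 'virginica']
```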

Implementation using NumPy

  • Calculate the covariance matrix:
    • co = np.cov(X.T)
  • Calculate the correlation matrix:
    • cor = np.corrcoef(X.T)
  • Calculate the eigenvalues and eigenvectors of the covariance matrix:
    • eig_val, eig_vec = np.linalg.eig(co)
  • Sort the eigenvectors in decreasing order of their eigenvalues.
  • Project the data set onto the eigenvectors (the principal axes) to obtain the principal components.
  • Identify the principal components that retain up to 95% of the variance.
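The steps above can be sketched end to end on the iris data; the variable names (co, eig_val, eig_vec, Q_hat) follow the snippets in this section:

```python
import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data
Xc = X - X.mean(axis=0)            # center the data

co = np.cov(Xc.T)                  # 4 x 4 covariance matrix
eig_val, eig_vec = np.linalg.eig(co)

order = np.argsort(eig_val)[::-1]  # sort by decreasing eigenvalue
eig_val, eig_vec = eig_val[order], eig_vec[:, order]

# Cumulative fraction of variance explained by the leading components.
explained = np.cumsum(eig_val) / np.sum(eig_val)
k = np.searchsorted(explained, 0.95) + 1  # components retaining 95% of variance

Q_hat = Xc @ eig_vec[:, :k]        # project onto the leading principal axes
print(k, Q_hat.shape)              # 2 components suffice for iris
```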

Implementation using Pandas

  • Convert the principal component vectors to a Pandas DataFrame:
    • df = pd.DataFrame(Q_hat)
  • Plot a scatter plot of the components, colored by the class labels y.
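A sketch of these steps, assuming Q_hat holds the first two principal components computed as in the NumPy implementation:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target
Xc = X - X.mean(axis=0)
eig_val, eig_vec = np.linalg.eig(np.cov(Xc.T))
order = np.argsort(eig_val)[::-1]
Q_hat = Xc @ eig_vec[:, order[:2]]   # first two principal components

df = pd.DataFrame(Q_hat, columns=["PC1", "PC2"])
df["species"] = y

print(df.head())
# To visualize (requires matplotlib):
# df.plot.scatter(x="PC1", y="PC2", c="species", colormap="viridis")
```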

Implementation using Scikit-learn

  • Use the built-in PCA class from the sklearn.decomposition package.
  • Create the model to return 2 components after PCA.
  • Fit the model to the data set and transform the data.
  • Convert the resulting NumPy array to a Pandas DataFrame.
  • Plot a scatter plot of the transformed data with respect to y.
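The steps above, sketched with scikit-learn's built-in PCA on the iris data:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
X, y = iris.data, iris.target

pca = PCA(n_components=2)      # keep 2 principal components
X_pca = pca.fit_transform(X)   # fit the model and project the data

df = pd.DataFrame(X_pca, columns=["PC1", "PC2"])
df["species"] = y

print(pca.explained_variance_ratio_)  # fraction of variance per component
# To visualize (requires matplotlib):
# df.plot.scatter(x="PC1", y="PC2", c="species", colormap="viridis")
```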