dimensionality reduction 1 (data mining)

63 Terms

1
New cards

What is dimensionality reduction?

A technique in data analysis that simplifies data by reducing the number of variables or dimensions while preserving the most crucial information from high-dimensional spaces.

2
New cards

What are the three key characteristics that dimensionality reduction should ensure?

1. Preserving relevance (retains crucial information)
2. Eliminating redundancy (removes redundant/irrelevant attributes)
3. Transformation into a lower-dimensional space (makes the data manageable for analysis)

3
New cards

What are the four main motivations for dimensionality reduction?

1. Improved models (better performance and reduced overfitting)
2. Efficiency and resource management (computational and storage)
3. Interpretability and understanding (visualization and enhanced interpretability)
4. Data quality enhancement (removal of redundancy and noise reduction)

4
New cards

What is the curse of dimensionality?

As the number of dimensions increases, the hypersphere volume becomes negligible compared to the hypercube volume, causing data points to become sparse and distance metrics to lose effectiveness.

5
New cards

How does the curse of dimensionality affect distance metrics?

In high dimensions, traditional distance metrics like Euclidean distance lose their effectiveness because points become dispersed, making most points seem equally far apart and posing challenges for distance-based analysis.
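
A minimal NumPy sketch (not part of the deck) illustrating this distance-concentration effect; the uniform hypercube data, the point count, and the (max-min)/min contrast measure are illustrative choices:

```python
# As dimensionality grows, the gap between the nearest and farthest neighbor of
# a query point shrinks relative to the nearest distance, so Euclidean distance
# loses its discriminative power.
import numpy as np

rng = np.random.default_rng(0)
n_points = 1000

for d in [2, 10, 100, 1000]:
    X = rng.uniform(size=(n_points, d))    # random points in the unit hypercube
    q = rng.uniform(size=d)                # a random query point
    dists = np.linalg.norm(X - q, axis=1)  # Euclidean distances to the query
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"d={d:5d}  relative contrast (max-min)/min = {contrast:.3f}")
```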

6
New cards

What are the two main categories of dimensionality reduction methods?

1. Feature extraction: transforms the original attributes into new ones (includes PCA and t-SNE)
2. Feature selection: selects a subset of the original attributes (includes filter and wrapper methods)

7
New cards

What is the difference between linear and non-linear feature extraction methods?

Linear methods (like PCA) transform original attributes into a new set of linear attributes. Non-linear methods (like t-SNE) transform original attributes into a new set of non-linear attributes.

8
New cards

What is the main goal of PCA?

Transform the original data into new variables (principal components) that follow the variation in the data, capturing maximum variance in the fewest dimensions.

9
New cards

What is the principle behind PCA?

PCA rotates the coordinate axes to align with the directions of maximum data variability, creating new orthogonal axes called principal components.

10
New cards

What are principal components?

New axes formed after rotation, where the first principal component (PC1) captures most of the variation, and subsequent components capture remaining variation while being orthogonal to previous ones.

11
New cards

What are the two equivalent mathematical formulations of PCA?

1. Hotelling (1933), maximizing variance: express the maximum variation within the first component.
2. Pearson (1901), minimizing error: minimize the sum of projection errors on the principal component.

12
New cards

Why are maximizing variance and minimizing projection error equivalent in PCA?

Because total variance (D₃²) = remaining variance (D₁²) + projection error (D₂²). Since total variance is fixed, maximizing D₁ automatically minimizes D₂.
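
A tiny numeric check of this identity; the point and the unit direction are arbitrary, made up for illustration:

```python
# Check D3^2 = D1^2 + D2^2 for one centered point x and a unit-length
# candidate component direction w.
import numpy as np

x = np.array([3.0, 4.0])               # a centered data point
w = np.array([1.0, 1.0]) / np.sqrt(2)  # unit vector along the candidate PC

D1 = x @ w                             # length of the projection of x onto w
D2 = np.linalg.norm(x - D1 * w)        # perpendicular projection error
D3 = np.linalg.norm(x)                 # distance from the origin (fixed)

print(D1**2 + D2**2, D3**2)            # both print 25.0 (up to rounding)
```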

13
New cards

What does D₁ represent in PCA?

The remaining variance - the length of projection of a data point onto the principal component, measuring how much variation is retained.

14
New cards

What does D₂ represent in PCA?

The projection error or lost variance - the perpendicular distance from a data point to the component line, measuring information lost during projection.

15
New cards

What does D₃ represent in PCA?

The original fixed variance - the total distance from origin to data point, representing total information in original data (independent of chosen component).

16
New cards

What is Step 1 of the PCA algorithm?

Preprocessing: Subtract the mean from each attribute (column) to center the data around zero, and scale attributes if their scales differ.

17
New cards

Why do we center the data in PCA?

Centering (subtracting mean) makes each column have mean 0, which simplifies the covariance formula and ensures the new coordinate system goes through the data's mean.

18
New cards

When should you scale attributes to unit variance in PCA?

When attributes have different scales or units. Scaling ensures PCA gives each feature equal importance and prevents features with larger ranges from dominating the analysis.

19
New cards

What is the formula for scaling to unit variance?

x_scaled = (x - μ_x) / σ_x, where μ_x is the mean and σ_x is the standard deviation of feature x.
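
A minimal NumPy sketch of the Step 1 preprocessing; the helper name preprocess, the scale flag, and the example numbers are illustrative, not from the slides:

```python
import numpy as np

def preprocess(X, scale=True):
    """Center each column of the (n x m) array X; optionally scale to unit variance."""
    X_centered = X - X.mean(axis=0)                      # each column gets mean 0
    if scale:
        X_centered = X_centered / X.std(axis=0, ddof=1)  # each column gets variance 1
    return X_centered

# Example with made-up height/weight rows:
X = np.array([[170.0, 70.0], [160.0, 60.0], [180.0, 80.0]])
print(preprocess(X))
```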

20
New cards

What is Step 2 of the PCA algorithm?

Calculate the covariance matrix (Σ), which measures how features vary together.

21
New cards

What is the formula for the covariance matrix in PCA?

C = (1/(n-1)) × X^T × X, where X is the centered data matrix (n×m), producing a symmetric m×m covariance matrix.

22
New cards

Why can we use matrix multiplication to calculate covariance after centering?

Because when the data is centered (mean = 0), the covariance between features simplifies to Cov(x_i, x_j) = (1/(n-1)) Σ_k x_ki·x_kj, which is exactly what X^T X computes (up to the 1/(n-1) factor) for all feature pairs.
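
A short sketch of this on synthetic data; the 1/(n-1) factor is applied explicitly, and np.cov is used only as a cross-check:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X = X - X.mean(axis=0)                 # center the data so the formula applies

C = X.T @ X / (X.shape[0] - 1)         # m x m covariance matrix
assert np.allclose(C, np.cov(X, rowvar=False))  # matches NumPy's covariance
```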

23
New cards

What is Step 3 of the PCA algorithm?

Compute eigenvalues and eigenvectors of the covariance matrix to find principal components through diagonalization.

24
New cards

What do eigenvectors represent in PCA?

Eigenvectors are the directions (principal components) along which the data varies the most - they define the new coordinate system axes.

25
New cards

What do eigenvalues represent in PCA?

Eigenvalues measure how much variance is captured along each corresponding eigenvector direction - larger eigenvalues indicate more important components.
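
A small sketch on synthetic, correlated data verifying that each eigenvalue equals the variance of the data projected onto its eigenvector:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 3))  # correlated features
X = X - X.mean(axis=0)
C = X.T @ X / (X.shape[0] - 1)

eigvals, eigvecs = np.linalg.eigh(C)     # eigh: for symmetric matrices, ascending order
for lam, v in zip(eigvals, eigvecs.T):   # each column of eigvecs is an eigenvector
    proj_var = np.var(X @ v, ddof=1)     # variance of the data along direction v
    print(f"eigenvalue {lam:8.4f}   variance along its eigenvector {proj_var:8.4f}")
```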

26
New cards

What is Step 4 of the PCA algorithm?

Sort eigenvalues in descending order and select k < m eigenvectors (corresponding to top k eigenvalues) as principal components.

27
New cards

What is Step 5 of the PCA algorithm?

Reduce dimensionality by projecting the data onto the k principal components: X_pca = X × P_k, where P_k is the m×k matrix of the top k eigenvectors.

28
New cards

What are the dimensions of the projection result?

X_pca is an n×k matrix, where n is the number of objects and k is the reduced number of dimensions (k < m original dimensions).
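
A sketch of Steps 4-5 on synthetic data (n=200, m=5); np.linalg.eigh returns eigenvalues in ascending order, so they are re-sorted to descending before selecting the top k:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
X = X - X.mean(axis=0)
C = X.T @ X / (X.shape[0] - 1)

eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]             # indices of descending eigenvalues
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2
P_k = eigvecs[:, :k]                          # m x k matrix of the top-k eigenvectors
X_pca = X @ P_k                               # n x k reduced representation
print(X_pca.shape)                            # (200, 2)
```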

29
New cards

How do you reconstruct an approximation of original data from reduced dimensions?

X_approx = X_pca × P_k^T, where X_pca is the n×k reduced data and P_k^T is the k×m transpose of the eigenvector matrix, producing an n×m approximation.

30
New cards

Why doesn't multiplying by P_k^T cancel the previous projection?

Because when k < m, P_k × P_k^T is a projection matrix onto a k-dimensional subspace, not the identity matrix. It maps back to the original space but loses the components along the discarded eigenvectors.
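
A sketch on synthetic data illustrating this: reconstructing from k < m components returns an n×m matrix that approximates, but does not equal, the original centered data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
X = X - X.mean(axis=0)
C = X.T @ X / (X.shape[0] - 1)

eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2
P_k = eigvecs[:, :k]
X_pca = X @ P_k                  # n x k
X_approx = X_pca @ P_k.T         # back to n x m; variance along discarded eigenvectors is lost
print(np.allclose(X_approx, X))  # False: P_k @ P_k.T is not the identity when k < m
```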

31
New cards

What are the three main methods for choosing k in PCA?

1. Percentage of variance retained (e.g., 99%)
2. Average squared projection error
3. Scree plot with the elbow method

32
New cards

What is the formula for percentage of variance retained?

(Σ λ_i for i=1 to k) / (Σ λ_i for i=1 to p) ≥ threshold (e.g., 0.99), where the λ_i are the eigenvalues.

33
New cards

What does "99% of variance retained" mean?

Choose the smallest k such that the sum of the top k eigenvalues divided by the sum of all eigenvalues is at least 0.99, ensuring the reduced data captures 99% of original variation.

34
New cards

What is average squared projection error in terms of eigenvalues?

The sum of eigenvalues of discarded components: Error = Σλ_i for i=(k+1) to p. Smaller error means more information retained.
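
A sketch of both criteria using made-up, already-sorted eigenvalues (the numbers and the 95% threshold are illustrative only):

```python
import numpy as np

eigvals = np.array([6.0, 2.5, 1.1, 0.3, 0.1])   # assumed sorted in descending order
retained = np.cumsum(eigvals) / eigvals.sum()   # variance retained for each possible k
k = int(np.argmax(retained >= 0.95)) + 1        # smallest k reaching the threshold
print("k =", k)                                 # 3
print("variance retained:", retained[k - 1])    # 0.96
print("projection error :", eigvals[k:].sum())  # sum of discarded eigenvalues = 0.4
```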

35
New cards

What is total variation in the data?

Total variance = Σ λ_i for i=1 to p (the sum of all eigenvalues), or equivalently (1/n) Σ ||x_i||² for centered data, where ||x_i||² is the sum of squared feature values of data point x_i.

36
New cards

What is a scree plot?

A 2D plot with component number (1, 2, 3, …, p) on x-axis and corresponding eigenvalue on y-axis, showing eigenvalues in descending order.

37
New cards

What is the elbow point in a scree plot?

The point where eigenvalues start dropping slowly (curve flattens), indicating that components beyond this point contribute very little variance and can be discarded.

38
New cards

How do you use the elbow method to choose k?

Identify the elbow point in the scree plot where the slope changes dramatically - choose k at or just before this point to capture most variance efficiently.
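
A minimal matplotlib sketch of a scree plot; the eigenvalues are made up here, and in practice you would plot the ones computed in Step 3:

```python
import numpy as np
import matplotlib.pyplot as plt

eigvals = np.array([4.8, 2.9, 0.6, 0.3, 0.2, 0.1])   # assumed sorted descending
components = np.arange(1, len(eigvals) + 1)

plt.plot(components, eigvals, marker="o")            # the elbow is around component 3
plt.xlabel("Principal component")
plt.ylabel("Eigenvalue (variance explained)")
plt.title("Scree plot")
plt.show()
```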

39
New cards

Why is using eigenvalues more efficient than computing X_approx for each k?

Because projection error equals the sum of discarded eigenvalues (Σλ_i for i>k), eliminating the need to reconstruct data for every k value - just compute eigenvalues once.

40
New cards

What is the relationship between the three methods for choosing k?

All three are equivalent perspectives: maximizing variance retained = minimizing projection error = identifying the elbow in eigenvalue decay.

41
New cards

What does ||x_i||² represent?

The squared Euclidean norm (length) of the vector x_i, calculated as x_i1² + x_i2² + … + x_ip² (the sum of squares of all its feature values).

42
New cards

What appears on the diagonal of the covariance matrix?

The variances of each feature: C_ii = Var(X_i), showing how much each individual feature varies.

43
New cards

What appears on the off-diagonal of the covariance matrix?

The covariances between features: C_ij = Cov(X_i, X_j), showing how pairs of features vary together.

44
New cards

What is the formula for covariance between two features?

Cov(X_i, X_j) = (1/(n-1)) Σ (x_ki - μ_i)(x_kj - μ_j) for k=1 to n, where μ_i and μ_j are the means of features i and j.

45
New cards

What does positive covariance indicate?

When Cov(X_i, X_j) > 0, the two features increase together - they are positively correlated.

46
New cards

What does negative covariance indicate?

When Cov(X_i, X_j) < 0, one feature increases while the other decreases - they are negatively correlated.

47
New cards

What does near-zero covariance indicate?

When Cov(X_i, X_j) ≈ 0, the two features are mostly unrelated or uncorrelated.

48
New cards

What is the relationship between covariance matrix eigenvalues and variance?

The sum of all eigenvalues equals the total variance in the data: Σ λ_i = (1/n) Σ ||x_i||² (the sum of the variances of all features).

49
New cards

In diagonalization, what is the D matrix?

A diagonal matrix consisting of eigenvalues (λ) along the diagonal, with all off-diagonal entries being zero.

50
New cards

What is the spectral theorem in PCA?

A theorem ensuring that eigenvectors of a symmetric covariance matrix are orthogonal (perpendicular), allowing data projection without correlation between new components.

51
New cards

What does it mean for principal components to be orthogonal?

Each component is perpendicular (uncorrelated) to all others, capturing independent directions of variation in the data.
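
A short NumPy check on synthetic data: the eigenvector matrix P of the symmetric covariance matrix is orthonormal, and the eigenvalues sum to the total variance (the trace of C):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4)) @ rng.normal(size=(4, 4))
X = X - X.mean(axis=0)
C = X.T @ X / (X.shape[0] - 1)

eigvals, P = np.linalg.eigh(C)
print(np.allclose(P.T @ P, np.eye(4)))         # True: eigenvectors are orthonormal
print(np.isclose(eigvals.sum(), np.trace(C)))  # True: eigenvalues sum to the total variance
```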

52
New cards

How is PCA used for denoising?

Represent the data as a matrix and identify its principal components: the high-variance components retain the signal while the low-variance components mostly capture noise, so discarding the latter denoises the data. Typical uses include image denoising and enhancing data quality in various fields.

53
New cards

How does PCA work for face recognition?

Uses eigenfaces (eigenvalue/eigenvector decomposition of face images) to create reduced-dimension representation, then compares face features for recognition in security systems and biometric authentication.

54
New cards

How does PCA enable data visualization?

Reduces high-dimensional data to 2D or 3D by projecting onto lower-dimensional subspace while preserving data variance, enabling exploratory analysis and cluster visualization.
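
A minimal example of this, using scikit-learn's PCA and the Iris dataset (neither is mentioned in the deck) to project 4-dimensional data down to 2D for plotting:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
X_2d = PCA(n_components=2).fit_transform(iris.data)  # 4 features -> 2 components

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=iris.target)   # color by species to see clusters
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()
```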

55
New cards

What are other applications of PCA? (Name at least 3)

Anomaly detection (identifying unusual patterns), data compression (reducing dimensionality), recommendation systems (extracting user preferences), genomic data analysis (identifying gene expression patterns).

56
New cards

Why is correlation common in real-world datasets?

Correlations often reflect complex relationships (e.g., weight-height correlation, spatial pixel correlations, study hours-test scores), indicating redundancy that PCA can address to enhance performance.

57
New cards

How does PCA handle correlated features?

PCA finds new axes (linear combinations like 0.7×height + 0.7×weight) that capture shared variance efficiently, transforming correlated features into uncorrelated principal components.
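
A sketch with synthetic height/weight data (the means, slope, and noise level are made up) showing that, after standardization, PC1 is approximately an equal-weight combination of the two correlated features:

```python
import numpy as np

rng = np.random.default_rng(0)
height = rng.normal(170, 10, size=500)
weight = 0.9 * (height - 170) + 70 + rng.normal(0, 3, size=500)  # strongly correlated with height

X = np.column_stack([height, weight])
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardize both features
C = X.T @ X / (X.shape[0] - 1)

eigvals, eigvecs = np.linalg.eigh(C)
pc1 = eigvecs[:, np.argmax(eigvals)]               # eigenvector of the largest eigenvalue
print(pc1)                                         # roughly [0.707, 0.707] (up to sign)
```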

58
New cards

What is the geometric interpretation of PCA projection?

Like shining a flashlight on a data cloud from different angles - the first PC is the direction where the shadow (projection) has the widest spread (maximum variance).

59
New cards

In 2D, how can you describe the first principal component?

The line that either: (1) passes through the longest spread of the data cloud, or (2) minimizes perpendicular distances between itself and all data points - both views are equivalent.

60
New cards

What happens when a PCA component is perfectly parallel to the variance direction?

This is the ideal case: projection error (D₂) = 0, and 100% of variance is captured - the component is perfectly aligned with how the data naturally spreads.

61
New cards

Why do we need PCA when dealing with the curse of dimensionality?

In high dimensions, space becomes empty, distances become meaningless, and all points seem equally far. PCA finds the most important directions, making analysis reliable again.

62
New cards

What is the spring motion example demonstrating?

That 3 cameras recording (x, y) positions create 6D data, but the underlying physical motion is 1D (along the spring axis). PCA extracts this hidden 1D dynamic from the 6D recordings.

63
New cards

How does PCA reduce redundancy in the spring example?

Multiple cameras capture the same underlying motion from different angles (redundant information). PCA identifies that one dimension explains most variance, reducing 6D to 1D.
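
A sketch of the spring example with synthetic data: one hidden 1D oscillation is observed through six noisy camera coordinates, and the first eigenvalue accounts for nearly all of the variance:

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 10, 500)
motion = np.sin(2 * np.pi * t)              # the hidden 1D spring motion

mixing = rng.normal(size=(1, 6))            # how each of the 6 camera axes sees the motion
X = motion[:, None] @ mixing                # 500 x 6 "recordings"
X = X + rng.normal(0, 0.05, size=X.shape)   # small measurement noise
X = X - X.mean(axis=0)

eigvals = np.linalg.eigvalsh(X.T @ X / (X.shape[0] - 1))[::-1]  # descending order
print(eigvals / eigvals.sum())              # the first ratio is close to 1
```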