Topic 11: Dimension Reduction (PCA)

Last updated 5:59 PM on 4/20/26
14 Terms

1. What is the dimensionality of a data analytics problem?

It is the dimensionality of the final feature representation of a dataset for machine learning modeling.

2. What is the challenge of high dimensionality of feature space for machine learning?

It significantly increases the "search space" for finding a machine learning model. This is often referred to as "the curse of dimensionality."

3. Techniques that convert a high-dimension feature space to a low-dimension feature representation:

PCA, Word2Vec, AutoEncoder

4. First Principal Component:

A vector/line that maximizes the variance of the data projected onto the line.

5. Principal Component Analysis

Find a set of orthogonal vectors that best capture the variance of the data.

6. What does the first principal component mean?

It is the vector/line that best characterizes the variations of the data. If we are limited to using only one dimension to represent the data, the First Principal Component is the best choice.
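The "maximizes the variance" property can be illustrated numerically; a minimal NumPy sketch on toy data (not from the course):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy 2-D data, stretched far more along one axis than the other.
X = rng.normal(size=(500, 2)) @ np.diag([3.0, 0.5])
Xc = X - X.mean(axis=0)                  # center the data first

# First principal component = top right singular vector of the centered data.
pc1 = np.linalg.svd(Xc, full_matrices=False)[2][0]
var_pc1 = np.var(Xc @ pc1)               # variance of data projected on PC1

# No other single direction yields more projected variance than PC1.
for _ in range(100):
    d = rng.normal(size=2)
    d /= np.linalg.norm(d)
    assert np.var(Xc @ d) <= var_pc1 + 1e-9
```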

7. Second Principal Component

A vector/line that is orthogonal to the first principal component AND maximizes the variance of the data when they are projected onto the line.

8. What does the second principal component mean?

It is a vector/line that is orthogonal to the first principal component, and that best characterizes the remaining variations of the data (beyond those captured by the first principal component). If we are limited to using only two dimensions to represent the data, the First and Second Principal Components are the best choice.

9. Principal Components are computed using…

Singular Value Decomposition (SVD)
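As a sanity check (a NumPy sketch, not course code): the right singular vectors of the centered data matrix are exactly the eigenvectors of its sample covariance matrix, which is why SVD yields the principal components.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
Xc = X - X.mean(axis=0)                       # center the data

# Principal components are the rows of Vt from the SVD of the centered data.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# They coincide with the eigenvectors of the sample covariance matrix,
# and s_i^2 / (n - 1) are the corresponding eigenvalues.
cov = Xc.T @ Xc / (len(Xc) - 1)
eigvals, eigvecs = np.linalg.eigh(cov)        # eigh returns ascending order
assert np.allclose(S ** 2 / (len(Xc) - 1), eigvals[::-1])
assert np.isclose(abs(Vt[0] @ eigvecs[:, -1]), 1.0)   # same direction up to sign
```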

10. Using PCA for Dimension Reduction

• Decide the size of the lower dimension (we will denote it as L) to which we want to map the data.
• Use the input data to find the first L Principal Components.
• Map the input data to the transformed (lower-dimension) space formed by the L principal components.
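The three steps above can be sketched in plain NumPy (the function name pca_reduce is made up for illustration):

```python
import numpy as np

def pca_reduce(X, L):
    """Map X (n_samples x n_features) onto its first L principal components."""
    Xc = X - X.mean(axis=0)                        # center the data
    Vt = np.linalg.svd(Xc, full_matrices=False)[2]
    components = Vt[:L]                            # first L principal components
    return Xc @ components.T                       # project into the L-dim space

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
Z = pca_reduce(X, L=2)                             # 100 samples, now 2-dimensional
```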

11. PCA in PySpark (ml.feature module) creates a PCA template object that specifies:

• The number of principal components to form the reduced-dimension space.
• The input column for the original features to be used to find the principal components.
• The output column for the transformed features in the reduced feature space (formed by the principal components).

12. How to find the cluster center in one original dimension for all of the clusters formed in the PCA-reduced dimensions

Filter for one binary feature (i.e., all scanners that scan a specific top port), then use groupBy on "pca_prediction" (i.e., on the clusters formed in the PCA-reduced 35 dimensions), followed by count().

13. How to compare clustering results with and without PCA?

• Compare the Silhouette Score of the two approaches. Which one is better?
• Compare the Mirai_ratio of the clusters formed.
• Compare the cluster centers formed.

14. Pre-processing for PCA

Input data needs to be normalized before feeding it to PCA. Otherwise, the principal components identified can be influenced by the scale of different dimensions.
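The scale effect can be demonstrated with a small NumPy sketch (toy data; the normalization shown is plain per-column z-scoring):

```python
import numpy as np

rng = np.random.default_rng(3)
# Two independent features on very different scales.
X = np.column_stack([rng.normal(scale=1000.0, size=300),
                     rng.normal(scale=1.0, size=300)])

def pc1_share(M):
    """Fraction of total variance captured by the first principal component."""
    Mc = M - M.mean(axis=0)
    s = np.linalg.svd(Mc, full_matrices=False)[1]
    return s[0] ** 2 / np.sum(s ** 2)

# Without normalization, PC1 is dominated by the large-scale feature.
raw_share = pc1_share(X)

# Standardize each column (zero mean, unit variance) before PCA.
Xn = (X - X.mean(axis=0)) / X.std(axis=0)
norm_share = pc1_share(Xn)   # variance is now shared across features
```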