What is the dimensionality of a data analytics problem?
It is the dimensionality of the final feature representation of a dataset for machine learning modeling.
What is the challenge of high dimensionality of feature space for machine learning?
A high-dimensional feature space significantly increases the "search space" for finding a machine learning model. This is often referred to as "the curse of dimensionality".
Techniques that convert a high-dimensional feature space to a low-dimensional feature representation:
PCA, Word2Vec, AutoEncoder
First Principal Component:
A vector/line that maximizes the variance of the data projected onto the line.
Principal Component Analysis:
Find a set of orthogonal vectors that best capture the variance of the data.
What does the first principal component mean?
It is a vector/line that best characterizes the variations of the data.
If we are limited to using only one dimension to represent the data, the First Principal Component is the best choice.
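The "best one-dimensional choice" claim can be checked numerically. A minimal NumPy sketch on synthetic data (all names here are illustrative): project the centered data onto the first principal component and onto many random unit directions, and confirm that no random direction retains more variance.

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated 2-D data: most variation lies along one direction.
x = rng.normal(size=500)
data = np.column_stack([x, 0.5 * x + 0.1 * rng.normal(size=500)])
centered = data - data.mean(axis=0)

# First principal component = top right-singular vector of the centered data.
_, _, vt = np.linalg.svd(centered, full_matrices=False)
pc1 = vt[0]

var_pc1 = np.var(centered @ pc1)
# Compare against 200 random unit directions: none should beat the first PC.
best_random = max(
    np.var(centered @ (v / np.linalg.norm(v)))
    for v in rng.normal(size=(200, 2))
)
assert var_pc1 >= best_random
```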
Second Principal Component:
A vector/line that is orthogonal to the first principal component AND maximizes the variance of the data when they are projected onto the line.
What does the second principal component mean?
It is a vector/line that is orthogonal to the first principal component, and that best characterizes the remaining variations of the data (beyond those captured by the first principal component).
If we are limited to using only two dimensions to represent the data, the First Principal Component and the Second Principal Component are the best choice.
Principal Components are computed using…
Singular Value Decomposition (SVD)
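A minimal NumPy sketch of that relationship, on synthetic data: the rows of Vᵀ from the SVD of the centered data matrix are the principal components, and they come out orthonormal and ordered by the variance they capture.

```python
import numpy as np

rng = np.random.default_rng(1)
# Mix 4 independent signals so the observed features are correlated.
data = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))
centered = data - data.mean(axis=0)

# SVD of the centered data: the rows of vt are the principal components.
u, s, vt = np.linalg.svd(centered, full_matrices=False)

# The components are orthonormal ...
assert np.allclose(vt @ vt.T, np.eye(4), atol=1e-8)
# ... and the singular values are sorted, largest captured variance first.
assert np.all(np.diff(s) <= 0)
# Variance captured by each component:
explained = s**2 / (len(centered) - 1)
```

Summing `explained` recovers the total variance of the data, which is why dropping the trailing components loses the least information.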
Using PCA for Dimension Reduction
• Decide the size of the lower dimension (we will denote it as L) to which we want to map the data.
• Use the input data to find the first L Principal Components.
• Map the input data to the transformed (lower-dimension) space formed by the L principal components.
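The three steps above can be sketched in NumPy (the `pca_reduce` helper is my own toy illustration, not from the course materials):

```python
import numpy as np

def pca_reduce(data, L):
    """Map data to the L-dimensional space of its first L principal components."""
    centered = data - data.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:L]              # step 2: first L principal components
    return centered @ components.T   # step 3: project onto them

rng = np.random.default_rng(2)
data = rng.normal(size=(50, 10))
reduced = pca_reduce(data, L=3)      # step 1: choose L = 3
assert reduced.shape == (50, 3)
```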
PCA in PySpark (ml.feature module) creates a PCA template object that specifies:
• The number of principal components (k) to form the reduced-dimension space.
• The input column for the original features to be used to find principal components.
• The output column for the transformed features in the reduced feature space (formed by the principal components).
How to find the cluster center in one original dimension for all of the clusters formed in the PCA-reduced dimensions?
Filter for one binary feature (i.e., all scanners that scan a specific top port), then use groupBy on "pca_prediction" (i.e., on clusters formed in the 35 PCA-reduced dimensions), followed by count().
How to compare clustering results with and without PCA?
• Compare the Silhouette Score of the approaches. Which one is better?
• Compare the Mirai_ratio of the clusters formed.
• Compare the cluster centers formed.
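A sketch of the Silhouette Score comparison, using scikit-learn in place of PySpark's ClusteringEvaluator for brevity (synthetic blob data; everything here is illustrative, not the course dataset):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(3)
# Two well-separated blobs embedded in 20 dimensions.
blob = lambda center: center + rng.normal(scale=0.5, size=(100, 20))
data = np.vstack([blob(0.0), blob(5.0)])

# Cluster without PCA ...
labels_raw = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(data)
score_raw = silhouette_score(data, labels_raw)

# ... and with PCA-reduced features.
reduced = PCA(n_components=2).fit_transform(data)
labels_pca = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(reduced)
score_pca = silhouette_score(reduced, labels_pca)
```

Note that the two scores are computed in different feature spaces, so small differences should be read with care; large gaps are the meaningful signal.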
Pre-processing for PCA
The input data needs to be normalized before feeding it to PCA. Otherwise, the principal components identified can be influenced by the scale of the different dimensions.
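A NumPy sketch of why this matters (synthetic data, illustrative helper): inflate the scale of one feature and the first principal component locks onto that feature; standardizing each feature to zero mean and unit variance removes the effect.

```python
import numpy as np

def first_pc(data):
    centered = data - data.mean(axis=0)
    return np.linalg.svd(centered, full_matrices=False)[2][0]

rng = np.random.default_rng(4)
data = rng.normal(size=(200, 2))

# Blow up the scale of the first feature: without normalization the first
# principal component points almost entirely along that feature.
scaled = data * np.array([1000.0, 1.0])
assert abs(first_pc(scaled)[0]) > 0.99

# Standardizing each feature restores a scale-free result: both features
# now contribute comparably to the first principal component.
standardized = (scaled - scaled.mean(axis=0)) / scaled.std(axis=0)
assert abs(first_pc(standardized)[1]) > 0.5
```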