What is the dimensionality of a data analytics problem?
It is the dimensionality of the final feature representation of a dataset for machine learning modeling.
What is the challenge of high dimensionality of feature space for machine learning?
A high-dimensional feature space significantly increases the "search space" for finding a machine learning model. This is often referred to as "the curse of dimensionality".
Techniques that convert a high-dimensional feature space to a low-dimensional feature representation:
PCA, Word2Vec, AutoEncoder
First Principal Component:
A vector/line that maximizes the variance of the data projected onto the line.
Principal Component Analysis:
Find a set of orthogonal vectors that best capture the variance of the data.
What does the first principal component mean?
It is a vector/line that best characterizes the variations of the data.
If we are limited to using only one dimension to represent the data, the First Principal Component is the best choice.
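The "best one-dimensional choice" claim can be checked numerically. A minimal NumPy sketch on synthetic data (all names here are illustrative): project the centered data onto the first principal component and onto many random unit directions, and confirm that no random direction retains more variance.

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated 2-D data: most variation lies along one direction.
x = rng.normal(size=500)
data = np.column_stack([x, 0.5 * x + 0.1 * rng.normal(size=500)])
centered = data - data.mean(axis=0)

# First principal component = top right-singular vector of the centered data.
_, _, vt = np.linalg.svd(centered, full_matrices=False)
pc1 = vt[0]

var_pc1 = np.var(centered @ pc1)
# Compare against 200 random unit directions: none should beat the first PC.
best_random = max(
    np.var(centered @ (v / np.linalg.norm(v)))
    for v in rng.normal(size=(200, 2))
)
assert var_pc1 >= best_random
```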
Second Principal Component:
A vector/line that is orthogonal to the first principal component AND maximizes the variance of the data when they are projected onto the line.
What does the second principal component mean?
It is a vector/line that is orthogonal to the first principal component, and that best characterizes the remaining variations of the data (beyond those captured by the first principal component).
If we are limited to using only two dimensions to represent the data, the First Principal Component and the Second Principal Component are the best choice.
Principal Components are computed using…
Singular Value Decomposition (SVD)
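A minimal NumPy sketch of that relationship, on synthetic data: the rows of Vᵀ from the SVD of the centered data matrix are the principal components, and they come out orthonormal and ordered by the variance they capture.

```python
import numpy as np

rng = np.random.default_rng(1)
# Mix 4 independent signals so the observed features are correlated.
data = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))
centered = data - data.mean(axis=0)

# SVD of the centered data: the rows of vt are the principal components.
u, s, vt = np.linalg.svd(centered, full_matrices=False)

# The components are orthonormal ...
assert np.allclose(vt @ vt.T, np.eye(4), atol=1e-8)
# ... and the singular values are sorted, largest captured variance first.
assert np.all(np.diff(s) <= 0)
# Variance captured by each component:
explained = s**2 / (len(centered) - 1)
```

Summing `explained` recovers the total variance of the data, which is why dropping the trailing components loses the least information.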
Using PCA for Dimension Reduction
• Decide the size of the lower dimension (we will denote it as L) to which we want to map the data.
• Use the input data to find the first L Principal Components.
• Map the input data to the transformed (lower-dimension) space formed by the L principal components.
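The three steps above can be sketched in NumPy (the `pca_reduce` helper is my own toy illustration, not from the course materials):

```python
import numpy as np

def pca_reduce(data, L):
    """Map data to the L-dimensional space of its first L principal components."""
    centered = data - data.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:L]              # step 2: first L principal components
    return centered @ components.T   # step 3: project onto them

rng = np.random.default_rng(2)
data = rng.normal(size=(50, 10))
reduced = pca_reduce(data, L=3)      # step 1: choose L = 3
assert reduced.shape == (50, 3)
```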
PCA in PySpark (ml.feature module) creates a PCA template object that specifies:
• The number of principal components (k) to form the reduced-dimension space.
• The input column for the original features to be used to find principal components.
• The output column for the transformed features in the reduced feature space (formed by the principal components).
How to find the cluster center in one original dimension for all of the clusters formed in the PCA-reduced dimensions?
Filter for one binary feature (i.e., all scanners that scan a specific top port), then use groupBy on "pca_prediction" (i.e., on clusters formed in the 35 PCA-reduced dimensions), followed by count().
How to compare clustering results with and without PCA?
• Compare the Silhouette Score of the approaches. Which one is better?
• Compare the Mirai_ratio of the clusters formed.
• Compare the cluster centers formed.
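A sketch of the Silhouette Score comparison, using scikit-learn in place of PySpark's ClusteringEvaluator for brevity (synthetic blob data; everything here is illustrative, not the course dataset):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(3)
# Two well-separated blobs embedded in 20 dimensions.
blob = lambda center: center + rng.normal(scale=0.5, size=(100, 20))
data = np.vstack([blob(0.0), blob(5.0)])

# Cluster without PCA ...
labels_raw = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(data)
score_raw = silhouette_score(data, labels_raw)

# ... and with PCA-reduced features.
reduced = PCA(n_components=2).fit_transform(data)
labels_pca = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(reduced)
score_pca = silhouette_score(reduced, labels_pca)
```

Note that the two scores are computed in different feature spaces, so small differences should be read with care; large gaps are the meaningful signal.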
Pre-processing for PCA
The input data needs to be normalized before feeding it to PCA. Otherwise, the principal components identified can be influenced by the scale of the different dimensions.
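A NumPy sketch of why this matters (synthetic data, illustrative helper): inflate the scale of one feature and the first principal component locks onto that feature; standardizing each feature to zero mean and unit variance removes the effect.

```python
import numpy as np

def first_pc(data):
    centered = data - data.mean(axis=0)
    return np.linalg.svd(centered, full_matrices=False)[2][0]

rng = np.random.default_rng(4)
data = rng.normal(size=(200, 2))

# Blow up the scale of the first feature: without normalization the first
# principal component points almost entirely along that feature.
scaled = data * np.array([1000.0, 1.0])
assert abs(first_pc(scaled)[0]) > 0.99

# Standardizing each feature restores a scale-free result: both features
# now contribute comparably to the first principal component.
standardized = (scaled - scaled.mean(axis=0)) / scaled.std(axis=0)
assert abs(first_pc(standardized)[1]) > 0.5
```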