6 - Unsupervised Learning

65 Terms

1

Uses of unsupervised learning techniques

Exploratory data analysis, especially on high-dimensional datasets

Feature generation

2

PCA

Transforms a high-dimensional dataset into a smaller, much more manageable set of representative (“principal”) variables that are easier to explore and visualize
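
A minimal sketch of this in code, assuming scikit-learn is available (the data matrix X here is made up):

    import numpy as np
    from sklearn.decomposition import PCA

    # Hypothetical data: 100 observations on 10 correlated features
    rng = np.random.default_rng(0)
    latent = rng.normal(size=(100, 2))
    X = latent @ rng.normal(size=(2, 10)) + 0.1 * rng.normal(size=(100, 10))

    # Compress to 2 representative ("principal") variables
    pca = PCA(n_components=2)
    scores = pca.fit_transform(X)   # PC scores, shape (100, 2), easy to plot and explore
    print(pca.explained_variance_ratio_)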

3

PCs are…

linear combinations of the existing variables that capture most of the information in the original dataset

4

PCA is especially useful for

highly correlated data, for which a few PCs are enough to represent most of the information in the full dataset

5

PCA score formula

$Z_m = \phi_{1m} X_1 + \phi_{2m} X_2 + \cdots + \phi_{pm} X_p$, where the $\phi_{jm}$ are the loadings of the $m$th PC
6

Choose PC loadings to…

capture as much information in the original dataset as possible

7

Computational goal of PCA

to maximize the sample variance of $Z_1$, the first PC, subject to the normalization constraint on its loadings

8

Sample variance formula

$\mathrm{Var}(Z_1) = \frac{1}{n} \sum_{i=1}^{n} z_{i1}^2$ for centered variables, maximized subject to $\sum_{j=1}^{p} \phi_{j1}^2 = 1$
9

Orthogonality constraints

$\sum_{j=1}^{p} \phi_{jm} \phi_{jm'} = 0$ for all $m \neq m'$, i.e., the loading vectors of distinct PCs are mutually perpendicular
10

Why are orthogonality constraints needed

so the PCs measure different aspects of the variables in the dataset

11

Geometric interpretation of the first PC

It defines the line that is as close as possible to the observations, i.e., the line that minimizes the sum of the squared perpendicular distances between each data point and the line

12

First and second PCs are…

Mutually perpendicular

13

Scores formula with notation

$z_{im} = \phi_{1m} x_{i1} + \phi_{2m} x_{i2} + \cdots + \phi_{pm} x_{ip}$, where $z_{im}$ is the score of the $i$th observation on the $m$th PC, $\phi_{jm}$ is the loading of the $j$th variable on the $m$th PC, and $x_{ij}$ is the centered value of the $j$th variable for the $i$th observation
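
In matrix form, the scores are just the centered data matrix times the loadings. A small numpy check of this, assuming scikit-learn (whose components_ attribute holds the loading vectors row-wise):

    import numpy as np
    from sklearn.decomposition import PCA

    X = np.random.default_rng(1).normal(size=(50, 4))
    pca = PCA().fit(X)

    Xc = X - X.mean(axis=0)        # center each variable
    Z = Xc @ pca.components_.T     # z_im = sum_j phi_jm * x_ij
    assert np.allclose(Z, pca.transform(X))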
14

Total variance for a feature

$\mathrm{Var}(X_j) = \frac{1}{n} \sum_{i=1}^{n} x_{ij}^2$ for a centered feature $X_j$; summing over all $p$ features gives the total variance $\sum_{j=1}^{p} \frac{1}{n} \sum_{i=1}^{n} x_{ij}^2$
15

Variance explained by the mth PC

$\frac{1}{n} \sum_{i=1}^{n} z_{im}^2$
16

PVE =

$\mathrm{PVE}_m = \dfrac{\sum_{i=1}^{n} z_{im}^2}{\sum_{j=1}^{p} \sum_{i=1}^{n} x_{ij}^2}$, the variance explained by the $m$th PC divided by the total variance
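
A numpy sketch of this formula, checked against scikit-learn's explained_variance_ratio_ (made-up data):

    import numpy as np
    from sklearn.decomposition import PCA

    X = np.random.default_rng(2).normal(size=(60, 5))
    pca = PCA().fit(X)
    Z = pca.transform(X)                           # PC scores z_im

    Xc = X - X.mean(axis=0)
    pve = (Z ** 2).sum(axis=0) / (Xc ** 2).sum()   # one PVE per PC
    assert np.allclose(pve, pca.explained_variance_ratio_)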
17

Does Centering have an effect on PC loadings? Why?

Centering has no effect on PC loadings because the variance of each variable remains unchanged upon centering, and variance maximization is how PC loadings are defined

18

Does scaling have an effect on PC loadings? Why?

Yes. If variables are of vastly different orders of magnitude, then variables with an unusually large variance on their scale will receive large PC loadings and dominate the corresponding PCs
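
A sketch of the usual remedy, standardizing the variables before PCA (assumes scikit-learn; the 1000x scale difference is made up):

    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(3)
    X = np.column_stack([rng.normal(size=200),          # small-scale variable
                         1000 * rng.normal(size=200)])  # large-scale variable

    print(PCA().fit(X).components_[0])      # PC1 dominated by the large-scale column
    Xs = StandardScaler().fit_transform(X)  # rescale to mean 0, variance 1
    print(PCA().fit(Xs).components_[0])     # loadings now comparable in size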

19

Drawbacks of PCA

  • Interpretability: The PCs can be hard to make sense of because they are complicated linear combinations of the original features

  • Not good for non-linear relationships

    • PCA uses linear transformations to summarize and visualize high-dimensional datasets in which the variables are highly linearly correlated

  • No feature selection: All variables enter the components, so no operational efficiency is gained

  • Target variable is ignored: PCA assumes the directions in which the features exhibit the most variation are also the directions most associated with the target variable, but there is no guarantee this is true

20

Scree plot

A plot of the PVEs against the PC index, used to decide how many PCs to retain

21

Are PC loadings unique

Yes, up to a sign flip

22

Biplot shows what?

the PC scores of the observations and the PC loadings of the variables on a single display

23

How does having categorical variables with high dimensionality hurt a data set?

It leads to sparse factor levels (those with very few observations), which dilute the predictive power of the model

24

Suggest 2 ways to transform categorical variables with high dimensionality to retain them

Combine categories into smaller groups

Binarize the factor variables, run a PCA on each set of dummy variables, and use the first few PCs to summarize most of the information

25

Total SS =

within-cluster SS + between-cluster SS
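
A quick numerical check of this identity, using made-up 2-D data and K-means labels (assumes scikit-learn):

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.random.default_rng(4).normal(size=(100, 2))
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

    grand = X.mean(axis=0)
    total_ss = ((X - grand) ** 2).sum()
    within_ss = sum(((X[labels == k] - X[labels == k].mean(axis=0)) ** 2).sum()
                    for k in range(3))
    between_ss = sum((labels == k).sum()
                     * ((X[labels == k].mean(axis=0) - grand) ** 2).sum()
                     for k in range(3))
    assert np.isclose(total_ss, within_ss + between_ss)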

26

Total SS is

the total variation of all the observations in the data without any clustering (essentially there is one large cluster containing all observations)

27

Between cluster SS

Can be thought of as the SS explained by the K clusters

28

Two idealistic goals of cluster analysis

  • Homogeneity: Want observations within each cluster to share characteristics while observations in different clusters are different from one another

  • Interpretability: Characteristics of the clusters are typically interpretable and meaningful within the context of the business problem

29

PCA/Clustering Similarities

  • Unsupervised Learning

  • Simplify the data using a small number of summaries

30

PCA/Clustering Difference

PCA finds a low-dimensional representation of the observations, whereas clustering finds homogeneous subgroups among them

31

K-means clustering algorithm

  • Randomly select K points in the feature space as the initial cluster centers

  • Assign each observation to the cluster whose center is closest in Euclidean distance

  • Recalculate the center of each cluster

  • Repeat the last two steps until the cluster assignments no longer change (see the sketch below)
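
A from-scratch numpy sketch of these steps (illustrative only; it ignores the empty-cluster edge case):

    import numpy as np

    def kmeans(X, k, iters=100, seed=0):
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]  # random initial centers
        for _ in range(iters):
            # assign each observation to its closest center (Euclidean distance)
            dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = dists.argmin(axis=1)
            # recalculate each center as the mean of its cluster
            new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            if np.allclose(new_centers, centers):               # stop when nothing changes
                break
            centers = new_centers
        return labels, centers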

32

Why do we run the K-means algorithm multiple times?

To mitigate the randomness associated with the initial cluster centers, increasing the chance of identifying a global optimum and obtaining more representative cluster groups

33

Does K-means clustering find a global optimum?

No, only a local optimum

34

Hierarchical Clustering

Series of fusions of observations

35

Hierarchical clustering algorithm

  • Start with every observation as its own cluster

  • Fuse the closest pair of clusters, one fusion at a time

  • Repeat until all observations are fused into a single cluster (see the sketch below)
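
A sketch of this algorithm via scipy (made-up data; "complete" is one of the linkage choices covered below):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    X = np.random.default_rng(5).normal(size=(20, 2))
    Z = linkage(X, method="complete")                # records the successive fusions
    labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 clusters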

36

Within-cluster variation vs. Euclidean distance

Within-cluster variation sums squared Euclidean distances (no square root), whereas Euclidean distance takes the square root of the sum of squared differences

37

Elbow method

Choose the number of clusters at the point where the proportion of variance explained levels off, i.e., the elbow of the graph (see the sketch below)
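
A sketch of the computation behind the plot, assuming scikit-learn (inertia_ is the within-cluster SS):

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.random.default_rng(6).normal(size=(150, 2))
    for k in range(1, 8):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        print(k, round(km.inertia_, 1))  # plot these; pick k where the drop levels off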

38

Linkage choices

Complete, single, average, centroid

39

Complete linkage

Maximal pairwise distance (the largest distance between an observation in one cluster and an observation in the other)
40

Single linkage

Minimal pairwise distance

41

Average linkage

Average of all pairwise distances

42

Centroid linkage

Distance between the two cluster centroids (or centers)

43

Most common linkage methods? Why?

Complete and average because they result in more balanced and visually appealing clusters

44

Dendrogram

An upside-down tree that shows the dissimilarity at each fusion

45

A lower cut of the dendrogram results in _____ clusters

More

46

Differences between K-means and hierarchical clustering

Randomization

Pre-specified number of clusters

Nested clusters

47

Which of K-means and hierarchical clustering needs randomization?

K means

48

Which of K-means and hierarchical clustering needs a pre-specified number of clusters?

K means

49

Which of K-means and hierarchical clustering produces nested clusters?

Hierarchical

50

Similarities between K-means and hierarchical clustering

  • Both unsupervised

  • Objective is to uncover homogeneous subgroups among the observations

  • Both are sensitive to scaling of variables

  • Both are sensitive to outliers

51

Solution for observations with largely different scales

correlation-based distance
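
One way to obtain such a distance, assuming scipy: pdist's "correlation" metric returns 1 minus the correlation between each pair of observations, and a linkage can consume the result directly:

    import numpy as np
    from scipy.spatial.distance import pdist
    from scipy.cluster.hierarchy import linkage

    X = np.random.default_rng(7).normal(size=(10, 5))
    d = pdist(X, metric="correlation")  # 1 - corr(obs_i, obs_j), condensed form
    Z = linkage(d, method="average")    # cluster on the precomputed distances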

52

Ways to generate features from cluster analysis

  • Cluster group assignments (added as a new factor variable)

  • Cluster centers can replace the original variables for interpretation and prediction purposes

53

Two impacts of the curse of dimensionality on clustering

Harder to visualize the data

The notion of closeness becomes fuzzier as the number of variables grows

54

Which linkage can result in inversions?

Centroid linkage

55

Considerations for choosing the number of clusters in hierarchical clustering

Balance of the resulting clusters

Height differences between successive fusions (a large gap suggests a natural cut)

56

Explain two reasons why unsupervised learning is often more challenging than supervised learning

Less clearly defined objectives

Less objective evaluation

57

Why are PC loading vectors unique up to a sign flip

The line of the PC extends in both directions and therefore gives rise to another valid PC loading vector

58

Explain how scaling the variables will affect the results of clustering

Without scaling, a variable on a much larger scale can dominate the distance calculations and exert a disproportionate impact on the cluster arrangements; scaling adjusts for that

59

Explain how principal components analysis can be used as a pre-processing step before applying clustering to a high-dimensional dataset

PCA can allow us to compress the data into 2 dimensions without losing much information and to visualize the cluster assignments in a two-dimensional scatterplot using the scores of the first two PCs
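
A minimal sketch of this pipeline (assumes scikit-learn; the data and cluster count are made up):

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans

    X = np.random.default_rng(8).normal(size=(200, 30))  # high-dimensional data
    scores = PCA(n_components=2).fit_transform(X)        # compress to 2 PCs
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(scores)
    # a scatterplot of scores[:, 0] vs scores[:, 1], colored by labels,
    # visualizes the cluster assignments in two dimensions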

60

Large variance for PC1 and small others imply what?

Strong correlation among the variables

61

What is the name of the plot for K-means clustering?

Elbow plot

62

Where to select the number of clusters on an elbow plot

Where the curve levels off

63

State the difference between dissimilarity and linkage

Dissimilarity measures the proximity of two observations in the data set, while linkage measures the proximity of two clusters of observations

64

Describe the steps to calculate the within cluster sum of squares using latitude and longitude

Calculate the centroid of each cluster

Calculate the squared Euclidean distance between each city and its cluster's centroid

Sum all the squared distances
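
A numpy sketch of these steps; the coordinates and cluster assignments are hypothetical:

    import numpy as np

    coords = np.array([[41.9, -87.6], [40.7, -74.0],
                       [34.1, -118.2], [37.8, -122.4]])  # (latitude, longitude)
    labels = np.array([0, 0, 1, 1])                      # hypothetical clusters

    wcss = 0.0
    for k in np.unique(labels):
        members = coords[labels == k]
        centroid = members.mean(axis=0)            # step 1: centroid of the cluster
        wcss += ((members - centroid) ** 2).sum()  # steps 2-3: sum squared distances
    print(wcss)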

65

What distance measure do K-means and hierarchical clustering use?

Euclidean