K-Means and Hierarchical Clustering

29 Terms

1
New cards
K-means
Iterative greedy descent algorithm that finds a sub-optimal solution to the problem of minimizing the total within-cluster dissimilarity, dissimilarity(C)
2
New cards
K-means iteratively alternates between the following two steps:
• Assignment step: For a given set of K cluster centroids, K-means assigns each example to the cluster with the closest centroid (i.e., fix the centroids and optimize the cluster assignments).

• Refitting step: Re-evaluate and update the cluster centroids, i.e., for a fixed cluster assignment, optimize the centroids.
3
New cards
K-Means Algorithm
1. Initialize the K cluster centroids (e.g., pick K examples at random).

2. Assignment step: assign each example to the cluster with the closest centroid.

3. Refitting step: recompute each centroid as the mean of the examples assigned to it.

4. Repeat steps 2 and 3 until the cluster assignments no longer change.
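Below is a minimal NumPy sketch of the alternating assignment/refitting loop described above. The function name `kmeans`, the random initialization, and the stopping rule based on centroid movement are illustrative assumptions, not taken from the original cards.

```python
import numpy as np

def kmeans(X, K, n_iters=100, rng=None):
    """Plain K-means: alternate assignment and refitting steps.

    X: (N, m) array of N examples with m features.
    Returns (centroids, labels).
    """
    rng = np.random.default_rng(rng)
    # Initialize centroids by picking K distinct examples at random.
    centroids = X[rng.choice(len(X), size=K, replace=False)]

    for _ in range(n_iters):
        # Assignment step: each example goes to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)

        # Refitting step: each centroid becomes the mean of its assigned examples.
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(K)
        ])

        # Stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    return centroids, labels
```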
4
New cards
Space Complexity of K-Means
Space requirement for K-means is modest because only data observations and centroids are stored

Storage complexity is of the order O((N + K)m), where m is the number of feature attributes
5
New cards
Time Complexity of K-Means
Time complexity of K-means: O(I·K·N·m), where I is the number of iterations required for convergence

Importantly, time complexity of K-means is linear in N.
6
New cards
Does the K-means algorithm always converge?
Yes. Each iteration can only decrease (or leave unchanged) the total within-cluster dissimilarity, and there are only finitely many possible cluster assignments, so the algorithm converges after a finite number of iterations.
7
New cards
Can it always find optimal clustering?
No. K-means is a greedy descent method, so it may converge to a local (sub-optimal) minimum of the dissimilarity; the result depends on the initial choice of centroids, which is why multiple restarts or K-means++ seeding are used.
8
New cards
How should we choose the number of clusters?
Run the K-means algorithm multiple times, starting from randomly chosen cluster centroids, and choose the cluster assignment that has the minimum dissimilarity. To pick the number of clusters K itself, use the elbow method.
9
New cards
K-means++
- Choose the first centroid at random.

- For each data point x, compute its distance dist(x) from the nearest centroid.

- Choose a data point x at random, with probability proportional to dist(x)^2, as the next centroid.

- Continue until K cluster centroids are obtained.

- Use the obtained K centroids as initial centroids for the K-means algorithm.
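A short NumPy sketch of this seeding procedure; the helper name `kmeans_pp_init` is a hypothetical choice for illustration.

```python
import numpy as np

def kmeans_pp_init(X, K, rng=None):
    """K-means++ seeding: spread the initial centroids out.

    X: (N, m) array of examples. Returns a (K, m) array of initial centroids.
    """
    rng = np.random.default_rng(rng)
    N = len(X)

    # Choose the first centroid uniformly at random.
    centroids = [X[rng.integers(N)]]

    while len(centroids) < K:
        # dist(x): distance from each point to its nearest chosen centroid.
        dists = np.min(
            np.linalg.norm(X[:, None, :] - np.array(centroids)[None, :, :], axis=2),
            axis=1,
        )
        # Pick the next centroid with probability proportional to dist(x)^2.
        probs = dists**2 / np.sum(dists**2)
        centroids.append(X[rng.choice(N, p=probs)])

    return np.array(centroids)
```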
10
New cards
Elbow method
• Apply the K-means algorithm multiple times with different numbers of clusters.

• Evaluate the quality of the obtained clustering structure in each run of the algorithm using the metric dissimilarity(C).

• As the number of clusters increases, dissimilarity(C) tends to decrease.

• Plot dissimilarity(C) as a function of the number of clusters K.

• The optimal K* lies at the elbow of the plot.
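One way to realize this procedure, assuming scikit-learn's KMeans is available and using its `inertia_` attribute (total within-cluster squared distance) as the dissimilarity(C) metric; the function name `elbow_curve` is illustrative.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def elbow_curve(X, k_max=10):
    """Plot dissimilarity(C) versus the number of clusters K."""
    ks = range(1, k_max + 1)
    dissimilarity = []
    for k in ks:
        # n_init=10 restarts each fit from multiple random centroid choices.
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        dissimilarity.append(km.inertia_)  # within-cluster sum of squared distances

    plt.plot(list(ks), dissimilarity, marker="o")
    plt.xlabel("Number of clusters K")
    plt.ylabel("dissimilarity(C)")
    plt.title("Elbow method: pick K at the bend of the curve")
    plt.show()
```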
11
New cards
hierarchical clustering
The user specifies a measure of similarity (or dissimilarity) between a pair of clusters. Hierarchical clustering creates a hierarchical decomposition of the set of examples using this user-specified criterion and produces a dendrogram.
12
New cards
Dendrogram
A tree diagram that records the sequence of merges (or splits) performed by hierarchical clustering; cutting the dendrogram at a chosen height yields a clustering of the examples.
13
New cards
Strategies for Hierarchical Clustering
Agglomerative Clustering and Divisive Clustering
14
New cards
Agglomerative Clustering
* Bottom-up approach
* Starts at the bottom with each cluster containing a single observation
* At each level up, recursively merge the pair of clusters with the smallest inter-cluster dissimilarity into a single cluster.
* A single cluster at the top level
15
New cards
Divisive Clustering
* Top-down approach
* Starts at the top with a single cluster of all observations
* At each level down, recursively split one of the existing clusters into two new clusters with the largest inter-cluster dissimilarity.
* At the bottom, each cluster contains a single observation
16
New cards
Agglomerative Clustering Algorithm
1. Start with all data points in their own clusters.

2. Repeat until only one cluster remains:

* Find the 2 clusters C1, C2 that are most similar (i.e., that have the smallest inter-cluster dissimilarity d(C1, C2))
* Merge C1 and C2 into one cluster

Output: a dendrogram

Relies on: an inter-cluster dissimilarity metric
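A hedged illustration using SciPy's agglomerative linkage and dendrogram utilities rather than a from-scratch implementation; the toy data points are invented for the example.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Toy data: a few 2-D points (hypothetical example values).
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9], [9.0, 0.5]])

# Agglomerative clustering: start with every point in its own cluster and
# repeatedly merge the pair of clusters with the smallest dissimilarity.
# `method` selects the inter-cluster dissimilarity: 'single', 'complete', 'average'.
Z = linkage(X, method="average")

# The sequence of merges is summarised as a dendrogram.
dendrogram(Z)
plt.xlabel("Example index")
plt.ylabel("Inter-cluster dissimilarity at merge")
plt.show()
```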
17
New cards
Measures of Inter-Cluster Dissimilarity
Single linkage, Complete linkage and Group average
18
New cards
Single linkage
Shortest distance from any member of one cluster to any member of the other cluster
19
New cards
Complete linkage
Largest distance from any member of one cluster to any member of the other cluster
20
New cards
Group average
Average of distances between members of the two clusters
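The three measures can be compared with a small sketch, assuming SciPy's `cdist` for the pairwise member-to-member distances; the function name and toy clusters are hypothetical.

```python
import numpy as np
from scipy.spatial.distance import cdist

def linkage_dissimilarities(C1, C2):
    """Single, complete, and group-average dissimilarity between two clusters.

    C1, C2: arrays of shape (n1, m) and (n2, m).
    """
    D = cdist(C1, C2)            # pairwise distances between members
    return {
        "single": D.min(),       # shortest member-to-member distance
        "complete": D.max(),     # largest member-to-member distance
        "average": D.mean(),     # average of all member-to-member distances
    }

# Example: two small 2-D clusters.
C1 = np.array([[0.0, 0.0], [0.0, 1.0]])
C2 = np.array([[3.0, 0.0], [4.0, 1.0]])
print(linkage_dissimilarities(C1, C2))
```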
21
New cards
Does the choice of inter-cluster dissimilarity measure matter?
• Yes!

• However, the different measures yield similar results when the (natural) clusters are compact and well-separated
22
New cards
Single linkage advantage and disadvantage
• Requires only a single pair of members to be close, so it can follow elongated and non-elliptical cluster shapes

• However, it is sensitive to noise and outliers and can suffer from chaining, where distinct clusters are merged through a few intermediate points
23
New cards
Complete linkage advantage and disadvantage
• Requires all examples in the two clusters to be relatively similar

• Produces compact clusters with small diameters

• Robust to outliers

• However, members can be closer to other clusters than they are to members of their own clusters
24
New cards
Group average linkage advantage and disadvantage
• Attempts to produce relatively compact clusters that are relatively far apart

• Depends on the numerical scale on which the distances are measured
25
New cards
Space complexity of Hierarchical Clustering
O(N^2), since the pairwise proximity (dissimilarity) matrix over the N observations must be stored.
26
New cards
Time complexity of Hierarchical Clustering
O(N^3) for the basic agglomerative algorithm (N - 1 merge steps, each scanning an O(N^2) proximity matrix); with more efficient data structures such as heaps this can be reduced to roughly O(N^2 log N).
27
New cards
Characteristics of Hierarchical Clustering
* Lack of a global objective function

• Need not solve a hard combinatorial optimization problem as in K-means

• No issues with local minima or choosing initial points

* Deterministic algorithm
* Merging decisions are final
* May impose a hierarchical structure on otherwise non-hierarchical data
28
New cards
Advantages of K-means
• Optimizes a global objective function

• Squared Euclidean distance based

• Non-deterministic
29
New cards
disadvantages of k-means
• Requires as input: number of clusters and an initial choice of centroids

• Convergence to local minima means multiple restarts are typically needed