Unsupervised learning
more challenging than supervised learning: with no response variable there is no defined goal and no way to check answers
more subjective: part of exploratory data analysis, techniques need to work in high dimensions
Clustering
Refers to techniques to find subgroups or clusters in the data
aim is to partition the observations into clusters so that observations in a cluster are similar but observations in different clusters are not
Need to specify what it means for observations to be similar/different
Measure the quality of clustering
Dissimilarity/Similarity metric:
expressed in terms of a distance function (different for different variable types, e.g. real-valued, boolean, categorical, ordinal, vector variables)
weights should be associated with different variables based on the application and data semantics
Quality of clustering
function to measure goodness. Hard to define “good enough” - subjective
Issues in clustering
many applications operate in a very high-dimensional space (almost all pairs of points are at about the same distance)
for a small number of dimensions and a small amount of data it's easy, but:
# clusters typically not known
exclusive vs non-exclusive clustering
clusters may be of arbitrary shapes and sizes
quality of clustering result
Clustering approaches (list)
K-means
Hierarchical
Convex
Gaussian mixture models
DBSCAN (density based spatial clustering of applications with noise) and variants
K-means algorithm
initialise C1, …, CK by randomly assigning each observation a number from 1 to K
Repeat until the cluster assignments don’t change:
compute the centroid for each cluster
assign each observation to the cluster whose centroid is closest in Euclidean distance
The algorithm finds a local minimum of the objective function
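A minimal sketch of this procedure in NumPy, assuming a data matrix X with observations in rows; the function name, K and seed are illustrative, and empty clusters are not handled.

```python
import numpy as np

def kmeans(X, K, seed=0, max_iter=100):
    """Sketch of the K-means loop above: random initial labels, then alternate
    centroid computation and reassignment until the assignments stop changing."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, K, size=len(X))   # randomly assign each observation a cluster
    for _ in range(max_iter):
        # centroid = mean vector of the observations currently in each cluster
        # (empty clusters are not handled in this sketch)
        centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # reassign each observation to the cluster whose centroid is closest in Euclidean distance
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):  # assignments unchanged: a local minimum
            break
        labels = new_labels
    return labels, centroids

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])  # illustrative data
labels, centroids = kmeans(X, K=2)
```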
Comments on K-means
have to predefine K: no guidance on how to choose K
K-means assumes roughly spherical clusters, which might not always be appropriate
Sensitive to initial seeds, local minima
Sensitive to outliers (an outlier can pull a centroid far away from the rest of its cluster)
Generalising the distance function is possible (K-medians)
Care needs to be taken in high dimensions; irrelevant features can conceal information about clusters, and the idea of distance also breaks down (the curse of dimensionality)
dimension reduction prior to clustering is a good idea
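A sketch of the "reduce dimension first" point with scikit-learn: PCA down to a few components, then K-means on the reduced data. The data, component count and K are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

X = np.random.randn(200, 50)                       # illustrative high-dimensional data
X_reduced = PCA(n_components=5).fit_transform(X)   # dimension reduction prior to clustering
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_reduced)
```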
Clustering Performance measures
Compactness: how tightly-packed a cluster is
clusters should be as compact as possible, so as to ensure that only the most related/similar instances have been grouped together
Separability: how well neighbouring clusters are separated in the feature space
Connectedness: instances that are close together should generally be allocated to the same cluster as they have similar characteristics
connectedness is generally measured per-instance rather than per-cluster; the most common approach is to find the mean distance from each instance to its n nearest neighbours
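A sketch of that per-instance connectedness measure using scikit-learn's nearest-neighbour search; the function name and neighbour count are illustrative.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def connectedness(X, n_neighbours=5):
    """Mean distance from each instance to its n nearest neighbours (lower = more connected)."""
    nn = NearestNeighbors(n_neighbors=n_neighbours + 1).fit(X)  # +1: each point is its own nearest neighbour
    dists, _ = nn.kneighbors(X)
    return dists[:, 1:].mean(axis=1)                            # drop the zero self-distance
```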
Hierarchical clustering
Aim is to get a hierarchy of clusters
we don't commit to the number of clusters beforehand; instead we use a tree-based representation of the observations (a dendrogram)
2 ways for hierarchical clustering
Agglomerative/Bottom-up clustering
start with the observations in n clusters (the leaves of the tree), then merge clusters (forming branches) until there is only 1 cluster (the trunk of the tree)
More efficient than divisive
Divisive/Top-down
start with all the observations in 1 cluster, then split clusters until we reach the leaves
Dendrogram
The dissimilarity between 2 observations is related to the vertical height at which they first get merged into the same cluster
Greater the height, greater the dissimilarity
Dissimilarity and Linkage
2 key components of hierarchical clustering are a dissimilarity measure and a linkage method
Dissimilarity measure quantifies how dissimilar a pair of observations are
Linkage tells us how to extend the dissimilarity measure to pairs of clusters
Dissimilarity measures
Euclidean distance
Squared Euclidean distance
Manhattan distance
Maximum distance
Correlation-based distance
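One way to compute each of the listed measures for a single pair of observations, using SciPy; the two vectors are illustrative.

```python
import numpy as np
from scipy.spatial import distance

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 1.0, 4.0, 3.0])

distance.euclidean(x, y)     # Euclidean distance
distance.sqeuclidean(x, y)   # squared Euclidean distance
distance.cityblock(x, y)     # Manhattan distance
distance.chebyshev(x, y)     # maximum (Chebyshev) distance
distance.correlation(x, y)   # correlation-based distance (1 - Pearson correlation)
```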
Linkage methods
Quantifies the dissimilarity between clusters A and B
Centroid: dissimilarity between the centroid for cluster A (a mean vector of length p) and the centroid for cluster B.
Complete: compute the maximum pairwise dissimilarity where one observation is in cluster A and the other is in cluster B
Single: compute the minimum pairwise dissimilarity where one observation is in cluster A and the other is in cluster B.
Average: compute the average pairwise dissimilarity where one observation is in cluster A and the other is in cluster B
Centroid linkage can result in undesirable inversions (a merge drawn at a height below that of an earlier merge)
Complete and average linkage produce more balanced dendrograms
Single linkage can produce trailing clusters in which single observations are merged one-at-a-time
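The four linkage methods written out for two clusters using Euclidean dissimilarity; the cluster sizes and data are illustrative.

```python
import numpy as np

A = np.random.randn(10, 3)      # observations in cluster A (illustrative)
B = np.random.randn(12, 3) + 2  # observations in cluster B

pairwise = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)  # all A-vs-B distances

centroid_link = np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))   # distance between the two centroids
complete_link = pairwise.max()                                    # maximum pairwise dissimilarity
single_link   = pairwise.min()                                    # minimum pairwise dissimilarity
average_link  = pairwise.mean()                                   # average pairwise dissimilarity
```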
Agglomerative clustering algorithm
treat each observation as its own cluster, n clusters
Compute all pairwise dissimilarities (e.g. Euclidean distance)
For i = n, n-1, …, 2:
find the pair of clusters that are least dissimilar and merge them (the dissimilarity indicates the height on the dendrogram at which the merge is drawn)
compute all pairwise dissimilarities between the remaining clusters
Note no random initialisation, so agglomerative clustering is a deterministic algorithm
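In practice the merge loop above is usually delegated to a library; a sketch with SciPy's hierarchy module, where the data, the linkage choice and the cut into 3 clusters are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.random.randn(30, 2)                               # illustrative data

Z = linkage(X, method="complete", metric="euclidean")    # also: "single", "average", "centroid"
dendrogram(Z)                                            # tree; merge height = dissimilarity at fusion
plt.show()
labels = fcluster(Z, t=3, criterion="maxclust")          # cut the dendrogram into 3 clusters
```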
Drawback of hierarchical clustering
The clustering obtained by cutting the dendrogram at a certain height is necessarily nested within the clustering obtained by cutting at a greater height, which may be unrealistic when the true groups are not nested
Convex clustering
A penalised method for clustering that has elements of K-means and hierarchical clustering
(lambda = 0: every observation is its own cluster)
(lambda = infinity: just 1 big cluster)
As lambda increases, centroids fuse and the associated observations are then regarded as being in the same cluster, giving a type of agglomerative clustering path
Weights are important: they control the sensitivity to the local density of the data
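The penalised objective itself is not spelled out above; a standard formulation (a sketch, with x_i the observations and u_i a centroid attached to each observation) is

$$\min_{u_1,\dots,u_n}\ \tfrac{1}{2}\sum_{i=1}^{n}\lVert x_i - u_i\rVert_2^2 \;+\; \lambda\sum_{i<j} w_{ij}\,\lVert u_i - u_j\rVert$$

Observations whose centroids fuse (u_i = u_j) are placed in the same cluster, which gives the lambda = 0 and lambda = infinity limits noted above.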