Clustering

17 Terms

1
New cards

Unsupervised learning

  • more challenging than supervised learning: with no response variable there is no obvious goal and no way to check answers

  • more subjective: it is part of exploratory data analysis, and techniques need to work in high dimensions

2
New cards

Clustering

  • Refers to techniques to find subgroups or clusters in the data

  • aim is to partition the observations into clusters so that observations within a cluster are similar to each other, while observations in different clusters are not

  • Need to specify what it means for observations to be similar/different

3
New cards

Measure the quality of clustering

Dissimilarity/Similarity metric:

  • expressed in terms of a distance function (different for different variables, e.g. real-value, boolean, categorical, ordinal, vector variables)

  • weights should be associated with different variables based on applications and data semantics

Quality of clustering

  • a function to measure goodness; it is hard to define “good enough”, so this is subjective
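
As an illustration of attaching weights to variables, here is a minimal sketch of a weighted distance function in Python; the function name, toy data, and weights are my own illustrative choices, not from the notes:

```python
import numpy as np

def weighted_euclidean(x, y, w):
    """Distance function with per-variable weights reflecting data semantics."""
    x, y, w = (np.asarray(a, dtype=float) for a in (x, y, w))
    return np.sqrt(np.sum(w * (x - y) ** 2))

# down-weight the second variable, e.g. because it is on a much larger scale
print(weighted_euclidean([1.0, 100.0], [2.0, 110.0], w=[1.0, 0.01]))
```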

4
New cards

Issues in clustering

  • many applications operate in a very high-dimensional space (almost all pairs of points are at about the same distance)

  • for a small number of dimensions and a small amount of data it is easy, but:

    • # clusters typically not known

    • exclusive vs non-exclusive clustering

    • clusters may be of arbitrary shapes and sizes

    • quality of clustering result

5
New cards

Clustering approaches (list)

  • K-means

  • Hierarchical

  • Convex

  • Gaussian mixture models

  • DBSCAN (Density-Based Spatial Clustering of Applications with Noise) and variants

6
New cards

K-means algorithm

  1. initialise clusters C1, …, CK by randomly assigning each observation a number from 1 to K

  2. Repeat until the cluster assignments don’t change:

    • compute the centroid for each cluster

    • assign each observation to the cluster whose centroid is closest in Euclidean distance

The algorithm finds a local minimum of the objective function
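
A minimal NumPy sketch of the algorithm above (the function name and toy data are my own choices, and empty clusters are deliberately left unhandled):

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Minimal K-means: random initial assignments, then alternate centroid
    computation and reassignment until the labels stop changing."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, K, size=X.shape[0])   # step 1: random cluster number for each observation
    for _ in range(max_iter):
        # step 2a: centroid (mean vector) of each cluster; empty clusters are not handled here
        centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # step 2b: reassign each observation to the cluster with the closest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):     # assignments unchanged: a local minimum has been reached
            break
        labels = new_labels
    return labels, centroids

# toy usage with two well-separated groups
X = np.vstack([np.random.default_rng(1).normal(0, 1, (20, 2)),
               np.random.default_rng(2).normal(5, 1, (20, 2))])
labels, centroids = kmeans(X, K=2)
print(labels)
```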

7
New cards

Comments on K-means

  • have to predefine K: no guidance on how to choose K

  • K-means is based on spherical clusters, which might not always be appropriate

  • Sensitive to initial seeds, local minima

  • Sensitive to outliers (may put centroid far away)

  • Generalising the distance function is possible (K-medians)

  • Care needs to be taken in high dimensions: irrelevant features can conceal information about clusters, and the idea of distance itself breaks down (the curse of dimensionality)

    • dimension reduction prior to clustering is a good idea
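
As a sketch of that last point, here is a hypothetical scikit-learn pipeline that reduces dimension with PCA before running K-means; the data, number of components, and number of clusters are illustrative choices only:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))        # 200 observations in 50 dimensions (illustrative data)

X_reduced = PCA(n_components=5).fit_transform(X)                     # dimension reduction first
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_reduced)
print(labels[:10])
```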

8
New cards

Clustering Performance measures

  • Compactness: how tightly-packed a cluster is

    • clusters should be as compact as possible, so as to ensure that only the most related/similar instances have been grouped together

  • Separability: how well neighbouring clusters are separated in the feature space

  • Connectedness: instances that are close together should generally be allocated to the same cluster as they have similar characteristics

    • connectedness is generally measured per-instance rather than per-cluster; the most common approach is to find the mean distance from each instance to its n nearest neighbours (see the sketch below)
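
A minimal sketch of that connectedness measure, assuming scikit-learn is available; the toy data and the choice of n are illustrative:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                          # illustrative data

n = 5                                                  # number of neighbours (illustrative choice)
nn = NearestNeighbors(n_neighbors=n + 1).fit(X)        # +1 because each point is its own nearest neighbour
dists, _ = nn.kneighbors(X)
connectedness = dists[:, 1:].mean(axis=1)              # per-instance mean distance to its n nearest neighbours
print(connectedness[:5])
```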

9
New cards

Hierarchical clustering

Aim is to get a hierarchy of clusters

We don’t commit to the number of clusters beforehand; instead we use a tree-based representation of the observations (a dendrogram)

10
New cards

2 ways for hierarchical clustering

Agglomerative/Bottom-up clustering

  • start with the observations in n clusters (the leaves of the tree), then merge clusters (forming branches) until there is only 1 cluster, the trunk of the tree

  • More efficient than divisive

Divisive/Top-down

  • start with all the observations in 1 cluster, then split clusters until we reach the leaves

11
New cards

Dendrogram

  • The dissimilarity between 2 observations is related to the vertical height at which they first get merged into the same cluster

  • Greater the height, greater the dissimilarity

12
New cards

Dissimilarity and Linkage

  • 2 key components of hierarchical clustering are a dissimilarity measure and a linkage method

  • Dissimilarity measure quantifies how dissimilar a pair of observations are

  • Linkage tells us how to extend the dissimilarity measure to pairs of clusters

13
New cards

Dissimilarity measures

  • Euclidean distance

  • Squared Euclidean distance

  • Manhattan distance

  • Maximum distance

  • Correlation-based distance
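
A quick sketch of these measures for a single pair of observations; the toy vectors are illustrative, and correlation-based distance is taken here as one minus the Pearson correlation:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 0.0, 5.0])

euclidean = np.sqrt(np.sum((x - y) ** 2))
squared_euclidean = np.sum((x - y) ** 2)
manhattan = np.sum(np.abs(x - y))
maximum = np.max(np.abs(x - y))
correlation_based = 1.0 - np.corrcoef(x, y)[0, 1]   # one minus the Pearson correlation of the two profiles

print(euclidean, squared_euclidean, manhattan, maximum, correlation_based)
```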

14
New cards

Linkage methods

A linkage quantifies the dissimilarity between two clusters A and B:

Centroid: dissimilarity between the centroid for cluster A (a mean vector of length p) and the centroid for cluster B.

Complete: compute the maximum pairwise dissimilarity where one observation is in cluster A and the other is in cluster B

Single: compute the minimum pairwise dissimilarity where one observation is in cluster A and the other is in cluster B.

Average: compute the average pairwise dissimilarity where one observation is in cluster A and the other is in cluster B

Centroid linkage can result in undesirable inversions

Complete and average linkage produce more balanced dendrograms

Single linkage can produce trailing clusters in which single observations are merged one-at-a-time
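
A small sketch comparing these linkage methods for two toy clusters, using Euclidean distance as the pairwise dissimilarity (the data are illustrative):

```python
import numpy as np

A = np.array([[0.0, 0.0], [0.0, 1.0]])          # cluster A (toy data)
B = np.array([[3.0, 0.0], [4.0, 1.0]])          # cluster B (toy data)

pairwise = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)   # all A-to-B dissimilarities

complete = pairwise.max()                                    # maximum pairwise dissimilarity
single = pairwise.min()                                      # minimum pairwise dissimilarity
average = pairwise.mean()                                    # average pairwise dissimilarity
centroid = np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))   # distance between the two centroids

print(complete, single, average, centroid)
```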

15
New cards

Agglomerative clustering algorithm

  1. treat each observation as its own cluster, n clusters

  2. Compute all pairwise dissimilarities (e.g. euclidean distance)

  3. For i = n, n-1, …, 2:

    • find the pair of clusters that is least dissimilar and merge them (the dissimilarity indicates the height on the dendrogram at which the merge is shown)

    • compute all pairwise dissimilarities between the remaining clusters

Note that there is no random initialisation, so agglomerative clustering is a deterministic algorithm
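
In practice this algorithm is usually run via a library. A sketch with scipy, where the linkage method and the cut heights are illustrative choices; cutting at two heights also shows the nesting discussed in the next card:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))                            # illustrative data

Z = linkage(X, method="complete", metric="euclidean")   # sequence of merges, least dissimilar first
labels_low = fcluster(Z, t=1.5, criterion="distance")   # cut the dendrogram at height 1.5
labels_high = fcluster(Z, t=3.0, criterion="distance")  # cutting higher gives fewer, nested clusters
print(labels_low)
print(labels_high)
```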

16
New cards

Drawback of hierarchical clustering

The clustering obtained by cutting the dendrogram at a certain height is necessarily nested within the clustering obtained by cutting at any greater height, so the clusters are forced to be hierarchical, which may not reflect the true structure of the data

17
New cards

Convex clustering

A penalised method for clustering that has elements of K-means and hierarchical clustering

(lambda = 0: every observation is its own cluster)

(lambda = infinity: just 1 big cluster)

As lambda increases, centroids fuse and the associated observations are then thought of as being in the same cluster, giving a type of agglomerative clustering

The weights are important: they control the sensitivity to the local density of the data
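
For reference, a common way to write the convex clustering objective (my notation, not from the notes), where u_i is the centroid assigned to observation x_i and w_ij are the weights mentioned above:

\[
\min_{u_1,\dots,u_n} \ \frac{1}{2}\sum_{i=1}^{n} \lVert x_i - u_i \rVert_2^2 \;+\; \lambda \sum_{i<j} w_{ij}\, \lVert u_i - u_j \rVert_2
\]

At lambda = 0 the optimum is u_i = x_i, and as lambda grows the penalty forces centroids to fuse, matching the behaviour described above.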