Unsupervised learning
more challenging than supervised learning: with no response variable there is no defined goal and no way to check answers
more subjective: part of exploratory data analysis, techniques need to work in high dimensions
Clustering
Refers to techniques to find subgroups or clusters in the data
aim is to partition the observations into clusters so that observations in a cluster are similar but observations in different clusters are not
Need to specify what it means for observations to be similar/different
Measure the quality of clustering
Dissimilarity/Similarity metric:
expressed in terms of a distance function (different for different variable types, e.g. real-valued, boolean, categorical, ordinal, vector variables)
weights should be associated with different variables based on the application and data semantics
Quality of clustering
function to measure goodness. Hard to define “good enough” - subjective
Issues in clustering
many applications operate in a very high-dimensional space (almost all pairs of points are at about the same distance)
for a small number of dimensions and a small amount of data it's easy, but:
# clusters typically not known
exclusive vs non-exclusive clustering
clusters may be of arbitrary shapes and sizes
quality of clustering result
Clustering approaches (list)
K-means
Hierarchical
Convex
Gaussian mixture models
DBSCAN (density based spatial clustering of applications with noise) and variants
K-means algorithm
initialise C1, …, CK by randomly assigning each observation a number from 1 to K
Repeat until the cluster assignments don’t change:
compute the centroid for each cluster
assign each observation to the cluster whose centroid is closest in Euclidean distance
The algorithm finds a local minimum of the objective function
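A minimal sketch of this procedure in NumPy, assuming a data matrix X with observations in rows; the function name, K and seed are illustrative, and empty clusters are not handled.

```python
import numpy as np

def kmeans(X, K, seed=0, max_iter=100):
    """Sketch of the K-means loop above: random initial labels, then alternate
    centroid computation and reassignment until the assignments stop changing."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, K, size=len(X))   # randomly assign each observation a cluster
    for _ in range(max_iter):
        # centroid = mean vector of the observations currently in each cluster
        # (empty clusters are not handled in this sketch)
        centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # reassign each observation to the cluster whose centroid is closest in Euclidean distance
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):  # assignments unchanged: a local minimum
            break
        labels = new_labels
    return labels, centroids

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])  # illustrative data
labels, centroids = kmeans(X, K=2)
```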
Comments on K-means
have to predefine K: no guidance on how to choose K
K-means assumes roughly spherical clusters, which might not always be appropriate
Sensitive to initial seeds, local minima
Sensitive to outliers (an outlier can pull a centroid far away from the rest of its cluster)
Generalising the distance function is possible (K-medians)
Care needs to be taken in high dimensions; irrelevant features can conceal information about clusters, and the idea of distance also breaks down (the curse of dimensionality)
dimension reduction prior to clustering is a good idea
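A sketch of the "reduce dimension first" point with scikit-learn: PCA down to a few components, then K-means on the reduced data. The data, component count and K are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

X = np.random.randn(200, 50)                       # illustrative high-dimensional data
X_reduced = PCA(n_components=5).fit_transform(X)   # dimension reduction prior to clustering
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_reduced)
```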
Clustering Performance measures
Compactness: how tightly-packed a cluster is
clusters should be as compact as possible, so as to ensure that only the most related/similar instances have been grouped together
Separability: how well neighbouring clusters are separated in the feature space
Connectedness: instances that are close together should generally be allocated to the same cluster as they have similar characteristics
connectedness is generally measured per-instance rather than per-cluster; the most common approach is to find the mean distance from each instance to its n nearest neighbours
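A sketch of that per-instance connectedness measure using scikit-learn's nearest-neighbour search; the function name and neighbour count are illustrative.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def connectedness(X, n_neighbours=5):
    """Mean distance from each instance to its n nearest neighbours (lower = more connected)."""
    nn = NearestNeighbors(n_neighbors=n_neighbours + 1).fit(X)  # +1: each point is its own nearest neighbour
    dists, _ = nn.kneighbors(X)
    return dists[:, 1:].mean(axis=1)                            # drop the zero self-distance
```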
Hierarchical clustering
Aim is to get a hierarchy of clusters
we don't commit to the number of clusters beforehand; instead we use a tree-based representation of the observations (a dendrogram)
2 ways for hierarchical clustering
Agglomerative/Bottom-up clustering
start with the observations in n clusters (the leaves of the tree), then merge clusters (forming branches) until there is only 1 cluster (the trunk of the tree)
More efficient than divisive
Divisive/Top-down
start with all the observations in 1 cluster, then split clusters until we reach the leaves
Dendrogram
The dissimilarity between 2 observations is related to the vertical height at which they first get merged into the same cluster
Greater the height, greater the dissimilarity
Dissimilarity and Linkage
2 key components of hierarchical clustering are a dissimilarity measure and a linkage method
Dissimilarity measure quantifies how dissimilar a pair of observations are
Linkage tells us how to extend the dissimilarity measure to pairs of clusters
Dissimilarity measures
Euclidean distance
Squared Euclidean distance
Manhattan distance
Maximum distance
Correlation-based distance
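One way to compute each of the listed measures for a single pair of observations, using SciPy; the two vectors are illustrative.

```python
import numpy as np
from scipy.spatial import distance

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 1.0, 4.0, 3.0])

distance.euclidean(x, y)     # Euclidean distance
distance.sqeuclidean(x, y)   # squared Euclidean distance
distance.cityblock(x, y)     # Manhattan distance
distance.chebyshev(x, y)     # maximum (Chebyshev) distance
distance.correlation(x, y)   # correlation-based distance (1 - Pearson correlation)
```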
Linkage methods
Quantifies the dissimilarity between clusters A and B
Centroid: dissimilarity between the centroid for cluster A (a mean vector of length p) and the centroid for cluster B.
Complete: compute the maximum pairwise dissimilarity where one observation is in cluster A and the other is in cluster B
Single: compute the minimum pairwise dissimilarity where one observation is in cluster A and the other is in cluster B.
Average: compute the average pairwise dissimilarity where one observation is in cluster A and the other is in cluster B
Centroid linkage can result in undesirable inversions (a merge drawn at a height below that of an earlier merge)
Complete and average linkage produce more balanced dendrograms
Single linkage can produce trailing clusters in which single observations are merged one-at-a-time
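The four linkage methods written out for two clusters using Euclidean dissimilarity; the cluster sizes and data are illustrative.

```python
import numpy as np

A = np.random.randn(10, 3)      # observations in cluster A (illustrative)
B = np.random.randn(12, 3) + 2  # observations in cluster B

pairwise = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)  # all A-vs-B distances

centroid_link = np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))   # distance between the two centroids
complete_link = pairwise.max()                                    # maximum pairwise dissimilarity
single_link   = pairwise.min()                                    # minimum pairwise dissimilarity
average_link  = pairwise.mean()                                   # average pairwise dissimilarity
```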
Agglomerative clustering algorithm
treat each observation as its own cluster, n clusters
Compute all pairwise dissimilarities (e.g. Euclidean distance)
For i = n, n-1, …, 2:
find the pair of clusters that are least dissimilar and merge them (the dissimilarity indicates the height on the dendrogram at which the merge is drawn)
compute all pairwise dissimilarities between the remaining clusters
Note no random initialisation, so agglomerative clustering is a deterministic algorithm
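In practice the merge loop above is usually delegated to a library; a sketch with SciPy's hierarchy module, where the data, the linkage choice and the cut into 3 clusters are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.random.randn(30, 2)                               # illustrative data

Z = linkage(X, method="complete", metric="euclidean")    # also: "single", "average", "centroid"
dendrogram(Z)                                            # tree; merge height = dissimilarity at fusion
plt.show()
labels = fcluster(Z, t=3, criterion="maxclust")          # cut the dendrogram into 3 clusters
```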
Drawback of hierarchical clustering
The clustering obtained by cutting the dendrogram at a certain height is necessarily nested within the clustering obtained by cutting at a greater height, which may be unrealistic when the true groups are not nested
Convex clustering
A penalised method for clustering that has elements of K-means and hierarchical clustering
(lambda = 0: every observation is its own cluster)
(lambda = infinity: just 1 big cluster)
As lambda increases, centroids fuse and the associated observations are then regarded as being in the same cluster, giving a type of agglomerative clustering path
Weights are important: they control the sensitivity to the local density of the data
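The penalised objective itself is not spelled out above; a standard formulation (a sketch, with x_i the observations and u_i a centroid attached to each observation) is

$$\min_{u_1,\dots,u_n}\ \tfrac{1}{2}\sum_{i=1}^{n}\lVert x_i - u_i\rVert_2^2 \;+\; \lambda\sum_{i<j} w_{ij}\,\lVert u_i - u_j\rVert$$

Observations whose centroids fuse (u_i = u_j) are placed in the same cluster, which gives the lambda = 0 and lambda = infinity limits noted above.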