Flashcards based on Stony Brook University lecture slides by Dr. Steven Skiena on Clustering.
Unsupervised Learning
Methods that find structure in data by providing labels (clusters) or values (rankings) without a trusted standard.
Clustering
The problem of grouping points by similarity, often revealing underlying sources or explanations.
Similarity in Clustering
Defined by some underlying distance function/metric.
Natural Clusters
Compact, roughly circular groups of points.
Clustering Gene Expression Data
Groups genes active in the same phases of the cell cycle.
Biological Clustering
Associated with dendrograms or phylogenetic trees.
Why use Clustering?
To determine how many distinct populations are in your data, build separate predictive models for each cluster, replace each cluster by its centroid, and detect outliers by distance from cluster centers.
K-Means Clustering
Pick k points as centers, assign each example to its nearest center, recompute each center from its assigned examples, and repeat until the assignments are stable.
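The loop described on this card can be sketched in a few lines of plain Python. This is a minimal illustration, not the lecture's code: the names `kmeans`, `max_iter`, and `seed` are hypothetical, and it assumes the initial centers are sampled at random from the input points and that each center is recomputed as the mean of its members.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Lloyd-style k-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumption: start from k input points
    assignment = None
    for _ in range(max_iter):
        # Assign each point to the index of its nearest center.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # assignments are stable, so stop
        assignment = new_assignment
        # Recompute each center as the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster empties
                centers[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return centers, assignment

# Hypothetical data: two well-separated groups of three points.
pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
       (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, labels = kmeans(pts, 2)
```

On harder data the result can depend on the starting centers, so a common precaution is to rerun with several seeds and keep the lowest-error run.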
Local Optima
K-means can get stuck in these.
Centermost Input Example
Using the centermost input example, rather than a computed centroid, as the center of each cluster.
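One way to pick a centermost input example is sketched below. It assumes "centermost" means the member with the smallest total distance to the other members (often called a medoid); the card itself does not give a definition, and the function names are hypothetical.

```python
import math

def medoid(cluster):
    """Return the member with the smallest total distance to all members."""
    return min(cluster, key=lambda p: sum(math.dist(p, q) for q in cluster))

def centroid(cluster):
    """Return the coordinate-wise mean, which need not be an input point."""
    return tuple(sum(xs) / len(cluster) for xs in zip(*cluster))

# Hypothetical cluster with one far-away member.
cluster = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (10.0, 0.0)]
center_example = medoid(cluster)   # always one of the inputs
center_mean = centroid(cluster)    # pulled toward the far-away member
```

Because `medoid` only needs pairwise distances, the same idea works when the examples have no numerical coordinates to average.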
How to determine the "right" number of clusters
The squared error (SQE) of points from their cluster centers should decrease slowly once k exceeds the right number of clusters.
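A small experiment illustrates the effect. The sketch below uses made-up one-dimensional data around three sources and, instead of running k-means, finds the best possible split for each k by brute force. This relies on the fact that optimal 1-D clusters are contiguous runs of the sorted values. The names `sse` and `best_sse` are hypothetical.

```python
from itertools import combinations

def sse(cluster):
    """Sum of squared errors of 1-D values from their mean."""
    mean = sum(cluster) / len(cluster)
    return sum((x - mean) ** 2 for x in cluster)

def best_sse(values, k):
    """Lowest total error over all ways to cut the sorted values into k runs."""
    xs = sorted(values)
    best = float("inf")
    for cuts in combinations(range(1, len(xs)), k - 1):
        bounds = (0, *cuts, len(xs))
        total = sum(sse(xs[a:b]) for a, b in zip(bounds, bounds[1:]))
        best = min(best, total)
    return best

# Hypothetical data drawn around three sources: near 0, 10, and 20.
data = [0.0, 1.0, 2.0, 9.0, 10.0, 11.0, 19.0, 20.0, 21.0]
errors = {k: best_sse(data, k) for k in range(1, 6)}
```

Here the error falls sharply up to k = 3 and only slightly afterwards, which is the "elbow" the card describes.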
Limitations of K-means
Nested clusters and long, thin clusters.
Agglomerative Clustering
These bottom-up methods repeatedly merge the two nearest clusters.
Single-link clustering
Equivalent to building a minimum spanning tree (MST): each merge joins the two clusters with the closest pair of points.
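The link to minimum spanning trees can be made concrete with a Kruskal-style sketch: sort all pairwise edges by length and accept an edge whenever it joins two different clusters, stopping when k clusters remain. The function name `single_link_clusters` and the stopping rule are assumptions for illustration.

```python
import math
from itertools import combinations

def single_link_clusters(points, k):
    """Merge nearest clusters in MST edge order until k clusters remain."""
    parent = list(range(len(points)))  # union-find forest over point indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # All pairwise edges, shortest first.
    edges = sorted(combinations(range(len(points)), 2),
                   key=lambda e: math.dist(points[e[0]], points[e[1]]))
    remaining = len(points)
    for i, j in edges:
        if remaining == k:
            break
        ri, rj = find(i), find(j)
        if ri != rj:  # the edge joins two clusters, so it is an MST edge
            parent[ri] = rj
            remaining -= 1
    return [find(i) for i in range(len(points))]

# Hypothetical data: a long thin chain plus a separate pair.
pts = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 0), (11, 0)]
labels = single_link_clusters(pts, 2)
```

The chain of four points ends up in one cluster because each point is close to its neighbor, even though its two ends are far apart.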
Hierarchical Agglomerative Clustering
We start with every data point in a separate cluster and keep merging the most similar pairs of data points/clusters until we have one big cluster left.
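A naive version of this procedure is sketched below. It assumes single-link (nearest members) distance between clusters and records each merge with the distance at which it happened; the name `agglomerate` and the nested-tuple tree representation are hypothetical choices.

```python
import math

def agglomerate(points):
    """Naive bottom-up clustering; returns the final tree and merge history."""
    # Each cluster is (tree, members); a leaf's tree is its point index.
    clusters = [(i, [p]) for i, p in enumerate(points)]
    merges = []  # (left tree, right tree, merge distance)
    while len(clusters) > 1:
        # Find the closest pair of clusters under single-link distance.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(math.dist(p, q)
                        for p in clusters[a][1] for q in clusters[b][1])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        (ta, ma), (tb, mb) = clusters[a], clusters[b]
        merges.append((ta, tb, d))
        # Replace the pair with their union; pop b first because b > a.
        clusters.pop(b)
        clusters.pop(a)
        clusters.append(((ta, tb), ma + mb))
    return clusters[0][0], merges

# Hypothetical 1-D data, written as 1-tuples.
tree, merges = agglomerate([(0,), (1,), (5,), (12,)])
heights = [d for _, _, d in merges]
```

The nested tuples in `tree` form a binary tree, and `heights` lists the distance at which each merge occurred, from the closest pair up to the final merge. This version recomputes all pairwise distances on every pass, so it is only suitable for small inputs.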
Output of Hierarchical Clustering
A binary tree or dendrogram.
Dendrogram Height
The height of the bars indicates how close the items are.
Linkage Criteria
Nearest neighbor (single link, MST), Average link, Nearest centroid, Furthest link.
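The four criteria differ only in how they measure the distance between two clusters. The sketch below computes each one for a pair of small, made-up clusters; the function names are hypothetical and Euclidean distance is assumed.

```python
import math
from itertools import product

def cluster_centroid(cluster):
    """Coordinate-wise mean of a cluster's points."""
    return tuple(sum(xs) / len(cluster) for xs in zip(*cluster))

def nearest_neighbor(a, b):
    """Single link: distance between the closest pair across clusters."""
    return min(math.dist(p, q) for p, q in product(a, b))

def furthest_link(a, b):
    """Distance between the most distant pair across clusters."""
    return max(math.dist(p, q) for p, q in product(a, b))

def average_link(a, b):
    """Mean distance over all cross-cluster pairs."""
    return sum(math.dist(p, q) for p, q in product(a, b)) / (len(a) * len(b))

def nearest_centroid(a, b):
    """Distance between the two cluster centroids."""
    return math.dist(cluster_centroid(a), cluster_centroid(b))

# Hypothetical clusters on a line.
a = [(0.0, 0.0), (2.0, 0.0)]
b = [(5.0, 0.0), (9.0, 0.0)]
scores = {
    "nearest neighbor": nearest_neighbor(a, b),
    "average link": average_link(a, b),
    "nearest centroid": nearest_centroid(a, b),
    "furthest link": furthest_link(a, b),
}
```

For these clusters the nearest-neighbor distance is the smallest and the furthest-link distance the largest, with the average and centroid measures in between.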
Advantages of Cluster Hierarchies
Organization of clusters and sub-clusters, visualization, natural measure of distance, and efficient classification of new items.