Clustering Lecture Flashcards

Description and Tags

Flashcards based on Stony Brook University lecture slides by Dr. Steven Skiena on Clustering.

19 Terms

Unsupervised Learning

Methods that find structure in data by providing labels (clusters) or values (rankings) without a trusted standard.

Clustering

The problem of grouping points by similarity, often revealing underlying sources or explanations.

Similarity in Clustering

Defined by some underlying distance function/metric.

Natural Clusters

Clusters that are compact and roughly circular.

Clustering Gene Expression Data

Groups genes active in the same phases of the cell cycle.

Biological Clustering

Associated with dendrograms or phylogenetic trees.

Why use Clustering?

To determine how many distinct populations are in your data, build separate predictive models for each cluster, replace each cluster by its centroid, and detect outliers by distance from cluster centers.

K-Means Clustering

Pick k points as centers, assign each example to its nearest center, recalculate each center as the centroid of the examples assigned to it, and repeat until the assignments are stable.
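
The loop on this card can be sketched in a few lines of plain Python. This is a minimal illustration, not the lecture's own code: the function name `kmeans`, the random choice of starting centers, and the `max_iters` and `seed` parameters are all assumptions made for the example.

```python
import math
import random

def kmeans(points, k, max_iters=100, seed=0):
    """Minimal k-means sketch on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Assumption: the k starting centers are k distinct input points chosen at random.
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iters):
        # Assign every example to its nearest center.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centers will not move again
        labels = new_labels
        # Recalculate each center as the mean (centroid) of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centers, labels

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = kmeans(points, k=2)
```

On well-separated data like this the two groups are recovered, but which label each group receives depends on the starting centers.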

Local Optima

K-means can get stuck in these.

Centermost Input Example

A k-means variation that uses the centermost input example of each cluster as its center, instead of the computed mean.
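
One way to read "centermost" is the input example with the smallest total distance to the other members of its cluster. That reading, and the helper name `medoid`, are assumptions for this sketch.

```python
import math

def medoid(cluster):
    """Return the member of `cluster` with the smallest summed distance to all members."""
    return min(cluster, key=lambda p: sum(math.dist(p, q) for q in cluster))

cluster = [(0, 0), (1, 0), (2, 0), (3, 0), (20, 0)]
center = medoid(cluster)  # always one of the input examples
mean = tuple(sum(axis) / len(cluster) for axis in zip(*cluster))  # need not be an input
```

Unlike the mean, the chosen center is an actual data point, and a single far-away example pulls it much less.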

How to determine the "right" number of clusters

The squared error (SQE) of points from their cluster centers should decrease only slowly once k exceeds the right number of clusters.
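
A small sketch of that idea: run k-means for several values of k and compare the summed squared error. The helper names are hypothetical, and the farthest-first choice of starting centers is an assumption made only to keep the example deterministic; the card itself does not prescribe it.

```python
import math

def farthest_first(points, k):
    """Assumed initializer: start at points[0], then keep adding the farthest point."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers

def kmeans_centers(points, k, iters=50):
    """Plain k-means iterations from the assumed farthest-first start."""
    centers = farthest_first(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centers[j]))
            groups[nearest].append(p)
        # New center is the mean of its group; an empty group keeps its old center.
        centers = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else centers[j]
            for j, g in enumerate(groups)
        ]
    return centers

def squared_error(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    return sum(min(math.dist(p, c) ** 2 for c in centers) for p in points)

# Three well-separated blobs of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 0), (10, 1), (11, 0), (11, 1),
          (5, 10), (5, 11), (6, 10), (6, 11)]
curve = {k: squared_error(points, kmeans_centers(points, k)) for k in range(1, 5)}
```

For this data the error falls steeply up to k = 3 and barely changes at k = 4, which is the flattening the card describes.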

Limitations of K-means

K-means does poorly on nested clusters and on long, thin clusters.

Agglomerative Clustering

These bottom-up methods repeatedly merge the two nearest clusters.

Single-link clustering

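Equivalent to building a minimum spanning tree: repeatedly merging the two clusters with the closest pair of points adds edges in the same order as Kruskal's algorithm.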
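
The connection can be sketched with Kruskal's algorithm: add the shortest edges first and stop once k components remain. The function name and the union-find details are illustrative assumptions, not taken from the slides.

```python
import math
from itertools import combinations

def single_link_clusters(points, k):
    """Single-link clustering into k groups by adding MST edges in Kruskal order."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent pointers to the component root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairs of point indices, shortest edge first.
    edges = sorted(
        combinations(range(len(points)), 2),
        key=lambda e: math.dist(points[e[0]], points[e[1]]),
    )
    remaining = len(points)
    for a, b in edges:
        if remaining == k:
            break
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # this edge joins two different clusters
            parent[root_a] = root_b
            remaining -= 1
    return [find(i) for i in range(len(points))]

# A long, thin chain plus a distant pair.
points = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 0), (11, 0)]
labels = single_link_clusters(points, k=2)
```

Because only the nearest pair of points matters, the whole chain ends up in one cluster.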

Hierarchical Agglomerative Clustering

We start with every data point in a separate cluster and keep merging the most similar pairs of data points/clusters until we have one big cluster left.
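
A naive sketch of this procedure, assuming the nearest-neighbor (single-link) distance between clusters; the function name `agglomerate` and the shape of the returned merge list are choices made for the example. It scans every pair of clusters at each step, so it is only suitable for small inputs.

```python
import math

def agglomerate(points):
    """Merge the two closest clusters until one remains; return the list of merges."""
    clusters = [[p] for p in points]  # every point starts in its own cluster
    merges = []                       # (merge distance, left members, right members)

    def gap(a, b):
        # Assumed criterion: distance between the closest pair of points.
        return min(math.dist(p, q) for p in a for q in b)

    while len(clusters) > 1:
        pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: gap(clusters[ij[0]], clusters[ij[1]]))
        left, right = clusters[i], clusters[j]
        merges.append((gap(left, right), left, right))
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)]
        clusters.append(left + right)
    return merges

merges = agglomerate([(0,), (1,), (5,), (6,), (20,)])
```

The n - 1 merges form a binary tree, and each recorded merge distance can serve as the height of the corresponding bar in a dendrogram.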

Output of Hierarchical Clustering

A binary tree or dendrogram.

Dendrogram Height

The height of each bar indicates how close the merged items are.

Linkage Criteria

Nearest neighbor (single link, MST), Average link, Nearest centroid, Furthest link.
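
The four criteria differ only in how the distance between two clusters is measured. A minimal sketch, with function names chosen for illustration and Euclidean distance assumed:

```python
import math

def nearest_neighbor(a, b):
    """Single link: distance between the closest pair of points."""
    return min(math.dist(p, q) for p in a for q in b)

def furthest_link(a, b):
    """Distance between the farthest pair of points."""
    return max(math.dist(p, q) for p in a for q in b)

def average_link(a, b):
    """Mean distance over all cross-cluster pairs."""
    return sum(math.dist(p, q) for p in a for q in b) / (len(a) * len(b))

def nearest_centroid(a, b):
    """Distance between the two cluster centroids."""
    def centroid(c):
        return tuple(sum(axis) / len(c) for axis in zip(*c))
    return math.dist(centroid(a), centroid(b))

a = [(0, 0), (1, 0)]
b = [(3, 0), (3, 4)]
scores = {f.__name__: f(a, b)
          for f in (nearest_neighbor, furthest_link, average_link, nearest_centroid)}
```

On the same pair of clusters the four functions give four different values, so the choice of criterion can change which clusters merge next.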

Advantages of Cluster Hierarchies

Organization of clusters and sub-clusters, visualization, natural measure of distance, and efficient classification of new items.