CS363M Flashcards Lecture 12-14

Description and Tags

Clustering Techniques

Last updated 4:42 PM on 4/1/26

27 Terms

1

Unsupervised ML

machine learning where there are no target class labels

2

Clustering

ML technique where data points are divided into groups that are potentially meaningful or useful; items inside a cluster should be similar to each other, but different from items in other clusters

3

K-means algorithm

  1. Choose k random points from training data to be initial centroids

  2. Each data point assigned to closest centroid to form clusters

  3. Update centroid of each cluster based on points assigned to that cluster

  4. Re-assign points to closest centroid

  5. Repeat until no points change cluster (or until assignments begin to oscillate)
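
The steps above can be sketched in NumPy (a minimal illustration with my own function and variable names, not library code):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Sketch of the k-means steps above."""
    rng = np.random.default_rng(seed)
    # 1. Choose k random points from the training data as initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # 2./4. Assign each point to its closest centroid
        dists = np.linalg.norm(X[:, None] - centroids[None], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Update each centroid to the mean of its assigned points
        # (if a cluster goes empty, keep its old centroid; proper
        # empty-cluster handling is covered in a later card)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        # 5. Stop when no centroid moves (assignments have stabilized)
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

# Two well-separated blobs should come back as two clusters
X = np.array([[0, 0], [0.1, 0], [0, 0.1],
              [5, 5], [5.1, 5], [5, 5.1]], dtype=float)
labels, centroids = kmeans(X, k=2)
```

A production version would also rerun with several random initializations and keep the result with the lowest SSE.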

4

What is the goal when using Euclidean distance as a proximity measure for the k-means algorithm?

Minimize the SSE (aka scatter)

5

What are some ways you can respond to an empty cluster during the k-means algorithm?

  • Pick datapoint furthest from its centroid

  • Pick random datapoint (again)

  • Pick random datapoint within the cluster with the highest SSE

6

K-means++

Centroid initialization algorithm (run before k-means)

  • Choose a random centroid

  • Repeat the following until there are k centroids

    • Compute the distance from every point to its nearest centroid

    • Use squared distance as a probability distribution for choosing the next centroid (farther points are more likely to be selected)
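
A sketch of this initialization (assuming NumPy; the function name and test data are my own):

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """k-means++ centroid initialization, per the steps above."""
    rng = np.random.default_rng(seed)
    # Choose the first centroid uniformly at random
    centroids = [X[rng.integers(len(X))]]
    while len(centroids) < k:
        # Squared distance from every point to its nearest chosen centroid
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
        # Farther points are proportionally more likely to be picked next
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)

# Two distant blobs: the second centroid is very likely in the other blob
X = np.array([[0, 0], [0.1, 0.1], [9, 9], [9.1, 9.1]], dtype=float)
centroids = kmeans_pp_init(X, k=2)
```

Note the first centroid has probability zero of being re-chosen (its squared distance to itself is 0), so the k centroids are always distinct points.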

7

Why do we not always want to choose the farthest point during the k-means++ algorithm?

  • farthest point could be an outlier

  • a deterministic algorithm that reaches a suboptimal solution will always end up at that same suboptimal solution

8

Bisecting K-means

  • To obtain k clusters, split the set of all data points into 2 clusters using k-means with k = 2

  • Choose one of the clusters to split

  • Split the chosen cluster using k-means with k = 2

  • Continue until you have k clusters

  • Less susceptible to initialization problems than regular k-means; more likely to reach an optimal solution than regular k-means, and also faster
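
The bisecting procedure can be sketched as follows (a minimal NumPy illustration; picking the cluster with the highest SSE as the one to split is one common choice, and all names are my own):

```python
import numpy as np

def two_means(X, seed=0):
    """Helper: split X into 2 clusters with plain k-means (k = 2)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=2, replace=False)]
    for _ in range(100):
        labels = np.linalg.norm(X[:, None] - centroids[None], axis=2).argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in (0, 1)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels

def bisecting_kmeans(X, k):
    """Keep splitting until there are k clusters (lists of point indices)."""
    clusters = [np.arange(len(X))]          # start with one all-inclusive cluster
    while len(clusters) < k:
        # Choose the cluster with the largest SSE to split next
        sses = [np.sum((X[idx] - X[idx].mean(axis=0)) ** 2) for idx in clusters]
        idx = clusters.pop(int(np.argmax(sses)))
        labels = two_means(X[idx])
        clusters += [idx[labels == 0], idx[labels == 1]]
    return clusters

# Three blobs of two points each should be recovered with k = 3
X = np.array([[0, 0], [0.2, 0], [5, 5], [5.2, 5], [10, 10], [10.2, 10]], dtype=float)
clusters = bisecting_kmeans(X, k=3)
parts = sorted(sorted(c.tolist()) for c in clusters)
```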

9

How can we choose the right k for k-means?

  • Try different k’s and evaluate the results

  • Dataset / Domain knowledge

  • Use case (ie 5 clusters if you only want to send 5 emails)

  • Elbow Method — graph the SSE as k increases and look for the point where the SSE values start to flatten (ie converge toward 0)

    • Note: if k = <size of dataset>, then SSE=0 but we have “overfitting” (which doesn’t really exist with unsupervised learning, but for lack of a better term…)
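
The Elbow Method can be sketched like this (a minimal NumPy illustration; function and variable names are my own, and empty clusters are crudely handled by keeping the old centroid):

```python
import numpy as np

def kmeans_sse(X, k, seed=0):
    """Run plain k-means and return the final SSE (sum of squared
    distances from each point to its assigned centroid)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(100):
        labels = np.linalg.norm(X[:, None] - centroids[None], axis=2).argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return float(np.sum(np.linalg.norm(X - centroids[labels], axis=1) ** 2))

# Two tight blobs: SSE drops sharply from k=1 to k=2, then flattens —
# the "elbow" suggests k = 2
X = np.array([[0, 0], [0.1, 0], [0, 0.1],
              [5, 5], [5.1, 5], [5, 5.1]], dtype=float)
sses = [kmeans_sse(X, k) for k in (1, 2, 3)]
```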

10

Characteristics of k-means

  • Pro: Simple; can be used for wide variety of data types

  • Pro: Quite efficient, even when run multiple times

  • Variations exist that are more efficient and less susceptible to initialization problems

  • Outliers can alter results, but outlier detection and removal can be done before clustering

  • Con: Susceptible to the Curse of Dimensionality

  • Feature selection / feature engineering is critical

  • Con: Algorithm has difficulties with non-globular (ie non-spherical) data, clusters of different sizes, and clusters of different densities

11

What is one way we can maybe make k-means work better on non-globular data? What is one limitation of this strategy?

  • Increase the number of clusters

  • There is no way to know if different clusters are next to each other / contiguous

12

density-based clustering

locates regions of high density that are separated by regions of low density

13

Center-based density

density is calculated for a particular point in a data set by counting the number of points within a specified radius (Eps) of that point (count includes the point itself)
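
The definition above is a one-liner in NumPy (the function name and test data are my own):

```python
import numpy as np

def density(X, i, eps):
    """Center-based density of point i: points within Eps, self included."""
    return int(np.sum(np.linalg.norm(X - X[i], axis=1) <= eps))

X = np.array([[0, 0], [0.5, 0], [0, 0.5], [10, 10]], dtype=float)
dense = density(X, 0, eps=1.0)    # itself plus two close neighbors
sparse = density(X, 3, eps=1.0)   # the far point counts only itself
```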

14

Core points

points on the interior of a dense region; points with at least minPts neighbors within Eps radius of the point

15

Border points

points on the border of a dense region; point that is not a core point, but falls in the neighborhood (ie within Eps radius) of a core point

16

Noise points

points in sparse regions; points that are neither core nor border points

17

DBSCAN

  • Classify all points as core, border, or noise

  • Label noise points

  • Any two core points that are within Eps of each other are put into the same cluster

  • Any border point that is within Eps of a core point is put into the same cluster as the core point (ties may need to be resolved)
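
The first DBSCAN step, classifying every point as core, border, or noise, can be sketched as below (NumPy; names are my own, and the later cluster-forming steps are not shown):

```python
import numpy as np

def classify_points(X, eps, min_pts):
    """Label every point core, border, or noise per the definitions above."""
    # Pairwise distance matrix; 'neighbors' includes the point itself
    d = np.linalg.norm(X[:, None] - X[None], axis=2)
    neighbors = d <= eps
    # Core: at least min_pts neighbors (self included) within eps
    core = neighbors.sum(axis=1) >= min_pts
    # Border: not core, but within eps of some core point
    border = ~core & (neighbors & core[None, :]).any(axis=1)
    # Noise: neither core nor border
    noise = ~core & ~border
    return core, border, noise

# Four tightly packed points, one fringe point, one isolated point
X = np.array([[0, 0], [0.3, 0], [0, 0.3], [0.3, 0.3],
              [0.6, 0], [5, 5]], dtype=float)
core, border, noise = classify_points(X, eps=0.5, min_pts=4)
```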

18

directly density reachable

when q is in the Eps-neighborhood of p and p is a core point

( ie q is directly density reachable from p)

19

density reachable

when there is a chain of objects q_1,q_2,…,q_n where q_1=p and q_n=q such that q_(i+1) is directly density reachable from q_i
(ie q is density reachable from p with respect to Eps and minPts)

20

density-connected

when there is an object o such that both p and q are density reachable from o with respect to Eps and minPts

  • points that are at least density connected will end up in the same cluster

  • ie q is density-connected to p with respect to Eps and minPts

21

Determining Eps and minPts

  • Select some k (sometimes based on domain knowledge, but k=4 is often used)

  • Compute k-dist (ie distance from a point to its kth nearest neighbor) for all data points and sort them in increasing order

  • Find the point on the sorted plot of k-dist values where there is a sharp change; the chosen k corresponds to a suitable value of minPts, and the distance value at that point corresponds to a suitable Eps value
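
Computing the k-dist curve from the steps above can be sketched as follows (NumPy; names are my own):

```python
import numpy as np

def k_dist_curve(X, k):
    """Sorted distance from each point to its k-th nearest neighbor —
    the curve inspected for a sharp change when picking Eps."""
    d = np.linalg.norm(X[:, None] - X[None], axis=2)
    d_sorted = np.sort(d, axis=1)     # column 0 is each point's self-distance (0)
    return np.sort(d_sorted[:, k])    # k-th nearest neighbor, excluding self

# Four clustered points and one outlier: the curve jumps at the outlier,
# so a suitable Eps is just above the flat part (about 0.2 here)
X = np.array([[0, 0], [0.2, 0], [0, 0.2], [0.2, 0.2], [8, 8]], dtype=float)
curve = k_dist_curve(X, k=2)
```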

22

Characteristics of DBSCAN

  • Pro: Can handle clusters of arbitrary shapes and sizes

  • Pro: Resistant to noise and outliers

  • Con: Susceptible to the curse of dimensionality

  • Con: Can struggle with clusters of different densities

23

Hierarchical Clustering

Produces a set of nested clusters, organized as a hierarchy

24

Divisive Hierarchical Clustering

start w/ one all-inclusive cluster and at each step, split a cluster until only individual points remain

25

Agglomerative Hierarchical Clustering (HAC)

start w/ points as individual clusters and at each step, merge the closest pair of clusters

26

min/single link

the proximity between the two closest points in different clusters

27

max/complete link

the proximity between the two farthest points in different clusters
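
Both linkage measures from the last two cards can be sketched as follows (NumPy; names are my own):

```python
import numpy as np

def single_link(A, B):
    """Single link: distance between the closest cross-cluster pair."""
    return float(np.linalg.norm(A[:, None] - B[None], axis=2).min())

def complete_link(A, B):
    """Complete link: distance between the farthest cross-cluster pair."""
    return float(np.linalg.norm(A[:, None] - B[None], axis=2).max())

A = np.array([[0, 0], [1, 0]], dtype=float)   # cluster 1
B = np.array([[3, 0], [4, 0]], dtype=float)   # cluster 2
s, c = single_link(A, B), complete_link(A, B)  # closest pair: (1,0)-(3,0); farthest: (0,0)-(4,0)
```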
