Clustering Techniques
Unsupervised ML
machine learning where there are no target class labels
Clustering
ML technique where data points are divided into groups that are potentially meaningful or useful; items inside a cluster should be similar to each other, but different from items in other clusters
K-means algorithm
Choose k random points from training data to be initial centroids
Each data point assigned to closest centroid to form clusters
Update centroid of each cluster based on points assigned to that cluster
Re-assign points to closest centroid
Repeat until no points change cluster (or until assignments keep oscillating)
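The steps above can be sketched in Python (a minimal illustration assuming NumPy; function and variable names are my own, not from any particular library):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Choose k random points from the training data as initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assign each data point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update each centroid to the mean of the points assigned to it
        # (an empty cluster keeps its old centroid in this sketch)
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        # Stop when centroids stop moving, i.e. no point changes cluster
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```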
What is the goal when using Euclidean distance as a proximity measure for the k-means algorithm?
Minimize the SSE (aka scatter)
What are some ways you can respond to an empty cluster during the k-means algorithm?
Pick datapoint furthest from its centroid
Pick random datapoint (again)
Pick random datapoint within the cluster with the highest SSE
K-means++
Centroid initialization algorithm (run before k-means)
Choose a random centroid
Repeat the following until there are k centroids
Compute the distance from every point to its nearest centroid
Use squared distance as a probability distribution for choosing the next centroid (farther points are more likely to be selected)
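The initialization above can be sketched as follows (assuming NumPy; a minimal sketch, not scikit-learn's implementation):

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # Choose the first centroid uniformly at random
    centroids = [X[rng.integers(len(X))]]
    while len(centroids) < k:
        # Squared distance from every point to its nearest chosen centroid
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        # Use those squared distances as a probability distribution:
        # farther points are more likely (but not guaranteed) to be picked
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)
```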
Why do we not always want to choose the farthest point during the k-means++ algorithm?
farthest point could be an outlier
a deterministic algorithm that reaches a suboptimal solution will always end up at that same suboptimal solution
Bisecting K-means
To obtain k clusters, split the set of all data points into 2 clusters using k-means with k = 2
Choose one of the clusters to split (eg the largest cluster, or the one with the highest SSE)
Split the chosen cluster using k-means with k = 2
Continue until you have k clusters
Less susceptible to initialization problems than regular k-means; more likely to reach an optimal solution than regular k-means, and also faster
How can we choose the right k for k-means?
Try different k’s and evaluate the results
Dataset / Domain knowledge
Use case (ie 5 clusters if you only want to send 5 emails)
Elbow Method — graph the SSE as k increases; look for the point where the SSE values start to flatten (ie converge toward 0)
Note: if k = <size of dataset>, then SSE=0 but we have “overfitting” (which doesn’t really exist with unsupervised learning, but for lack of a better term…)
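The elbow method can be sketched as computing the SSE curve over several values of k (assuming NumPy; a deterministic farthest-first init is used here purely so the sketch is reproducible — in practice you would run random-init k-means several times per k):

```python
import numpy as np

def _farthest_first_init(X, k):
    # Deterministic stand-in for random init so the curve is reproducible
    cents = [X[0]]
    while len(cents) < k:
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in cents], axis=0)
        cents.append(X[d2.argmax()])
    return np.array(cents)

def _lloyd(X, cents):
    # Standard assign/update k-means iterations
    for _ in range(100):
        labels = np.linalg.norm(X[:, None] - cents[None], axis=2).argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else cents[j] for j in range(len(cents))])
        if np.allclose(new, cents):
            break
        cents = new
    return labels, cents

def sse_curve(X, k_max):
    # SSE for k = 1 .. k_max; plot this and look for the "elbow"
    sses = []
    for k in range(1, k_max + 1):
        labels, cents = _lloyd(X, _farthest_first_init(X, k))
        sses.append(((X - cents[labels]) ** 2).sum())
    return sses
```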
Characteristics of k-means
Pro: Simple; can be used for wide variety of data types
Pro: Quite efficient, even when run multiple times
Variations exist that are more efficient and less susceptible to initialization problems
Outliers can alter results, but outlier detection and removal can be done before clustering
Con: Susceptible to the Curse of Dimensionality
Feature selection / feature engineering is critical
Con: Algorithm has difficulties with non-globular (ie non-spherical) data, clusters of different sizes, and clusters of different densities
What is one way we can maybe make k-means work better on non-globular data? What is one limitation of this strategy?
Increase the number of clusters
There is no way to know if different clusters are next to each other / contiguous
density-based clustering
locates regions of high density that are separated by regions of low density
Center-based density
i.e. density is calculated for a particular point in a data set by counting the number of points within a specified radius (Eps) of that point (count includes the point itself)
Core points
points on the interior of a dense region; points with at least minPts neighbors within Eps radius of the point
Border points
points on the border of a dense region; point that is not a core point, but falls in the neighborhood (ie within Eps radius) of a core point
Noise points
points in sparse regions; points that are neither core nor border points
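The three point types above can be computed directly from pairwise distances (a minimal sketch assuming NumPy; the function name is illustrative):

```python
import numpy as np

def classify_points(X, eps, min_pts):
    # Center-based density: count points within Eps of each point,
    # including the point itself
    d = np.linalg.norm(X[:, None] - X[None], axis=2)
    n_neighbors = (d <= eps).sum(axis=1)
    # Core: at least minPts neighbors within Eps
    core = n_neighbors >= min_pts
    # Border: not core, but within Eps of some core point
    border = ~core & (d[:, core] <= eps).any(axis=1)
    # Noise: neither core nor border
    noise = ~core & ~border
    return core, border, noise
```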
DBSCAN
Classify all points as core, border, or noise
Label noise points
Any two core points that are within Eps of each other are put into the same cluster
Any border point that is within Eps of a core point is put into the same cluster as the core point (ties may need to be resolved)
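The four DBSCAN steps can be sketched as one function (assuming NumPy; a naive O(n²) illustration, not the indexed implementation real libraries use):

```python
import numpy as np

def dbscan(X, eps, min_pts):
    d = np.linalg.norm(X[:, None] - X[None], axis=2)
    # Step 1: classify core points (neighborhood count includes the point itself)
    core = (d <= eps).sum(axis=1) >= min_pts
    # Step 2: label everything noise initially (-1)
    labels = np.full(len(X), -1)
    cluster = 0
    # Step 3: core points within Eps of each other join the same cluster
    for i in np.where(core)[0]:
        if labels[i] != -1:
            continue
        labels[i] = cluster
        frontier = [i]
        while frontier:
            p = frontier.pop()
            for q in np.where((d[p] <= eps) & core)[0]:
                if labels[q] == -1:
                    labels[q] = cluster
                    frontier.append(q)
        cluster += 1
    # Step 4: border points join the cluster of a core point within Eps
    for i in np.where(~core)[0]:
        near_core = np.where((d[i] <= eps) & core)[0]
        if len(near_core):
            labels[i] = labels[near_core[0]]  # ties resolved arbitrarily
    return labels
```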
directly density reachable
when q is in the Eps-neighborhood of p and p is a core point
( ie q is directly density reachable from p)
density reachable
when there is a chain of objects q_1,q_2,…,q_n where q_1=p and q_n=q such that q_(i+1) is directly density reachable from q_i
(ie q is density reachable from p with respect to Eps and minPts)
density-connected
when there is an object o such that both p and q are density reachable from o with respect to Eps and minPts
points that are at least density connected will end up in the same cluster
ie q is density-connected to p with respect to Eps and minPts
Determining Eps and minPts
Select some k (sometimes based on domain knowledge, but k=4 is often used)
Compute k-dist (ie distance from a point to its kth nearest neighbor) for all data points and sort them in increasing order
On a plot of the sorted k-dist values, find the point where there is a sharp change (the knee); the distance at that point is a suitable Eps value, and k itself serves as a suitable value for minPts
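The sorted k-dist curve from the procedure above can be computed as follows (assuming NumPy; you would then plot the returned values and look for the knee):

```python
import numpy as np

def sorted_k_dist(X, k=4):
    # Distance from each point to its k-th nearest neighbor, sorted
    # in increasing order; plot this curve and look for the sharp bend
    d = np.linalg.norm(X[:, None] - X[None], axis=2)
    d.sort(axis=1)        # column 0 is each point's self-distance (0)
    return np.sort(d[:, k])
```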
Characteristics of DBSCAN
Pro: Can handle clusters of arbitrary shapes and sizes
Pro: Resistant to noise and outliers
Con: Susceptible to the curse of dimensionality
Con: Can struggle with clusters of different densities
Hierarchical Clustering
Produces a set of nested clusters, organized as a hierarchy
Divisive Hierarchical Clustering
start w/ one all-inclusive cluster and at each step, split a cluster until only individual points remain
Agglomerative Hierarchical Clustering (HAC)
start w/ points as individual clusters and at each step, merge the closest pair of clusters
min/single link
the proximity btwn. the two closest points in different clusters
max/complete link
the proximity btwn. the two farthest points in different clusters
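Agglomerative clustering with either linkage can be sketched as below (assuming NumPy; a naive merge loop for illustration — SciPy's `scipy.cluster.hierarchy.linkage` is the usual tool in practice):

```python
import numpy as np

def hac(X, n_clusters, link="single"):
    # Start with each point as its own cluster, then repeatedly merge
    # the closest pair of clusters under the chosen linkage
    d = np.linalg.norm(X[:, None] - X[None], axis=2)
    clusters = [[i] for i in range(len(X))]
    # single link: proximity between the two closest points;
    # complete link: proximity between the two farthest points
    agg = np.min if link == "single" else np.max
    while len(clusters) > n_clusters:
        best, pair = np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                prox = agg(d[np.ix_(clusters[a], clusters[b])])
                if prox < best:
                    best, pair = prox, (a, b)
        a, b = pair
        clusters[a] += clusters.pop(b)
    labels = np.empty(len(X), dtype=int)
    for c, idx in enumerate(clusters):
        labels[idx] = c
    return labels
```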