Clustering Techniques
Unsupervised ML
machine learning where there are no target class labels
Clustering
ML technique where data points are divided into groups that are potentially meaningful or useful; items inside a cluster should be similar to each other, but different from items in other clusters
K-means algorithm
Choose k random points from training data to be initial centroids
Each data point assigned to closest centroid to form clusters
Update centroid of each cluster based on points assigned to that cluster
Re-assign points to closest centroid
Repeat until no points change cluster (or until assignments keep oscillating)
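The steps above can be sketched in Python (a minimal illustration assuming NumPy; function and variable names are my own, not from any particular library):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Choose k random points from the training data as initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assign each data point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update each centroid to the mean of the points assigned to it
        # (an empty cluster keeps its old centroid in this sketch)
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        # Stop when centroids stop moving, i.e. no point changes cluster
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```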
What is the goal when using Euclidean distance as a proximity measure for the k-means algorithm?
Minimize the SSE (aka scatter)
What are some ways you can respond to an empty cluster during the k-means algorithm?
Pick datapoint furthest from its centroid
Pick random datapoint (again)
Pick random datapoint within the cluster with the highest SSE
K-means++
Centroid initialization algorithm (run before k-means)
Choose a random centroid
Repeat the following until there are k centroids
Compute the distance from every point to its nearest centroid
Use squared distance as a probability distribution for choosing the next centroid (farther points are more likely to be selected)
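The initialization above can be sketched as follows (assuming NumPy; a minimal sketch, not scikit-learn's implementation):

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # Choose the first centroid uniformly at random
    centroids = [X[rng.integers(len(X))]]
    while len(centroids) < k:
        # Squared distance from every point to its nearest chosen centroid
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        # Use those squared distances as a probability distribution:
        # farther points are more likely (but not guaranteed) to be picked
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)
```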
Why do we not always want to choose the farthest point during the k-means++ algorithm?
farthest point could be an outlier
a deterministic algorithm that reaches a suboptimal solution will always end up at that same suboptimal solution
Bisecting K-means
To obtain k clusters, split the set of all data points into 2 clusters using k-means with k = 2
Choose one of the clusters to split (eg the largest cluster, or the one with the highest SSE)
Split the chosen cluster using k-means with k = 2
Continue until you have k clusters
Less susceptible to initialization problems than regular k-means; more likely to reach an optimal solution than regular k-means, and also faster
How can we choose the right k for k-means?
Try different k’s and evaluate the results
Dataset / Domain knowledge
Use case (ie 5 clusters if you only want to send 5 emails)
Elbow Method — graph the SSE as k increases; look for the point where the SSE values start to flatten (ie converge toward 0)
Note: if k = <size of dataset>, then SSE=0 but we have “overfitting” (which doesn’t really exist with unsupervised learning, but for lack of a better term…)
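The elbow method can be sketched as computing the SSE curve over several values of k (assuming NumPy; a deterministic farthest-first init is used here purely so the sketch is reproducible — in practice you would run random-init k-means several times per k):

```python
import numpy as np

def _farthest_first_init(X, k):
    # Deterministic stand-in for random init so the curve is reproducible
    cents = [X[0]]
    while len(cents) < k:
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in cents], axis=0)
        cents.append(X[d2.argmax()])
    return np.array(cents)

def _lloyd(X, cents):
    # Standard assign/update k-means iterations
    for _ in range(100):
        labels = np.linalg.norm(X[:, None] - cents[None], axis=2).argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else cents[j] for j in range(len(cents))])
        if np.allclose(new, cents):
            break
        cents = new
    return labels, cents

def sse_curve(X, k_max):
    # SSE for k = 1 .. k_max; plot this and look for the "elbow"
    sses = []
    for k in range(1, k_max + 1):
        labels, cents = _lloyd(X, _farthest_first_init(X, k))
        sses.append(((X - cents[labels]) ** 2).sum())
    return sses
```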
Characteristics of k-means
Pro: Simple; can be used for wide variety of data types
Pro: Quite efficient, even when run multiple times
Variations exist that are more efficient and less susceptible to initialization problems
Outliers can alter results, but outlier detection and removal can be done before clustering
Con: Susceptible to the Curse of Dimensionality
Feature selection / feature engineering is critical
Con: Algorithm has difficulties with non-globular (ie non-spherical) data, clusters of different sizes, and clusters of different densities
What is one way we can maybe make k-means work better on non-globular data? What is one limitation of this strategy?
Increase the number of clusters
There is no way to know if different clusters are next to each other / contiguous
density-based clustering
locates regions of high density that are separated by regions of low density
Center-based density
i.e. density is calculated for a particular point in a data set by counting the number of points within a specified radius (Eps) of that point (count includes the point itself)
Core points
points on the interior of a dense region; points with at least minPts neighbors within Eps radius of the point
Border points
points on the border of a dense region; point that is not a core point, but falls in the neighborhood (ie within Eps radius) of a core point
Noise points
points in sparse regions; points that are neither core nor border points
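The three point types above can be computed directly from pairwise distances (a minimal sketch assuming NumPy; the function name is illustrative):

```python
import numpy as np

def classify_points(X, eps, min_pts):
    # Center-based density: count points within Eps of each point,
    # including the point itself
    d = np.linalg.norm(X[:, None] - X[None], axis=2)
    n_neighbors = (d <= eps).sum(axis=1)
    # Core: at least minPts neighbors within Eps
    core = n_neighbors >= min_pts
    # Border: not core, but within Eps of some core point
    border = ~core & (d[:, core] <= eps).any(axis=1)
    # Noise: neither core nor border
    noise = ~core & ~border
    return core, border, noise
```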
DBSCAN
Classify all points as core, border, or noise
Label noise points
Any two core points that are within Eps of each other are put into the same cluster
Any border point that is within Eps of a core point is put into the same cluster as the core point (ties may need to be resolved)
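The four DBSCAN steps can be sketched as one function (assuming NumPy; a naive O(n²) illustration, not the indexed implementation real libraries use):

```python
import numpy as np

def dbscan(X, eps, min_pts):
    d = np.linalg.norm(X[:, None] - X[None], axis=2)
    # Step 1: classify core points (neighborhood count includes the point itself)
    core = (d <= eps).sum(axis=1) >= min_pts
    # Step 2: label everything noise initially (-1)
    labels = np.full(len(X), -1)
    cluster = 0
    # Step 3: core points within Eps of each other join the same cluster
    for i in np.where(core)[0]:
        if labels[i] != -1:
            continue
        labels[i] = cluster
        frontier = [i]
        while frontier:
            p = frontier.pop()
            for q in np.where((d[p] <= eps) & core)[0]:
                if labels[q] == -1:
                    labels[q] = cluster
                    frontier.append(q)
        cluster += 1
    # Step 4: border points join the cluster of a core point within Eps
    for i in np.where(~core)[0]:
        near_core = np.where((d[i] <= eps) & core)[0]
        if len(near_core):
            labels[i] = labels[near_core[0]]  # ties resolved arbitrarily
    return labels
```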
directly density reachable
when q is in the Eps-neighborhood of p and p is a core point
( ie q is directly density reachable from p)
density reachable
when there is a chain of objects q_1,q_2,…,q_n where q_1=p and q_n=q such that q_(i+1) is directly density reachable from q_i
(ie q is density reachable from p with respect to Eps and minPts)
density-connected
when there is an object o such that both p and q are density reachable from o with respect to Eps and minPts
points that are at least density connected will end up in the same cluster
ie q is density-connected to p with respect to Eps and minPts
Determining Eps and minPts
Select some k (sometimes based on domain knowledge, but k=4 is often used)
Compute k-dist (ie distance from a point to its kth nearest neighbor) for all data points and sort them in increasing order
On a plot of the sorted k-dist values, find the point where there is a sharp change (the knee); the distance at that point is a suitable Eps value, and k itself serves as a suitable value for minPts
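The sorted k-dist curve from the procedure above can be computed as follows (assuming NumPy; you would then plot the returned values and look for the knee):

```python
import numpy as np

def sorted_k_dist(X, k=4):
    # Distance from each point to its k-th nearest neighbor, sorted
    # in increasing order; plot this curve and look for the sharp bend
    d = np.linalg.norm(X[:, None] - X[None], axis=2)
    d.sort(axis=1)        # column 0 is each point's self-distance (0)
    return np.sort(d[:, k])
```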
Characteristics of DBSCAN
Pro: Can handle clusters of arbitrary shapes and sizes
Pro: Resistant to noise and outliers
Con: Susceptible to the curse of dimensionality
Con: Can struggle with clusters of different densities
Hierarchical Clustering
Produces a set of nested clusters, organized as a hierarchy
Divisive Hierarchical Clustering
start w/ one all-inclusive cluster and at each step, split a cluster until only individual points remain
Agglomerative Hierarchical Clustering (HAC)
start w/ points as individual clusters and at each step, merge the closest pair of clusters
min/single link
the proximity btwn. the two closest points in different clusters
max/complete link
the proximity btwn. the two farthest points in different clusters
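Agglomerative clustering with either linkage can be sketched as below (assuming NumPy; a naive merge loop for illustration — SciPy's `scipy.cluster.hierarchy.linkage` is the usual tool in practice):

```python
import numpy as np

def hac(X, n_clusters, link="single"):
    # Start with each point as its own cluster, then repeatedly merge
    # the closest pair of clusters under the chosen linkage
    d = np.linalg.norm(X[:, None] - X[None], axis=2)
    clusters = [[i] for i in range(len(X))]
    # single link: proximity between the two closest points;
    # complete link: proximity between the two farthest points
    agg = np.min if link == "single" else np.max
    while len(clusters) > n_clusters:
        best, pair = np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                prox = agg(d[np.ix_(clusters[a], clusters[b])])
                if prox < best:
                    best, pair = prox, (a, b)
        a, b = pair
        clusters[a] += clusters.pop(b)
    labels = np.empty(len(X), dtype=int)
    for c, idx in enumerate(clusters):
        labels[idx] = c
    return labels
```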