Clustering Algorithms Detailed Notes

Hierarchical Clustering: Intuitive Organization

Most intuitive clustering method, mirroring human organization.
Example: Organizing languages or biological species into trees.
In 19th-century biology, trees were built based on features (e.g., animal has wings, legs).
Objective: Grouping entities with shared features, the core idea of clustering.

Representing Levels of Clustering

Excellent for representing hierarchical levels within data.
Example: Grouping five data points into four, three, or two subgroups.
Generates a tree structure, useful if data naturally falls into groups and subgroups.
Generally performs hard clustering, where each data point belongs to only one cluster.
Soft hierarchical clustering exists, based on probabilities and proportions, but hard clustering is more common.
Key Feature: Builds a tree that can be cut at different levels to determine subgroup organization.

Dendrograms: Visualizing Hierarchy

Tree diagram representing the hierarchy is called a dendrogram (dendro = tree).
Typically binary trees for simplicity (Occam's razor).
Non-binary dendrograms are rare unless in soft clustering scenarios.
Simplest case: binary tree showing hierarchical relationships.

Types of Hierarchical Clustering

Agglomerative: Starts with individual data points and merges them into larger clusters.
Divisive: Starts with all data points in one cluster and divides it into smaller clusters.
Both approaches share the same core idea of building a hierarchy.

Distance Matrices: Defining Relationships

Doesn't require data points in n-dimensional space, unlike other algorithms.
Requires a distance matrix: a notion of how far apart points are.
Data points can be diverse (medical files, YouTube videos) as long as you can measure distance between them.
Distance Matrix: A measure (0 to 1 or 0 to infinity) of how dissimilar points are.
Similarity Matrix: Can also use a similarity matrix (1 - distance matrix).

Example: Clustering Simpsons Characters

Example: Distance between twins is 1, identical characters (Lisa and Lisa) is 0.
Distance matrix is symmetrical, so only half needs to be stored.
Algorithm starts by finding the pair of points with the shortest distance.
These points are merged and treated as a single data point in the next step.

Iterative Merging Process

Algorithm iteratively merges the closest pairs until everything is in one mega-cluster.
At each step, it considers all possible mergers and chooses the pair with the smallest distance.
This process is repeated, building a tree of mergers.

Computational Complexity

Naively trying all possible clustering configurations is computationally infeasible.
With n data points, the number of possible dendrograms explodes (e.g., 10 points = 34 million dendrograms).
Brute force isn't practical.
A smarter approach is to find and merge the closest data points incrementally.

Choosing How to Calculate Distance

How to calculate the distance between a pair of clusters isn't a given; it's a choice.
Hierarchical clustering algorithms in Python have default options, but many choices exist.
This choice significantly impacts how clusters are formed.
Consider calculating the distance between existing cluster A (red) and cluster B (blue).

Different Linkage Types

Simple Linkage:
- Distance between clusters is the distance between their closest points.
- Easy to implement.
- Issue: Can create long, snake-like clusters due to small mergers.
Complete Linkage:
- Distance between clusters is the distance between their farthest points.
- Avoids snake-like clusters.
- Drawback: May not merge large, cohesive clusters if some points are far apart.
Average Linkage:
- Distance is the average distance between all points in the two clusters.
- Represents everyone in the cluster.
- Computational cost is high, especially for large datasets ( $O(n^2)$ ).
Centroid Linkage:
- Distance is the distance between the centroids of the clusters.
- Requires vector representation of data points.
- Works best when clusters are ball shaped; centroid may not represent odd-shaped clusters well.

Custom Linkage Functions

You can create custom distance metrics or linkage functions.
Example: For news stories, set distance to infinity if stories are more than seven days apart.
Can tweak the distance matrix to prioritize specific clustering goals.
Programming: Specify linkage = function f, where f is your custom function.
Consider the drawbacks and corner cases of custom algorithms.
Important to justify new measurements.

Ward's Method

Joins clusters only if it reduces the data points' overall distance from their centroids.
Calculates centroids and merges based on proximity to the overall centroid.
Implies if/else within the algorithm.
Can be adapted even without vector representation of data.

Step-by-Step Example

Example with five data points (1, 2, 3, 4, 5) and a distance matrix.
Step 1: Find closest pair (e.g., points 1 and 2).
Step 2: Create a new distance matrix.
- Rows/columns relating to data points 3, 4, and 5 remain the same.
- Calculate the distance between cluster 1-2 and everything else using simple linkage (minimum distance).
Copy values from the original matrix and apply the linkage criteria.

Iterating and Building the Tree

Repeat the operation: Find the pair of points that are closest in the new matrix.
Define new values in the distance matrix and repeat the process.
Just repeatedly computing from the distance matrix.
Eventually, all points are merged, forming a tree.

Linkage Algorithm Output

Output: A list containing the ID of point A, the ID of point B merged, the distance, and a ranking/order.
Challenge: Try writing this algorithm from scratch using distance matrix operations.

Historical Context

This method (or similar variants) was used in the 19th century to build biological classifications.
Doesn't require heavy computing.
Builds trees based on shared features.
Evolutionary biology now uses computationally intensive Bayesian methods.

Choosing Where to Cut the Tree

After building all clusters, decide where to cut the dendrogram.
Two approaches:
- Arbitrary: Cut based on desired number of clusters (e.g., two main clusters or five clusters).
- Looking at Distances: Plot the distances over which mergers occur.

Elbow Rule

Plot the number of mergers vs. the distance over which they were merged.
Merging small distances initially, then a sharp increase when merging distant clusters.
Identify the "elbow" in the plot where the slope changes significantly.
Represents the point where merging nearby points transitions to merging distant points.
Elbow rules are heuristics, not hard-coded rules.

The Importance of Examination

Examination often beats mathematical rules.
Ensure the clusters make sense and answer your research questions.
Consider different cluster numbers, even running clustering again on subsets of data.
Ultimately, unsupervised learning requires subjective judgment.
Trade-off: Fewer clusters = more generic; more clusters = more hyperspecific.

DBSCAN: Density-Based Spatial Clustering of Applications with Noise

Tackles weaknesses of hierarchical clustering.
Does not rely on centroids or merging clusters directly.
Focuses on finding dense areas of data points.
Requires points in vector space (coordinates).

Key Concepts of DBSCAN

Identifies crowded neighborhoods.
Defined by two hyperparameters:
- Epsilon (eps): Radius around a point (neighborhood size)
- Min Points: Minimum number of points within that radius for it to be dense i.e. crowded

Sensitivity to Hyperparameters

Large epsilon, low min points: Intolerant, anything feels like a cluster.
Low epsilon, high min points: Few clusters because hard to meet density requirement.

Density-Based Scanning

Algorithm scans data points: Is it crowded around you? (within radius eps). Are there min points in neighborhood?
If yes, the point is part of a cluster.
Examines points in neighborhood to see if they are crowded too.

Border and Noise Points

If a point (friend 3) in a cluster's neighborhood isn't crowded, it becomes a border point.
Points outside the neighborhood of border points are considered noise/outliers.
Clusters: Densely-packed areas with steps of size eps.
Outliers: Points far from everything.

DBSCAN Results

Output: Data points tagged cluster number 0 to N and points tagged as -1 are the outliers.
Resistant to noise and finds clusters without assuming shapes/sizes.
Finding the right epsilon and min points is challenging.

Parameter Optimization

Grid search: Test various eps and min points combinations.
Gradient Descent: Hyperparameter optimization techniques.
Requires a metric to evaluate clustering performance.

Limitations of DBSCAN

Struggles when clusters have different densities. May not be able to discern multiple clusters. Solution -- multiple runs of the algorithm with different hyper parameters.

DBSCAN Demonstration

Interactive demonstration using a smiley face dataset with outliers.
Radius and min points parameters control cluster density. If the minimum number of points is set to 1 then any data points is part of a crowd.
Min points, if correctly defined, ensures it takes a lot for it to be a cluster

Controlling Cluster Size

Setting min points to a high number (e.g., 1000 in a customer dataset) deliberately ignore small number of data points.
Can flip around to focus on small minority things to find small niches using a lowered min points parameter.

Different Run Results

Multiple runs with different hyperparameters yield different results.
Consider comparisons of different runs.
Keep track of when how data points are in the same cluster.
This can become a similarity metric.