Clustering Algorithms Detailed Notes
Hierarchical Clustering: Intuitive Organization
- Most intuitive clustering method, mirroring human organization.
- Example: Organizing languages or biological species into trees.
- In 19th-century biology, trees were built based on features (e.g., animal has wings, legs).
- Objective: Grouping entities with shared features, the core idea of clustering.
Representing Levels of Clustering
- Excellent for representing hierarchical levels within data.
- Example: Grouping five data points into four, three, or two subgroups.
- Generates a tree structure, useful if data naturally falls into groups and subgroups.
- Generally performs hard clustering, where each data point belongs to only one cluster.
- Soft hierarchical clustering exists, based on probabilities and proportions, but hard clustering is more common.
- Key Feature: Builds a tree that can be cut at different levels to determine subgroup organization.
Dendrograms: Visualizing Hierarchy
- Tree diagram representing the hierarchy is called a dendrogram (dendro = tree).
- Typically binary trees for simplicity (Occam's razor).
- Non-binary dendrograms are rare unless in soft clustering scenarios.
- Simplest case: binary tree showing hierarchical relationships.
Types of Hierarchical Clustering
- Agglomerative: Starts with individual data points and merges them into larger clusters.
- Divisive: Starts with all data points in one cluster and divides it into smaller clusters.
- Both approaches share the same core idea of building a hierarchy.
Distance Matrices: Defining Relationships
- Doesn't require data points in n-dimensional space, unlike other algorithms.
- Requires a distance matrix: a notion of how far apart points are.
- Data points can be diverse (medical files, YouTube videos) as long as you can measure distance between them.
- Distance Matrix: A measure (0 to 1 or 0 to infinity) of how dissimilar points are.
- Similarity Matrix: Can also use a similarity matrix (1 - distance matrix).
Example: Clustering Simpsons Characters
- Example: Distance between twins is 1, identical characters (Lisa and Lisa) is 0.
- Distance matrix is symmetrical, so only half needs to be stored.
- Algorithm starts by finding the pair of points with the shortest distance.
- These points are merged and treated as a single data point in the next step.
Iterative Merging Process
- Algorithm iteratively merges the closest pairs until everything is in one mega-cluster.
- At each step, it considers all possible mergers and chooses the pair with the smallest distance.
- This process is repeated, building a tree of mergers.
Computational Complexity
- Naively trying all possible clustering configurations is computationally infeasible.
- With n data points, the number of possible dendrograms explodes (e.g., 10 points = 34 million dendrograms).
- Brute force isn't practical.
- A smarter approach is to find and merge the closest data points incrementally.
Choosing How to Calculate Distance
- How to calculate the distance between a pair of clusters isn't a given; it's a choice.
- Hierarchical clustering algorithms in Python have default options, but many choices exist.
- This choice significantly impacts how clusters are formed.
- Consider calculating the distance between existing cluster A (red) and cluster B (blue).
Different Linkage Types
Simple Linkage:
- Distance between clusters is the distance between their closest points.
- Easy to implement.
- Issue: Can create long, snake-like clusters due to small mergers.
Complete Linkage:
- Distance between clusters is the distance between their farthest points.
- Avoids snake-like clusters.
- Drawback: May not merge large, cohesive clusters if some points are far apart.
Average Linkage:
- Distance is the average distance between all points in the two clusters.
- Represents everyone in the cluster.
- Computational cost is high, especially for large datasets ().
Centroid Linkage:
- Distance is the distance between the centroids of the clusters.
- Requires vector representation of data points.
- Works best when clusters are ball shaped; centroid may not represent odd-shaped clusters well.
Custom Linkage Functions
- You can create custom distance metrics or linkage functions.
- Example: For news stories, set distance to infinity if stories are more than seven days apart.
- Can tweak the distance matrix to prioritize specific clustering goals.
- Programming: Specify linkage = function f, where f is your custom function.
- Consider the drawbacks and corner cases of custom algorithms.
- Important to justify new measurements.
Ward's Method
- Joins clusters only if it reduces the data points' overall distance from their centroids.
- Calculates centroids and merges based on proximity to the overall centroid.
- Implies if/else within the algorithm.
- Can be adapted even without vector representation of data.
Step-by-Step Example
- Example with five data points (1, 2, 3, 4, 5) and a distance matrix.
- Step 1: Find closest pair (e.g., points 1 and 2).
- Step 2: Create a new distance matrix.
- Rows/columns relating to data points 3, 4, and 5 remain the same.
- Calculate the distance between cluster 1-2 and everything else using simple linkage (minimum distance).
- Copy values from the original matrix and apply the linkage criteria.
Iterating and Building the Tree
- Repeat the operation: Find the pair of points that are closest in the new matrix.
- Define new values in the distance matrix and repeat the process.
- Just repeatedly computing from the distance matrix.
- Eventually, all points are merged, forming a tree.
Linkage Algorithm Output
- Output: A list containing the ID of point A, the ID of point B merged, the distance, and a ranking/order.
- Challenge: Try writing this algorithm from scratch using distance matrix operations.
Historical Context
- This method (or similar variants) was used in the 19th century to build biological classifications.
- Doesn't require heavy computing.
- Builds trees based on shared features.
- Evolutionary biology now uses computationally intensive Bayesian methods.
Choosing Where to Cut the Tree
After building all clusters, decide where to cut the dendrogram.
Two approaches:
- Arbitrary: Cut based on desired number of clusters (e.g., two main clusters or five clusters).
- Looking at Distances: Plot the distances over which mergers occur.
Elbow Rule
- Plot the number of mergers vs. the distance over which they were merged.
- Merging small distances initially, then a sharp increase when merging distant clusters.
- Identify the "elbow" in the plot where the slope changes significantly.
- Represents the point where merging nearby points transitions to merging distant points.
- Elbow rules are heuristics, not hard-coded rules.
The Importance of Examination
- Examination often beats mathematical rules.
- Ensure the clusters make sense and answer your research questions.
- Consider different cluster numbers, even running clustering again on subsets of data.
- Ultimately, unsupervised learning requires subjective judgment.
- Trade-off: Fewer clusters = more generic; more clusters = more hyperspecific.
DBSCAN: Density-Based Spatial Clustering of Applications with Noise
- Tackles weaknesses of hierarchical clustering.
- Does not rely on centroids or merging clusters directly.
- Focuses on finding dense areas of data points.
- Requires points in vector space (coordinates).
Key Concepts of DBSCAN
- Identifies crowded neighborhoods.
- Defined by two hyperparameters:
- Epsilon (eps): Radius around a point (neighborhood size)
- Min Points: Minimum number of points within that radius for it to be dense i.e. crowded
Sensitivity to Hyperparameters
- Large epsilon, low min points: Intolerant, anything feels like a cluster.
- Low epsilon, high min points: Few clusters because hard to meet density requirement.
Density-Based Scanning
- Algorithm scans data points: Is it crowded around you? (within radius eps). Are there min points in neighborhood?
- If yes, the point is part of a cluster.
- Examines points in neighborhood to see if they are crowded too.
Border and Noise Points
- If a point (friend 3) in a cluster's neighborhood isn't crowded, it becomes a border point.
- Points outside the neighborhood of border points are considered noise/outliers.
- Clusters: Densely-packed areas with steps of size eps.
- Outliers: Points far from everything.
DBSCAN Results
- Output: Data points tagged cluster number 0 to N and points tagged as -1 are the outliers.
- Resistant to noise and finds clusters without assuming shapes/sizes.
- Finding the right epsilon and min points is challenging.
Parameter Optimization
- Grid search: Test various eps and min points combinations.
- Gradient Descent: Hyperparameter optimization techniques.
- Requires a metric to evaluate clustering performance.
Limitations of DBSCAN
- Struggles when clusters have different densities. May not be able to discern multiple clusters. Solution -- multiple runs of the algorithm with different hyper parameters.
DBSCAN Demonstration
- Interactive demonstration using a smiley face dataset with outliers.
- Radius and min points parameters control cluster density. If the minimum number of points is set to 1 then any data points is part of a crowd.
- Min points, if correctly defined, ensures it takes a lot for it to be a cluster
Controlling Cluster Size
- Setting min points to a high number (e.g., 1000 in a customer dataset) deliberately ignore small number of data points.
- Can flip around to focus on small minority things to find small niches using a lowered min points parameter.
Different Run Results
- Multiple runs with different hyperparameters yield different results.
- Consider comparisons of different runs.
- Keep track of when how data points are in the same cluster.
- This can become a similarity metric.