Clustering Algorithms Detailed Notes

Hierarchical Clustering: Intuitive Organization

  • Most intuitive clustering method, mirroring human organization.
  • Example: Organizing languages or biological species into trees.
  • In 19th-century biology, trees were built based on features (e.g., animal has wings, legs).
  • Objective: Grouping entities with shared features, the core idea of clustering.

Representing Levels of Clustering

  • Excellent for representing hierarchical levels within data.
  • Example: Grouping five data points into four, three, or two subgroups.
  • Generates a tree structure, useful if data naturally falls into groups and subgroups.
  • Generally performs hard clustering, where each data point belongs to only one cluster.
  • Soft hierarchical clustering exists, based on probabilities and proportions, but hard clustering is more common.
  • Key Feature: Builds a tree that can be cut at different levels to determine subgroup organization.

Dendrograms: Visualizing Hierarchy

  • Tree diagram representing the hierarchy is called a dendrogram (dendro = tree).
  • Typically binary trees for simplicity (Occam's razor).
  • Non-binary dendrograms are rare unless in soft clustering scenarios.
  • Simplest case: binary tree showing hierarchical relationships.

Types of Hierarchical Clustering

  • Agglomerative: Starts with individual data points and merges them into larger clusters.
  • Divisive: Starts with all data points in one cluster and divides it into smaller clusters.
  • Both approaches share the same core idea of building a hierarchy.

Distance Matrices: Defining Relationships

  • Doesn't require data points in n-dimensional space, unlike other algorithms.
  • Requires a distance matrix: a notion of how far apart points are.
  • Data points can be diverse (medical files, YouTube videos) as long as you can measure distance between them.
  • Distance Matrix: A measure (0 to 1 or 0 to infinity) of how dissimilar points are.
  • Similarity Matrix: Can also use a similarity matrix (1 - distance matrix).

Example: Clustering Simpsons Characters

  • Example: Distance between twins is 1, identical characters (Lisa and Lisa) is 0.
  • Distance matrix is symmetrical, so only half needs to be stored.
  • Algorithm starts by finding the pair of points with the shortest distance.
  • These points are merged and treated as a single data point in the next step.

Iterative Merging Process

  • Algorithm iteratively merges the closest pairs until everything is in one mega-cluster.
  • At each step, it considers all possible mergers and chooses the pair with the smallest distance.
  • This process is repeated, building a tree of mergers.

Computational Complexity

  • Naively trying all possible clustering configurations is computationally infeasible.
  • With n data points, the number of possible dendrograms explodes (e.g., 10 points = 34 million dendrograms).
  • Brute force isn't practical.
  • A smarter approach is to find and merge the closest data points incrementally.

Choosing How to Calculate Distance

  • How to calculate the distance between a pair of clusters isn't a given; it's a choice.
  • Hierarchical clustering algorithms in Python have default options, but many choices exist.
  • This choice significantly impacts how clusters are formed.
  • Consider calculating the distance between existing cluster A (red) and cluster B (blue).

Different Linkage Types

  • Simple Linkage:

    • Distance between clusters is the distance between their closest points.
    • Easy to implement.
    • Issue: Can create long, snake-like clusters due to small mergers.
  • Complete Linkage:

    • Distance between clusters is the distance between their farthest points.
    • Avoids snake-like clusters.
    • Drawback: May not merge large, cohesive clusters if some points are far apart.
  • Average Linkage:

    • Distance is the average distance between all points in the two clusters.
    • Represents everyone in the cluster.
    • Computational cost is high, especially for large datasets (O(n2)O(n^2)).
  • Centroid Linkage:

    • Distance is the distance between the centroids of the clusters.
    • Requires vector representation of data points.
    • Works best when clusters are ball shaped; centroid may not represent odd-shaped clusters well.

Custom Linkage Functions

  • You can create custom distance metrics or linkage functions.
  • Example: For news stories, set distance to infinity if stories are more than seven days apart.
  • Can tweak the distance matrix to prioritize specific clustering goals.
  • Programming: Specify linkage = function f, where f is your custom function.
  • Consider the drawbacks and corner cases of custom algorithms.
  • Important to justify new measurements.

Ward's Method

  • Joins clusters only if it reduces the data points' overall distance from their centroids.
  • Calculates centroids and merges based on proximity to the overall centroid.
  • Implies if/else within the algorithm.
  • Can be adapted even without vector representation of data.

Step-by-Step Example

  • Example with five data points (1, 2, 3, 4, 5) and a distance matrix.
  • Step 1: Find closest pair (e.g., points 1 and 2).
  • Step 2: Create a new distance matrix.
    • Rows/columns relating to data points 3, 4, and 5 remain the same.
    • Calculate the distance between cluster 1-2 and everything else using simple linkage (minimum distance).
  • Copy values from the original matrix and apply the linkage criteria.

Iterating and Building the Tree

  • Repeat the operation: Find the pair of points that are closest in the new matrix.
  • Define new values in the distance matrix and repeat the process.
  • Just repeatedly computing from the distance matrix.
  • Eventually, all points are merged, forming a tree.

Linkage Algorithm Output

  • Output: A list containing the ID of point A, the ID of point B merged, the distance, and a ranking/order.
  • Challenge: Try writing this algorithm from scratch using distance matrix operations.

Historical Context

  • This method (or similar variants) was used in the 19th century to build biological classifications.
  • Doesn't require heavy computing.
  • Builds trees based on shared features.
  • Evolutionary biology now uses computationally intensive Bayesian methods.

Choosing Where to Cut the Tree

  • After building all clusters, decide where to cut the dendrogram.

  • Two approaches:

    • Arbitrary: Cut based on desired number of clusters (e.g., two main clusters or five clusters).
    • Looking at Distances: Plot the distances over which mergers occur.

Elbow Rule

  • Plot the number of mergers vs. the distance over which they were merged.
  • Merging small distances initially, then a sharp increase when merging distant clusters.
  • Identify the "elbow" in the plot where the slope changes significantly.
  • Represents the point where merging nearby points transitions to merging distant points.
  • Elbow rules are heuristics, not hard-coded rules.

The Importance of Examination

  • Examination often beats mathematical rules.
  • Ensure the clusters make sense and answer your research questions.
  • Consider different cluster numbers, even running clustering again on subsets of data.
  • Ultimately, unsupervised learning requires subjective judgment.
  • Trade-off: Fewer clusters = more generic; more clusters = more hyperspecific.

DBSCAN: Density-Based Spatial Clustering of Applications with Noise

  • Tackles weaknesses of hierarchical clustering.
  • Does not rely on centroids or merging clusters directly.
  • Focuses on finding dense areas of data points.
  • Requires points in vector space (coordinates).

Key Concepts of DBSCAN

  • Identifies crowded neighborhoods.
  • Defined by two hyperparameters:
    • Epsilon (eps): Radius around a point (neighborhood size)
    • Min Points: Minimum number of points within that radius for it to be dense i.e. crowded

Sensitivity to Hyperparameters

  • Large epsilon, low min points: Intolerant, anything feels like a cluster.
  • Low epsilon, high min points: Few clusters because hard to meet density requirement.

Density-Based Scanning

  • Algorithm scans data points: Is it crowded around you? (within radius eps). Are there min points in neighborhood?
  • If yes, the point is part of a cluster.
  • Examines points in neighborhood to see if they are crowded too.

Border and Noise Points

  • If a point (friend 3) in a cluster's neighborhood isn't crowded, it becomes a border point.
  • Points outside the neighborhood of border points are considered noise/outliers.
  • Clusters: Densely-packed areas with steps of size eps.
  • Outliers: Points far from everything.

DBSCAN Results

  • Output: Data points tagged cluster number 0 to N and points tagged as -1 are the outliers.
  • Resistant to noise and finds clusters without assuming shapes/sizes.
  • Finding the right epsilon and min points is challenging.

Parameter Optimization

  • Grid search: Test various eps and min points combinations.
  • Gradient Descent: Hyperparameter optimization techniques.
  • Requires a metric to evaluate clustering performance.

Limitations of DBSCAN

  • Struggles when clusters have different densities. May not be able to discern multiple clusters. Solution -- multiple runs of the algorithm with different hyper parameters.

DBSCAN Demonstration

  • Interactive demonstration using a smiley face dataset with outliers.
  • Radius and min points parameters control cluster density. If the minimum number of points is set to 1 then any data points is part of a crowd.
  • Min points, if correctly defined, ensures it takes a lot for it to be a cluster

Controlling Cluster Size

  • Setting min points to a high number (e.g., 1000 in a customer dataset) deliberately ignore small number of data points.
  • Can flip around to focus on small minority things to find small niches using a lowered min points parameter.

Different Run Results

  • Multiple runs with different hyperparameters yield different results.
  • Consider comparisons of different runs.
  • Keep track of when how data points are in the same cluster.
  • This can become a similarity metric.