Clustering Algorithms - Key Vocabulary
Clustering Methods Overview
- Clustering methods have different characteristics, advantages, and limitations. Major methods include:
- K-Means Clustering: Partitioning algorithm that divides data into K non-overlapping clusters based on similarity; objective is to minimize the within-cluster sum of squares.
- Hierarchical Clustering: Builds a hierarchy of clusters by recursively merging or splitting clusters; can be agglomerative (bottom-up) or divisive (top-down).
- DBSCAN (Density-Based Spatial Clustering of Applications with Noise): Density-based approach that groups together points in dense regions without requiring a pre-specified number of clusters; robust to noise.
- OPTICS (Ordering Points To Identify the Clustering Structure): Extension of DBSCAN that produces a hierarchical clustering structure and handles varying densities better.
- Mean Shift Clustering: Non-parametric method that shifts centroids toward areas of higher data density to identify clusters.
- Agglomerative Clustering: A hierarchical method that starts with individual points and repeatedly merges clusters using a linkage criterion (e.g., single, complete, average).
- BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies): Designed for large datasets; incrementally builds a clustering hierarchy using a tree structure.
- Gaussian Mixture Models (GMM): Model-based clustering assuming data come from a mixture of Gaussian distributions; parameters are estimated to identify clusters.
- Fuzzy C-Means Clustering: Soft clustering where points have degrees of membership to multiple clusters.
- Spectral Clustering: Graph-based method using eigenvalues of a similarity matrix to partition data.
- Affinity Propagation: Identifies exemplars (representative points) by message passing between data points.
- Self-Organizing Maps (SOM): Unsupervised learning that projects high-dimensional data onto a low-dimensional grid and clusters similar points.
- These are representative examples; many variations exist. The choice of method depends on the nature of the data, the desired clustering structure, computational constraints, and the interpretability of results.
K-Means Clustering
Overview: Popular unsupervised algorithm for clustering similar data points into K groups; aims to minimize the within-cluster sum of squares (inertia).
Distance metric: Typically uses Euclidean distance, though other metrics can be used.
Key objective: Minimize inertia, the sum of squared distances of samples to their assigned cluster centroids.
Common reasons for use: Simple, scalable, widely understood.
Key concepts:
- Centroids: Each cluster has a centroid representing its center; for a cluster C, centroid μ is the mean of all points in C.
- Assignment based on proximity: Each point is assigned to the nearest centroid.
- Initialisation sensitivity: Outcomes depend on initial centroid positions; different runs can yield different clusters.
- Hyperparameter K: The number of clusters must be specified in advance.
- Mitigation for sensitivity: Run the algorithm multiple times with different initializations and choose the clustering with the lowest inertia.
Outputs after convergence:
- Final cluster centroids representing cluster centers.
- Each data point assigned to one of the clusters.
Key points:
- Inertia (within-cluster sum of squares) is minimized: \mathrm{Inertia} = \sum_{k=1}^{K} \sum_{x_i \in C_k} \| x_i - \mu_k \|^2.
- Initialisation sensitivity can lead to different results; multiple restarts help.
- K is a hyperparameter that needs to be chosen prior to running the algorithm.
- K-Means is simple and scalable, which contributes to its popularity.
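To make the outputs and key points above concrete, here is a minimal scikit-learn sketch. The synthetic blob data, K = 3, and 10 restarts are illustrative assumptions; `cluster_centers_`, `labels_`, and `inertia_` correspond to the centroids, cluster assignments, and within-cluster sum of squares discussed above.

```python
# Minimal K-Means example with scikit-learn (illustrative parameter choices).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 well-separated groups (assumed for illustration).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# n_init=10 runs the algorithm with 10 different initializations and keeps
# the solution with the lowest inertia, mitigating initialization sensitivity.
km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(X)

print(km.cluster_centers_)   # final centroids (cluster centers)
print(labels[:10])           # cluster assignment for the first 10 points
print(km.inertia_)           # within-cluster sum of squares (inertia)
```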
K-Means algorithm (steps):
- Initialization:
- Choose the number of clusters K.
- Randomly initialize K cluster centroids in the data space.
- Assignment Step:
- For each data point, compute the distance to each centroid.
- Assign the point to the nearest centroid.
- Note: Euclidean distance is the typical metric; other metrics can be used.
- Update Step:
- Recompute the centroid of each cluster as the mean of all points assigned to that cluster.
- Convergence Check:
- Check if centroids have stabilized (no significant change) or a maximum number of iterations is reached.
- If stabilized, stop; otherwise, repeat the assignment and update steps.
- Output:
- Final centroids represent cluster centers.
- Each data point has an assigned cluster.
K-Means pseudocode (from transcript):
- Input: K (number of clusters), D (list of data points)
- 1. Choose K random data points as initial centroids.
- 2. Repeat until centroids stabilize:
- a. Allocate each point in D to the nearest of the K centroids.
- b. Recompute each centroid as the mean of the points currently assigned to its cluster.
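The pseudocode above can be turned into a short from-scratch implementation. The following NumPy sketch is illustrative: the function name, random initialization from existing data points, and Euclidean distance are assumptions consistent with the notes.

```python
# From-scratch K-Means following the pseudocode above (NumPy sketch).
import numpy as np

def kmeans(D, K, max_iter=100, seed=0):
    # D: (n, d) array of data points; K: number of clusters.
    rng = np.random.default_rng(seed)
    # 1. Choose K random data points as initial centroids.
    centroids = D[rng.choice(len(D), size=K, replace=False)]
    for _ in range(max_iter):
        # 2a. Allocate each point to the nearest centroid (Euclidean distance).
        dists = np.linalg.norm(D[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 2b. Recompute each centroid as the mean of its assigned points.
        new_centroids = np.array([
            D[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(K)
        ])
        # Convergence check: stop once the centroids have stabilized.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```

Calling this on the blob data from the earlier sketch, e.g. `kmeans(X, K=3)`, returns the final centroids and one cluster label per point.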
Practical notes on K-Means:
- The step-by-step flow above mirrors standard practice; convergence is typically fast on many datasets.
- The elbow method and silhouette method are used to help determine a reasonable K.
Determining the number of clusters (K) for K-Means:
- Two common methods:
- Elbow Method: Plot within-cluster sum of squares (WCSS) against K and look for a bend (elbow) where improvement slows.
- Silhouette Method: Use silhouette scores to assess how well each point lies within its cluster compared to neighboring clusters.
K-Means clustering workflow for choosing K (Elbow Method example):
- Step 1: Apply K-Means for several values of K (e.g., K = 1 to 10).
- Step 2: For each K, compute \mathrm{WCSS} = \sum_{d_i \in D} \mathrm{dist}(d_i, C_k)^2, where C_k is the centroid of the cluster to which d_i belongs.
- Step 3: Plot WCSS versus K.
- Step 4: The elbow point (bend) on the plot suggests the approximate number of clusters.
- Notation from transcript visuals: K1, K2, K3 are example cluster counts; the bend (elbow) in the plot suggests a suitable K (see the plotting sketch below).
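A minimal Elbow Method sketch, assuming the same kind of synthetic blob data as above and using scikit-learn's `inertia_` attribute as the WCSS value for each fitted model:

```python
# Elbow Method sketch: fit K-Means for K = 1..10 and plot WCSS (inertia) vs K.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

ks = range(1, 11)
wcss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, wcss, marker="o")
plt.xlabel("K (number of clusters)")
plt.ylabel("WCSS (inertia)")
plt.title("Elbow Method")
plt.show()
```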
Practical considerations about K selection and interpretation:
- The elbow point is not always unambiguous; multiple interpretations may exist.
- Silhouette scores provide a complementary perspective on cluster separation and compactness.
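For the complementary silhouette perspective, a small sketch using scikit-learn's `silhouette_score` (the score is only defined for K ≥ 2; the data and range of K are illustrative assumptions):

```python
# Silhouette Method sketch: a higher mean score suggests better-separated,
# more compact clusters; compare scores across candidate values of K.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
```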
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
- Purpose: Density-based clustering that can identify clusters of arbitrary shape and handle noise; does not require specifying the number of clusters upfront.
- Key parameters:
- ε (epsilon): Maximum distance between two points for them to be considered neighbors.
- MinPts: Minimum number of points required to form a dense region (including the point itself).
- Core points, border points, and noise points:
- Core point: A point with at least MinPts points (including itself) within its ε-neighborhood.
- Border point: A point within the ε-neighborhood of a core point but not a core point itself.
- Noise point: Neither a core point nor a border point.
- Cluster formation process:
- Start with an arbitrary unvisited point.
- Find all points within its ε-neighborhood and classify as core, border, or noise.
- If the selected point is a core point, start a new cluster and expand it by repeatedly adding the points that are density-reachable from it, i.e., the ε-neighbors of every core point reached along the way (see the sketch after this list).
- Repeat with another unvisited point until all points are visited.
- Cluster assignment and output:
- Border points are assigned to the cluster of a core point within whose ε-neighborhood they fall.
- Noise points remain unassigned.
- Output is a set of clusters plus identified noise points.
- Key properties and advantages:
- Identifies clusters of arbitrary shape.
- Groups points by local density rather than by distance to a global centroid.
- Does not require specifying the number of clusters beforehand.
- Robust to noise and outliers (to a degree).
- Limitations and considerations:
- Performance and results are sensitive to the choice of ε and MinPts, which may need tuning to match the data; a single ε also makes clusters of widely varying densities hard to capture (OPTICS handles this better).
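The cluster formation process described above can be sketched from scratch. The following NumPy implementation is illustrative (the names `dbscan`, `eps`, and `min_pts` are assumptions): it labels noise as -1 and expands clusters outward from core points.

```python
# From-scratch DBSCAN sketch following the formation process above.
import numpy as np

def dbscan(X, eps, min_pts):
    n = len(X)
    labels = np.full(n, -1)              # -1 = noise / not yet assigned
    visited = np.zeros(n, dtype=bool)
    # Precompute epsilon-neighborhoods (fine for small, toy-sized data).
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbors = [np.flatnonzero(dists[i] <= eps) for i in range(n)]

    cluster_id = 0
    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        if len(neighbors[i]) < min_pts:  # not a core point (may still become a border point)
            continue
        labels[i] = cluster_id           # start a new cluster from this core point
        seeds = list(neighbors[i])
        while seeds:
            j = seeds.pop()
            if not visited[j]:
                visited[j] = True
                if len(neighbors[j]) >= min_pts:
                    seeds.extend(neighbors[j])   # j is also a core point: grow the frontier
            if labels[j] == -1:
                labels[j] = cluster_id           # claim border / unassigned points
        cluster_id += 1
    return labels
```

With scikit-learn, the equivalent call looks like the following (ε and MinPts values are illustrative and generally need tuning); noise points are labelled -1:

```python
# DBSCAN with scikit-learn on non-spherical (two-moons) data.
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.07, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_                       # noise points are labeled -1
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters:", n_clusters, "noise points:", list(labels).count(-1))
```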
Other Clustering Methods (brief descriptions from transcript)
- Agglomerative Clustering: A hierarchical method starting with each data point as its own cluster and recursively merging clusters based on a linkage criterion (e.g., single linkage, complete linkage, average linkage).
- Mean Shift Clustering: A non-parametric clustering technique that shifts centroids toward areas of higher data density to discover clusters.
- BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies): A hierarchical clustering method designed for large datasets that incrementally builds a clustering hierarchy using a tree structure.
- Gaussian Mixture Models (GMM): A model-based clustering approach assuming data are generated from a mixture of Gaussian distributions; it estimates the parameters of the distributions to identify clusters.
- Fuzzy C-Means Clustering: A soft clustering algorithm where data points are assigned to clusters probabilistically, allowing multiple memberships with varying degrees.
- Spectral Clustering: A graph-based technique that uses the eigenvalues (and eigenvectors) of a similarity matrix to partition data.
- Affinity Propagation: A clustering method that identifies exemplars within the data and iteratively updates clusters via message passing.
- Self-Organizing Maps (SOM): An unsupervised learning algorithm projecting high-dimensional data onto a low-dimensional grid to cluster similar points.
- OPTICS (Ordering Points To Identify the Clustering Structure): Produces a hierarchical clustering structure and is more robust to varying densities within clusters.
- Practical note: The choice of method depends on data characteristics, the desired clustering structure, computational considerations, and interpretability of results.
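As a small illustration of two of the methods above (data and parameter choices are assumptions for a synthetic example): a Gaussian Mixture Model yields soft, probabilistic memberships via `predict_proba`, while agglomerative clustering returns hard labels after merging with a chosen linkage criterion.

```python
# Sketch: model-based (GMM) vs hierarchical (agglomerative) clustering.
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
print(gmm.predict_proba(X)[:3])          # degrees of membership per cluster

agg = AgglomerativeClustering(n_clusters=3, linkage="average")
print(agg.fit_predict(X)[:10])           # hard cluster labels
```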
Practical Takeaways and Connections
- Distance metrics matter: K-Means typically uses Euclidean distance; other metrics can be used depending on data and problem.
- Cluster shape assumptions:
- K-Means assumes roughly spherical (hyperellipsoidal) clusters of similar size and is not well suited to irregular, non-convex shapes.
- DBSCAN can detect irregularly shaped clusters, and OPTICS additionally copes better with varying densities (contrasted in the sketch after this list).
- Model-based vs partitioning vs density-based:
- GMM is model-based, providing probabilistic cluster memberships.
- DBSCAN/OPTICS are density-based, robust to noise and capable of discovering clusters of arbitrary shapes.
- Spectral and hierarchical methods offer different perspectives on clustering structure.
- Determining the number of clusters is a common challenge: K must be chosen for K-Means, whereas DBSCAN/OPTICS do not require it upfront.
- Practical implications:
- Initialisation and parameter choices can strongly influence results.
- Computational efficiency varies: K-Means scales well; DBSCAN/OPTICS can be slower on large datasets and require careful parameter tuning.
- Interpretability: Simpler methods like K-Means are easier to interpret but may be less flexible; more complex methods can capture nuanced structures but may be harder to explain.
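To illustrate the cluster-shape point from the takeaways above, here is a small comparison on two-moons data (an assumed synthetic example with illustrative ε and MinPts values): with typical settings, DBSCAN tends to recover the two crescents while K-Means with K = 2 cuts across them, which shows up in the adjusted Rand index against the generating labels.

```python
# K-Means vs DBSCAN on non-convex, crescent-shaped clusters.
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

print("K-Means ARI:", round(adjusted_rand_score(y_true, km_labels), 3))
print("DBSCAN  ARI:", round(adjusted_rand_score(y_true, db_labels), 3))
```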
Equations and Notation (as referenced in transcript)
Within-cluster sum of squares (WCSS) / inertia for K-Means:
- Inertia: \mathrm{Inertia} = \sum_{k=1}^{K} \sum_{x_i \in C_k} \| x_i - \mu_k \|^2, where \mu_k is the centroid of cluster C_k.
- Alternatively, in compact form: \mathrm{WCSS} = \sum_{d_i \in D} \mathrm{dist}(d_i, C_{\mathrm{cluster}(d_i)})^2, where C_k is the centroid of cluster k and \mathrm{cluster}(d_i) is the cluster containing d_i.
Step-wise K-Means updates rely on centroid computation and assignment based on distance to centroids.
For the Elbow Method: no explicit formula beyond WCSS; the method relies on plotting and identifying the elbow bend.
For DBSCAN: core concepts are defined via ε and MinPts, but no single formula is provided in the transcript beyond the distance-based neighborhood concept.
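As a check on the notation, the inertia/WCSS formula can be computed directly. This small NumPy sketch assumes `X`, `labels`, and `centroids` come from a fitted K-Means run; its result should match scikit-learn's `inertia_` for the same fit.

```python
# Direct computation of the inertia / WCSS formula above.
import numpy as np

def wcss(X, labels, centroids):
    # Sum over clusters k of ||x_i - mu_k||^2 for the points x_i assigned to cluster k.
    return sum(
        np.sum((X[labels == k] - mu) ** 2)
        for k, mu in enumerate(centroids)
    )
```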
Connections to Foundational Concepts
- Clustering vs classification: Clustering is unsupervised; labels are not provided a priori.
- Distance metrics underpin most clustering decisions and affect cluster boundaries.
- Dimensionality and density influence method performance; higher-dimensional spaces pose challenges for distance-based methods due to the curse of dimensionality.
Practical Tips for Exam Preparation
- Be able to describe each method’s core idea, typical use cases, and key advantages/disadvantages.
- Memorize the two primary K selection techniques and how to apply them (Elbow and Silhouette) with brief steps.
- Know the standard K-Means update rules and the role of centroids as cluster means.
- Understand DBSCAN core concepts (core/border/noise) and parameter meanings (ε, MinPts).
- Recognize the kinds of data shapes and density patterns each method can handle well or poorly.
- Be able to write and interpret the WCSS formula and explain the purpose of plotting WCSS versus K in the Elbow Method.