Clustering Algorithms - Key Vocabulary
Clustering Methods Overview
- Clustering methods have different characteristics, advantages, and limitations. Major methods include:
- K-Means Clustering: Partitioning algorithm that divides data into K non-overlapping clusters based on similarity; objective is to minimize the within-cluster sum of squares.
- Hierarchical Clustering: Builds a hierarchy of clusters by recursively merging or splitting clusters; can be agglomerative (bottom-up) or divisive (top-down).
- DBSCAN (Density-Based Spatial Clustering of Applications with Noise): Density-based approach that groups together points in dense regions without requiring a pre-specified number of clusters; robust to noise.
- OPTICS (Ordering Points To Identify the Clustering Structure): Extension of DBSCAN that produces a hierarchical clustering structure and handles varying densities better.
- Mean Shift Clustering: Non-parametric method that shifts centroids toward areas of higher data density to identify clusters.
- Agglomerative Clustering: A hierarchical method that starts with individual points and repeatedly merges clusters using a linkage criterion (e.g., single, complete, average).
- BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies): Designed for large datasets; incrementally builds a clustering hierarchy using a tree structure.
- Gaussian Mixture Models (GMM): Model-based clustering assuming data come from a mixture of Gaussian distributions; parameters are estimated to identify clusters.
- Fuzzy C-Means Clustering: Soft clustering where points have degrees of membership to multiple clusters.
- Spectral Clustering: Graph-based method using eigenvalues of a similarity matrix to partition data.
- Affinity Propagation: Identifies exemplars (representative points) by message passing between data points.
- Self-Organizing Maps (SOM): Unsupervised learning that projects high-dimensional data onto a low-dimensional grid and clusters similar points.
- These are representative examples; many variations exist. The choice of method depends on the nature of the data, the desired clustering structure, computational constraints, and the interpretability of results.
K-Means Clustering
Overview: Popular unsupervised algorithm for clustering similar data points into K groups; aims to minimize the within-cluster sum of squares (inertia).
Distance metric: Typically uses Euclidean distance, though other metrics can be used.
Key objective: Minimize inertia, the sum of squared distances of samples to their assigned cluster centroids.
Common reasons for use: Simple, scalable, widely understood.
Key concepts:
- Centroids: Each cluster has a centroid representing its center; for a cluster C, centroid μ is the mean of all points in C.
- Assignment based on proximity: Each point is assigned to the nearest centroid.
- Initialisation sensitivity: Outcomes depend on initial centroid positions; different runs can yield different clusters.
- Hyperparameter K: The number of clusters must be specified in advance.
- Mitigation for sensitivity: Run the algorithm multiple times with different initializations and choose the clustering with the lowest inertia.
Outputs after convergence:
- Final cluster centroids representing cluster centers.
- Each data point assigned to one of the clusters.
Key points:
- Inertia (within-cluster sum of squares) is minimized: \mathrm{Inertia} = \sum_{k=1}^{K} \sum_{x_i \in C_k} \| x_i - \mu_k \|^2.
- Initialisation sensitivity can lead to different results; multiple restarts help.
- K is a hyperparameter that needs to be chosen prior to running the algorithm.
- K-Means is simple and scalable, which contributes to its popularity.
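To make the outputs and key points above concrete, here is a minimal scikit-learn sketch. The synthetic blob data, K = 3, and 10 restarts are illustrative assumptions; `cluster_centers_`, `labels_`, and `inertia_` correspond to the centroids, cluster assignments, and within-cluster sum of squares discussed above.

```python
# Minimal K-Means example with scikit-learn (illustrative parameter choices).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 well-separated groups (assumed for illustration).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# n_init=10 runs the algorithm with 10 different initializations and keeps
# the solution with the lowest inertia, mitigating initialization sensitivity.
km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(X)

print(km.cluster_centers_)   # final centroids (cluster centers)
print(labels[:10])           # cluster assignment for the first 10 points
print(km.inertia_)           # within-cluster sum of squares (inertia)
```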
K-Means algorithm (steps):
- Initialization:
- Choose the number of clusters K.
- Randomly initialize K cluster centroids in the data space.
- Assignment Step:
- For each data point, compute the distance to each centroid.
- Assign the point to the nearest centroid.
- Note: Euclidean distance is the typical metric; other metrics can be used.
- Update Step:
- Recompute the centroid of each cluster as the mean of all points assigned to that cluster.
- Convergence Check:
- Check if centroids have stabilized (no significant change) or a maximum number of iterations is reached.
- If stabilized, stop; otherwise, repeat the assignment and update steps.
- Output:
- Final centroids represent cluster centers.
- Each data point has an assigned cluster.
K-Means pseudocode (from transcript):
- Input: K (number of clusters), D (list of data points)
- 1. Choose K random data points as initial centroids.
- 2. Repeat until centroids stabilize:
- a. Allocate each point in D to the nearest of the K centroids.
- b. Recompute each centroid as the mean of the points currently assigned to its cluster.
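The pseudocode above can be turned into a short from-scratch implementation. The following NumPy sketch is illustrative: the function name, random initialization from existing data points, and Euclidean distance are assumptions consistent with the notes.

```python
# From-scratch K-Means following the pseudocode above (NumPy sketch).
import numpy as np

def kmeans(D, K, max_iter=100, seed=0):
    # D: (n, d) array of data points; K: number of clusters.
    rng = np.random.default_rng(seed)
    # 1. Choose K random data points as initial centroids.
    centroids = D[rng.choice(len(D), size=K, replace=False)]
    for _ in range(max_iter):
        # 2a. Allocate each point to the nearest centroid (Euclidean distance).
        dists = np.linalg.norm(D[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 2b. Recompute each centroid as the mean of its assigned points.
        new_centroids = np.array([
            D[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(K)
        ])
        # Convergence check: stop once the centroids have stabilized.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```

Calling this on the blob data from the earlier sketch, e.g. `kmeans(X, K=3)`, returns the final centroids and one cluster label per point.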
Practical notes on K-Means:
- The step-by-step flow above mirrors standard practice; convergence is typically fast on many datasets.
- The elbow method and silhouette method are used to help determine a reasonable K.
Determining the number of clusters (K) for K-Means:
- Two common methods:
- Elbow Method: Plot within-cluster sum of squares (WCSS) against K and look for a bend (elbow) where improvement slows.
- Silhouette Method: Use silhouette scores to assess how well each point lies within its cluster compared to neighboring clusters.
K-Means clustering workflow for choosing K (Elbow Method example):
- Step 1: Apply K-Means for several values of K (e.g., K = 1 to 10).
- Step 2: For each K, compute \mathrm{WCSS} = \sum_{d_i \in D} \mathrm{dist}(d_i, C_k)^2, where C_k is the centroid of the cluster to which d_i belongs.
- Step 3: Plot WCSS versus K.
- Step 4: The elbow point (bend) on the plot suggests the approximate number of clusters.
- Notation from transcript visuals: K1, K2, K3 are example cluster counts; the bend (elbow) in the plot suggests a suitable K (see the plotting sketch below).
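A minimal Elbow Method sketch, assuming the same kind of synthetic blob data as above and using scikit-learn's `inertia_` attribute as the WCSS value for each fitted model:

```python
# Elbow Method sketch: fit K-Means for K = 1..10 and plot WCSS (inertia) vs K.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

ks = range(1, 11)
wcss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, wcss, marker="o")
plt.xlabel("K (number of clusters)")
plt.ylabel("WCSS (inertia)")
plt.title("Elbow Method")
plt.show()
```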
Practical considerations about K selection and interpretation:
- The elbow point is not always unambiguous; multiple interpretations may exist.
- Silhouette scores provide a complementary perspective on cluster separation and compactness.
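For the complementary silhouette perspective, a small sketch using scikit-learn's `silhouette_score` (the score is only defined for K ≥ 2; the data and range of K are illustrative assumptions):

```python
# Silhouette Method sketch: a higher mean score suggests better-separated,
# more compact clusters; compare scores across candidate values of K.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
```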
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
- Purpose: Density-based clustering that can identify clusters of arbitrary shape and handle noise; does not require specifying the number of clusters upfront.
- Key parameters:
- ε (epsilon): Maximum distance between two points for them to be considered neighbors.
- MinPts: Minimum number of points required to form a dense region (including the point itself).
- Core points, border points, and noise points:
- Core point: A point with at least MinPts points (including itself) within its ε-neighborhood.
- Border point: A point within the ε-neighborhood of a core point but not a core point itself.
- Noise point: Neither a core point nor a border point.
- Cluster formation process:
- Start with an arbitrary unvisited point.
- Find all points within its ε-neighborhood and classify as core, border, or noise.
- If the selected point is a core point, start a new cluster and expand it by repeatedly adding the points that are density-reachable from it, i.e., the ε-neighbors of every core point reached along the way (see the sketch after this list).
- Repeat with another unvisited point until all points are visited.
- Cluster assignment and output:
- Border points are assigned to the cluster of a core point within whose ε-neighborhood they fall.
- Noise points remain unassigned.
- Output is a set of clusters plus identified noise points.
- Key properties and advantages:
- Identifies clusters of arbitrary shape.
- Groups points by local density rather than by distance to a global centroid.
- Does not require specifying the number of clusters beforehand.
- Robust to noise and outliers (to a degree).
- Limitations and considerations:
- Performance and results are sensitive to the choice of ε and MinPts, which may need tuning to match the data; a single ε also makes clusters of widely varying densities hard to capture (OPTICS handles this better).
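The cluster formation process described above can be sketched from scratch. The following NumPy implementation is illustrative (the names `dbscan`, `eps`, and `min_pts` are assumptions): it labels noise as -1 and expands clusters outward from core points.

```python
# From-scratch DBSCAN sketch following the formation process above.
import numpy as np

def dbscan(X, eps, min_pts):
    n = len(X)
    labels = np.full(n, -1)              # -1 = noise / not yet assigned
    visited = np.zeros(n, dtype=bool)
    # Precompute epsilon-neighborhoods (fine for small, toy-sized data).
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbors = [np.flatnonzero(dists[i] <= eps) for i in range(n)]

    cluster_id = 0
    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        if len(neighbors[i]) < min_pts:  # not a core point (may still become a border point)
            continue
        labels[i] = cluster_id           # start a new cluster from this core point
        seeds = list(neighbors[i])
        while seeds:
            j = seeds.pop()
            if not visited[j]:
                visited[j] = True
                if len(neighbors[j]) >= min_pts:
                    seeds.extend(neighbors[j])   # j is also a core point: grow the frontier
            if labels[j] == -1:
                labels[j] = cluster_id           # claim border / unassigned points
        cluster_id += 1
    return labels
```

With scikit-learn, the equivalent call looks like the following (ε and MinPts values are illustrative and generally need tuning); noise points are labelled -1:

```python
# DBSCAN with scikit-learn on non-spherical (two-moons) data.
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.07, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_                       # noise points are labeled -1
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters:", n_clusters, "noise points:", list(labels).count(-1))
```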
Other Clustering Methods (brief descriptions from transcript)
- Agglomerative Clustering: A hierarchical method starting with each data point as its own cluster and recursively merging clusters based on a linkage criterion (e.g., single linkage, complete linkage, average linkage).
- Mean Shift Clustering: A non-parametric clustering technique that shifts centroids toward areas of higher data density to discover clusters.
- BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies): A hierarchical clustering method designed for large datasets that incrementally builds a clustering hierarchy using a tree structure.
- Gaussian Mixture Models (GMM): A model-based clustering approach assuming data are generated from a mixture of Gaussian distributions; it estimates the parameters of the distributions to identify clusters.
- Fuzzy C-Means Clustering: A soft clustering algorithm where data points are assigned to clusters probabilistically, allowing multiple memberships with varying degrees.
- Spectral Clustering: A graph-based technique that uses the eigenvalues (and eigenvectors) of a similarity matrix to partition data.
- Affinity Propagation: A clustering method that identifies exemplars within the data and iteratively updates clusters via message passing.
- Self-Organizing Maps (SOM): An unsupervised learning algorithm projecting high-dimensional data onto a low-dimensional grid to cluster similar points.
- OPTICS (Ordering Points To Identify the Clustering Structure): Produces a hierarchical clustering structure and is more robust to varying densities within clusters.
- Practical note: The choice of method depends on data characteristics, the desired clustering structure, computational considerations, and interpretability of results.
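As a small illustration of two of the methods above (data and parameter choices are assumptions for a synthetic example): a Gaussian Mixture Model yields soft, probabilistic memberships via `predict_proba`, while agglomerative clustering returns hard labels after merging with a chosen linkage criterion.

```python
# Sketch: model-based (GMM) vs hierarchical (agglomerative) clustering.
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
print(gmm.predict_proba(X)[:3])          # degrees of membership per cluster

agg = AgglomerativeClustering(n_clusters=3, linkage="average")
print(agg.fit_predict(X)[:10])           # hard cluster labels
```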
Practical Takeaways and Connections
- Distance metrics matter: K-Means typically uses Euclidean distance; other metrics can be used depending on data and problem.
- Cluster shape assumptions:
- K-Means assumes roughly spherical (hyperellipsoidal) clusters of similar size and is not well suited to irregular, non-convex shapes.
- DBSCAN can detect irregularly shaped clusters, and OPTICS additionally copes better with varying densities (contrasted in the sketch after this list).
- Model-based vs partitioning vs density-based:
- GMM is model-based, providing probabilistic cluster memberships.
- DBSCAN/OPTICS are density-based, robust to noise and capable of discovering clusters of arbitrary shapes.
- Spectral and hierarchical methods offer different perspectives on clustering structure.
- Determining the number of clusters is a common challenge: K must be chosen for K-Means, whereas DBSCAN/OPTICS do not require it upfront.
- Practical implications:
- Initialisation and parameter choices can strongly influence results.
- Computational efficiency varies: K-Means scales well; DBSCAN/OPTICS can be slower on large datasets and require careful parameter tuning.
- Interpretability: Simpler methods like K-Means are easier to interpret but may be less flexible; more complex methods can capture nuanced structures but may be harder to explain.
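To illustrate the cluster-shape point from the takeaways above, here is a small comparison on two-moons data (an assumed synthetic example with illustrative ε and MinPts values): with typical settings, DBSCAN tends to recover the two crescents while K-Means with K = 2 cuts across them, which shows up in the adjusted Rand index against the generating labels.

```python
# K-Means vs DBSCAN on non-convex, crescent-shaped clusters.
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

print("K-Means ARI:", round(adjusted_rand_score(y_true, km_labels), 3))
print("DBSCAN  ARI:", round(adjusted_rand_score(y_true, db_labels), 3))
```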
Equations and Notation (as referenced in transcript)
Within-cluster sum of squares (WCSS) / inertia for K-Means:
- Inertia: \mathrm{Inertia} = \sum_{k=1}^{K} \sum_{x_i \in C_k} \| x_i - \mu_k \|^2, where \mu_k is the centroid of cluster C_k.
- Alternatively, in compact form: \mathrm{WCSS} = \sum_{d_i \in D} \mathrm{dist}(d_i, C_{\mathrm{cluster}(d_i)})^2, where C_k is the centroid of cluster k and \mathrm{cluster}(d_i) is the cluster containing d_i.
Step-wise K-Means updates rely on centroid computation and assignment based on distance to centroids.
For the Elbow Method: no explicit formula beyond WCSS; the method relies on plotting and identifying the elbow bend.
For DBSCAN: core concepts are defined via ε and MinPts, but no single formula is provided in the transcript beyond the distance-based neighborhood concept.
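As a check on the notation, the inertia/WCSS formula can be computed directly. This small NumPy sketch assumes `X`, `labels`, and `centroids` come from a fitted K-Means run; its result should match scikit-learn's `inertia_` for the same fit.

```python
# Direct computation of the inertia / WCSS formula above.
import numpy as np

def wcss(X, labels, centroids):
    # Sum over clusters k of ||x_i - mu_k||^2 for the points x_i assigned to cluster k.
    return sum(
        np.sum((X[labels == k] - mu) ** 2)
        for k, mu in enumerate(centroids)
    )
```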
Connections to Foundational Concepts
- Clustering vs classification: Clustering is unsupervised; labels are not provided a priori.
- Distance metrics underpin most clustering decisions and affect cluster boundaries.
- Dimensionality and density influence method performance; higher-dimensional spaces pose challenges for distance-based methods due to the curse of dimensionality.
Practical Tips for Exam Preparation
- Be able to describe each method’s core idea, typical use cases, and key advantages/disadvantages.
- Memorize the two primary K selection techniques and how to apply them (Elbow and Silhouette) with brief steps.
- Know the standard K-Means update rules and the role of centroids as cluster means.
- Understand DBSCAN core concepts (core/border/noise) and parameter meanings (ε, MinPts).
- Recognize the kinds of data shapes and density patterns each method can handle well or poorly.
- Be able to write and interpret the WCSS formula and explain the purpose of plotting WCSS versus K in the Elbow Method.