
Clustering Algorithms - Key Vocabulary

Clustering Methods Overview

  • Clustering methods have different characteristics, advantages, and limitations. Major methods include:
    • K-Means Clustering: Partitioning algorithm that divides data into K non-overlapping clusters based on similarity; objective is to minimize the within-cluster sum of squares.
    • Hierarchical Clustering: Builds a hierarchy of clusters by recursively merging or splitting clusters; can be agglomerative (bottom-up) or divisive (top-down).
    • DBSCAN (Density-Based Spatial Clustering of Applications with Noise): Density-based approach that groups together points in dense regions without requiring a pre-specified number of clusters; robust to noise.
    • OPTICS (Ordering Points To Identify the Clustering Structure): Extension of DBSCAN that produces a hierarchical clustering structure and handles varying densities better.
    • Mean Shift Clustering: Non-parametric method that shifts centroids toward areas of higher data density to identify clusters.
    • Agglomerative Clustering: A hierarchical method that starts with individual points and repeatedly merges clusters using a linkage criterion (e.g., single, complete, average).
    • BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies): Designed for large datasets; incrementally builds a clustering hierarchy using a tree structure.
    • Gaussian Mixture Models (GMM): Model-based clustering assuming data come from a mixture of Gaussian distributions; parameters are estimated to identify clusters.
    • Fuzzy C-Means Clustering: Soft clustering where points have degrees of membership to multiple clusters.
    • Spectral Clustering: Graph-based method using eigenvalues of a similarity matrix to partition data.
    • Affinity Propagation: Identifies exemplars (representative points) by message passing between data points.
    • Self-Organizing Maps (SOM): Unsupervised learning that projects high-dimensional data onto a low-dimensional grid and clusters similar points.
  • These are examples; many variations exist. Choice depends on data nature, desired clustering structure, computation, and interpretability.

K-Means Clustering

  • Overview: Popular unsupervised algorithm for clustering similar data points into K groups; aims to minimize the within-cluster sum of squares (inertia).

  • Distance metric: Typically uses Euclidean distance, though other metrics can be used.

  • Key objective: Minimize inertia, the sum of squared distances of samples to their assigned cluster centroids.

  • Common reasons for use: Simple, scalable, widely understood.

  • Key concepts:

    • Centroids: Each cluster has a centroid representing its center; for a cluster C, centroid μ is the mean of all points in C.
    • Assignment based on proximity: Each point is assigned to the nearest centroid.
    • Initialisation sensitivity: Outcomes depend on initial centroid positions; different runs can yield different clusters.
    • Hyperparameter K: The number of clusters must be specified in advance.
    • Mitigation for sensitivity: Run the algorithm multiple times with different initializations and choose the clustering with the lowest inertia.
  • Outputs after convergence:

    • Final cluster centroids representing cluster centers.
    • Each data point assigned to one of the clusters.
  • Key points:

    • Inertia (within-cluster sum of squares) is minimized: \mathrm{Inertia} = \sum_{k=1}^{K} \sum_{x_i \in C_k} \| x_i - \mu_k \|^2.
    • Initialisation sensitivity can lead to different results; multiple restarts help (see the brief sketch after this list).
    • K is a hyperparameter that needs to be chosen prior to running the algorithm.
    • K-Means is simple and scalable, which contributes to its popularity.
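A minimal scikit-learn sketch of these points, assuming a small synthetic 2-D dataset; the toy data, the choice of K = 3, and all parameter values are illustrative, not prescriptive.

```python
# Minimal sketch: fit K-Means with multiple random restarts and inspect the
# inertia (within-cluster sum of squares). The toy data below is assumed.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),   # three loose 2-D blobs
    rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(50, 2)),
])

# n_init=10 reruns the algorithm with different initial centroids and keeps
# the run with the lowest inertia (the mitigation described above).
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

print("Inertia (WCSS):", km.inertia_)
print("Centroids:\n", km.cluster_centers_)
print("First few labels:", km.labels_[:10])
```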
  • K-Means algorithm (steps):

    1. Initialization:
    • Choose the number of clusters K.
    • Randomly initialize K cluster centroids in the data space.
    2. Assignment Step:
    • For each data point, compute the distance to each centroid.
    • Assign the point to the nearest centroid.
    • Note: Euclidean distance is the typical metric; other metrics can be used.
    3. Update Step:
    • Recompute the centroid of each cluster as the mean of all points assigned to that cluster.
    4. Convergence Check:
    • Check if centroids have stabilized (no significant change) or a maximum number of iterations is reached.
    • If stabilized, stop; otherwise, repeat steps 2 and 3.
    5. Output:
    • Final centroids represent cluster centers.
    • Each data point has an assigned cluster.
  • K-Means pseudocode (from transcript; a runnable sketch follows the pseudocode):

    • Input: K (number of clusters), D (list of data points)
    • 1. Choose K random data points as initial centroids.
    • 2. Repeat until centroids stabilize:
    • a. Allocate each point in D to the nearest of the K centroids.
    • b. Recompute the centroid of each cluster as the mean of all points assigned to it.
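A from-scratch sketch of the pseudocode above (Lloyd's algorithm), assuming Euclidean distance and NumPy arrays; the function name and defaults are illustrative, and the code is not optimized for large datasets.

```python
import numpy as np

def kmeans(D, K, max_iters=100, seed=0):
    """Sketch of the pseudocode above: Lloyd's algorithm with Euclidean distance."""
    rng = np.random.default_rng(seed)
    D = np.asarray(D, dtype=float)
    # 1. Choose K random data points as initial centroids.
    centroids = D[rng.choice(len(D), size=K, replace=False)]
    for _ in range(max_iters):
        # 2a. Allocate each point in D to the nearest of the K centroids.
        dists = np.linalg.norm(D[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 2b. Recompute each centroid as the mean of its assigned points
        #     (keep the old centroid if a cluster ends up empty).
        new_centroids = np.array([
            D[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(K)
        ])
        # Stop once centroids stabilize.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```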
  • Practical notes on K-Means:

    • The step-by-step flow mirrors standard practice; convergence is typically fast on many datasets.
    • The elbow method and silhouette method are used to help determine a reasonable K.
  • Determining the number of clusters (K) for K-Means:

    • Two common methods:
    • Elbow Method: Plot within-cluster sum of squares (WCSS) against K and look for a bend (elbow) where improvement slows.
    • Silhouette Method: Use silhouette scores to assess how well each point lies within its cluster compared to neighboring clusters.
  • K-Means clustering workflow for choosing K (Elbow Method example; a code sketch follows these steps):

    • Step 1: Apply K-Means for several values of K (e.g., K = 1 to 10).
    • Step 2: For each K, compute WCSS: \mathrm{WCSS} = \sum_{d_i \in D} \mathrm{dist}(d_i, C_k)^2, where C_k is the centroid of the cluster to which d_i belongs.
    • Step 3: Plot WCSS versus K.
    • Step 4: The elbow point (bend) on the plot suggests the approximate number of clusters.
    • Notation from transcript visuals: K1, K2, K3 are example cluster counts; the elbow location suggests a suitable K.
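A brief sketch of the elbow-method steps above, assuming scikit-learn and matplotlib and reusing the X array from the earlier K-Means sketch; the range K = 1..10 is an illustrative choice.

```python
# Elbow method: compute WCSS (scikit-learn's inertia_) for K = 1..10 and plot it.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

ks = range(1, 11)
wcss = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(km.inertia_)  # within-cluster sum of squares for this K

plt.plot(list(ks), wcss, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("WCSS (inertia)")
plt.title("Elbow method: look for the bend")
plt.show()
```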
  • Practical considerations about K selection and interpretation:

    • The elbow point is not always unambiguous; multiple interpretations may exist.
    • Silhouette scores provide a complementary perspective on cluster separation and compactness (a brief example follows below).
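A minimal silhouette sketch under the same assumptions as the earlier code (scikit-learn, with X as the data array); note that silhouette scores require at least two clusters.

```python
# Compare average silhouette scores for several K; higher values suggest
# better-separated, more compact clusters.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(f"K={k}: average silhouette = {silhouette_score(X, labels):.3f}")
```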

DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

  • Purpose: Density-based clustering that can identify clusters of arbitrary shape and handle noise; does not require specifying the number of clusters upfront.
  • Key parameters:
    • ε (epsilon): Maximum distance between two points for them to be considered neighbors.
    • MinPts: Minimum number of points required to form a dense region (including the point itself).
  • Core points, border points, and noise points:
    • Core point: A point with at least MinPts points (including itself) within its ε-neighborhood.
    • Border point: A point within the ε-neighborhood of a core point but not a core point itself.
    • Noise point: Neither a core point nor a border point.
  • Cluster formation process:
    • Start with an arbitrary unvisited point.
    • Find all points within its ε-neighborhood and classify as core, border, or noise.
    • If the selected point is a core point, grow the cluster by adding all points in its ε-neighborhood; any of those that are themselves core points are expanded in the same way (density-reachability).
    • Repeat with another unvisited point until all points are visited.
  • Cluster assignment and output:
    • Border points are assigned to the cluster of a core point whose ε-neighborhood contains them.
    • Noise points remain unassigned.
    • Output is a set of clusters plus identified noise points.
  • Key properties and advantages:
    • Identifies clusters of arbitrary shape.
    • Separates dense regions from sparse noise well; note that a single global ε makes clusters of widely varying densities harder to capture (OPTICS handles this better).
    • Does not require specifying the number of clusters beforehand.
    • Robust to noise and outliers (to a degree).
  • Limitations and considerations:
    • Performance and results are sensitive to the choice of ε and MinPts; parameters may need tuning to match data characteristics (see the usage sketch below).
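A short scikit-learn usage sketch of DBSCAN, reusing the X array assumed in the earlier K-Means code; eps and min_samples correspond to ε and MinPts, and the values shown are illustrative rather than recommended.

```python
# DBSCAN: density-based clustering; points labeled -1 are treated as noise.
import numpy as np
from sklearn.cluster import DBSCAN

db = DBSCAN(eps=0.5, min_samples=5).fit(X)
labels = db.labels_  # cluster id per point, -1 for noise

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(f"Clusters found: {n_clusters}, noise points: {n_noise}")
print("Core sample indices (first few):", db.core_sample_indices_[:10])
```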

Other Clustering Methods (brief descriptions from transcript)

  • Agglomerative Clustering: A hierarchical method starting with each data point as its own cluster and recursively merging clusters based on a linkage criterion (e.g., single linkage, complete linkage, average linkage).
  • Mean Shift Clustering: A non-parametric clustering technique that shifts centroids toward areas of higher data density to discover clusters.
  • BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies): A hierarchical clustering method designed for large datasets that incrementally builds a clustering hierarchy using a tree structure.
  • Gaussian Mixture Models (GMM): A model-based clustering approach assuming data are generated from a mixture of Gaussian distributions; it estimates the parameters of the distributions to identify clusters.
  • Fuzzy C-Means Clustering: A soft clustering algorithm where data points are assigned to clusters probabilistically, allowing multiple memberships with varying degrees.
  • Spectral Clustering: A graph-based technique that uses the eigenvalues (and eigenvectors) of a similarity matrix to partition data.
  • Affinity Propagation: A clustering method that identifies exemplars within the data and iteratively updates clusters via message passing.
  • Self-Organizing Maps (SOM): An unsupervised learning algorithm projecting high-dimensional data onto a low-dimensional grid to cluster similar points.
  • OPTICS (Ordering Points To Identify the Clustering Structure): Produces a hierarchical clustering structure and is more robust to varying densities within clusters.
  • Practical note: The choice of method depends on data characteristics, the desired clustering structure, computational considerations, and interpretability of results (constructor examples for several of these methods are sketched below).
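A quick constructor sketch for several of the methods above, assuming scikit-learn and the X array from the earlier sketches; the parameter values are illustrative assumptions, not recommendations.

```python
from sklearn.cluster import AgglomerativeClustering, MeanShift, Birch, SpectralClustering
from sklearn.mixture import GaussianMixture

agglo = AgglomerativeClustering(n_clusters=3, linkage="average")  # hierarchical, bottom-up
meanshift = MeanShift()                 # bandwidth estimated from the data if not given
birch = Birch(n_clusters=3)             # tree-based, suited to large datasets
gmm = GaussianMixture(n_components=3)   # model-based; soft memberships via predict_proba
spectral = SpectralClustering(n_clusters=3, affinity="rbf")  # graph/eigenvector based

hard_labels = agglo.fit_predict(X)              # one cluster id per point
soft_memberships = gmm.fit(X).predict_proba(X)  # per-point membership probabilities
```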

Practical Takeaways and Connections

  • Distance metrics matter: K-Means typically uses Euclidean distance; other metrics can be used depending on data and problem.
  • Cluster shape assumptions:
    • K-Means assumes roughly spherical, similarly sized clusters; it is not well suited to elongated or irregularly shaped clusters.
    • DBSCAN and OPTICS can detect irregularly shaped clusters; OPTICS additionally copes better with varying densities.
  • Model-based vs partitioning vs density-based:
    • GMM is model-based, providing probabilistic cluster memberships.
    • DBSCAN/OPTICS are density-based, robust to noise and capable of discovering clusters of arbitrary shapes.
    • Spectral and hierarchical methods offer different perspectives on clustering structure.
  • Determining the number of clusters is a common challenge (K must be chosen in advance for K-Means; DBSCAN and OPTICS do not require it).
  • Practical implications:
    • Initialisation and parameter choices can strongly influence results.
    • Computational efficiency varies: K-Means scales well; DBSCAN/OPTICS can be slower on large datasets and require careful parameter tuning.
    • Interpretability: Simpler methods like K-Means are easier to interpret but may be less flexible; more complex methods can capture nuanced structures but may be harder to explain.

Equations and Notation (as referenced in transcript)

  • Within-cluster sum of squares (WCSS) / inertia for K-Means:

    • Inertia: \mathrm{Inertia} = \sum_{k=1}^{K} \sum_{x_i \in C_k} \| x_i - \mu_k \|^2
    • Alternatively, in a compact form: \mathrm{WCSS} = \sum_{d_i \in D} \mathrm{dist}(d_i, C_{\mathrm{cluster}(d_i)})^2, where \mathrm{cluster}(d_i) is the cluster containing d_i and C_{\mathrm{cluster}(d_i)} is its centroid (a short NumPy evaluation follows this list).
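A direct NumPy evaluation of the WCSS/inertia formula above, assuming the points, integer cluster labels, and per-cluster centroids are already available (for example, from a fitted K-Means model).

```python
import numpy as np

def wcss(X, labels, centroids):
    """Sum of squared distances from each point to the centroid of its cluster."""
    # centroids[labels] picks the centroid of the cluster each point belongs to.
    return float(np.sum(np.linalg.norm(X - centroids[labels], axis=1) ** 2))
```

For a fitted scikit-learn KMeans model km, wcss(X, km.labels_, km.cluster_centers_) should agree with km.inertia_ up to floating-point error.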
  • Step-wise K-Means updates rely on centroid computation and assignment based on distance to centroids.

  • For the Elbow Method: no explicit formula beyond WCSS; the method relies on plotting and identifying the elbow bend.

  • For DBSCAN: core concepts are defined via ε and MinPts, but no single formula is provided in the transcript beyond the distance-based neighborhood concept.

Connections to Foundational Concepts

  • Clustering vs classification: Clustering is unsupervised; labels are not provided a priori.
  • Distance metrics underpin most clustering decisions and affect cluster boundaries.
  • Dimensionality and density influence method performance; higher-dimensional spaces pose challenges for distance-based methods due to the curse of dimensionality.

Practical Tips for Exam Preparation

  • Be able to describe each method’s core idea, typical use cases, and key advantages/disadvantages.
  • Memorize the two primary K selection techniques and how to apply them (Elbow and Silhouette) with brief steps.
  • Know the standard K-Means update rules and the role of centroids as cluster means.
  • Understand DBSCAN core concepts (core/border/noise) and parameter meanings (ε, MinPts).
  • Recognize the kinds of data shapes and density patterns each method can handle well or poorly.
  • Be able to write and interpret the WCSS formula and explain the purpose of plotting WCSS versus K in the Elbow Method.