K-means and Clustering

Clustering: Overview

  • Basic methods covered: K-means, Hierarchical clustering

  • Extensions and related topics mentioned: K-means++, optimization problem (non-examinable), applications such as image clustering and segmentation

  • Source context: Lecture slides from Lin Guosheng, Nanyang Technological University

K-means: Core ideas

  • Goal: Partition data into K clusters by minimizing within-cluster variability

  • Key components: centroids (cluster centers) and assignment of points to the nearest centroid

  • Common distance measure: Euclidean distance (L2)

  • Other distance options mentioned: L1 distance, Cosine distance; there are additional distance functions beyond these

  • Visualization context: Examples include image clustering and segmentation tasks

Distances and distance-based assignment

  • Euclidean (L2) distance between x and y: d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}

  • Squared Euclidean distance (often used in K-means to avoid the square root): d^2(x, y) = \sum_{i=1}^{n} (x_i - y_i)^2

  • Important note: In exams, squared L2 distance is typically sufficient for comparisons (often no need to take the square root)

  • Distances are used for assignment: each point is assigned to the nearest centroid by comparing distances
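
The assignment rule above can be sketched in a few lines. This is a minimal illustration, not a library API; the helper name `nearest_centroid` is my own.

```python
import numpy as np

def nearest_centroid(x, centroids):
    """Return the index of the centroid closest to x.

    Squared L2 distance is used: the square root is omitted because it is
    monotonic and does not change which centroid is nearest.
    """
    d2 = np.sum((centroids - x) ** 2, axis=1)  # squared L2 to each centroid
    return int(np.argmin(d2))

# Example: two centroids; the point (0.9, 0.1) is closer to (1, 0).
centroids = np.array([[1.0, 0.0], [-1.0, 1.0]])
print(nearest_centroid(np.array([0.9, 0.1]), centroids))  # → 0
```

Skipping the square root is exactly the exam shortcut noted above: the ordering of distances is preserved, so the argmin is unchanged.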

K-means algorithm: step-by-step

  • Step 1: Specify the number of clusters to form, k

  • Step 2: Randomly initialize k centroids

  • Repeat until convergence:

    • Assignment step: assign each point to the nearest centroid

    • For a data point x_s, find i^* = \arg\min_i d^2(x_s, c_i)

    • Center update step: recompute each centroid as the mean of all points assigned to its cluster

    • Centroid update: \mu_k = \frac{1}{|C_k|} \sum_{x \in C_k} x

  • Convergence criterion: stop when centroid positions do not change (or assignments stabilize)

  • Quick note: This process is Lloyd’s algorithm; exact global optimization is hard, but the method empirically converges in a few iterations
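
The steps above can be condensed into a short sketch of Lloyd's algorithm. This is a teaching implementation under stated assumptions (plain random initialization, numpy only), not a production routine; the function name `kmeans` is my own.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal K-means (Lloyd's algorithm).

    X: (n, d) array of points; k: number of clusters.
    Returns (centroids, labels). Initialization is plain random sampling of
    k data points; K-means++ (discussed below) would be an improvement.
    """
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: each point to its nearest centroid (squared L2).
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points.
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):  # convergence: centroids stopped moving
            break
        centroids = new
    return centroids, labels

# Two well-separated blobs should be recovered as the two clusters.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
centroids, labels = kmeans(X, k=2)
```

The empty-cluster guard (keep the old centroid if no points were assigned) is one common convention; reseeding from a far-away point is another.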

Simple worked example (2 clusters, 6 data points)

  • Initial data setup (2 clusters, k=2) and initial centroids selected from data points

  • Iteration 1:

    • Compute squared L2 distances from each point to each centroid

    • Assign points to nearest centroid (Cluster 0 or Cluster 1)

    • Update centroids by taking the mean of points in each cluster

  • Iteration 2 and beyond:

    • Recompute assignments and centroids

  • Convergence:

    • In the provided example, convergence is reached after a few iterations (the slide's fifth assignment step only confirms convergence; in practice four iterations suffice)

  • Example results (from slides):

    • After iteration 1, centroid updates yield new centroids E = [1.5, 0] and F = [-0.5, 1]

    • Assignment step examples: Cluster 1 includes A and D; Cluster 2 includes B and C

  • Distances used in example: squared L2 values shown in a table (A, B, C, D, E, F, with distances to centroids)

Key formulas from the example

  • Assignment criterion for a data point x and centroids c_i: i^* = \arg\min_i \|x - c_i\|^2

  • Centroid update after assignment to cluster C_k: \mu_k = \frac{1}{|C_k|} \sum_{x \in C_k} x

  • In the 2-cluster example, after first update:

    • E = [1.5, 0], F = [-0.5, 1]

  • The squared L2 distances tabulated for illustration help explain how assignments change between iterations
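
One assignment step and one centroid update can be worked through explicitly. The points below are hypothetical (the slide's full dataset is not reproduced in these notes) but are chosen so that the first update lands on the slide's centroids E = [1.5, 0] and F = [-0.5, 1].

```python
import numpy as np

# Hypothetical 2-D points; two initial centroids taken from the data itself.
points = np.array([[1.0, 0.0], [2.0, 0.0], [-1.0, 1.0], [0.0, 1.0]])
c = np.array([points[0], points[2]])  # initial centroids

# Assignment: squared L2 distance to each centroid, pick the smaller one.
d2 = ((points[:, None, :] - c[None, :, :]) ** 2).sum(axis=2)
labels = d2.argmin(axis=1)            # → array([0, 0, 1, 1])

# Centroid update: mean of the points assigned to each cluster.
c_new = np.array([points[labels == j].mean(axis=0) for j in range(2)])
# c_new == [[1.5, 0.0], [-0.5, 1.0]]
```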

The Elbow method for choosing k

  • Idea: Evaluate clustering performance (average distance to centroid) for different k values

  • Procedure:

    • For each k, compute the average distance to centroids across all data points

    • Plot average distance to centroid vs k; look for an "elbow" where the rate of decrease sharply changes

  • Practical interpretation:

    • A small k yields large reductions in average distance as k increases; after a point, increases in k yield diminishing returns

    • The elbow point suggests a reasonable trade-off between compactness and model complexity

  • Example metric described in slides: for k=2, the average distance to centroids was computed as 2.25 in the specific dataset using squared L2 distances
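
The elbow procedure can be sketched end to end. This is an illustrative version under my own assumptions (a compact numpy K-means, synthetic data with three true clusters); the helper names `kmeans_fit` and `avg_dist` are not from the slides.

```python
import numpy as np

def kmeans_fit(X, k, iters=50, seed=0):
    # Compact Lloyd's algorithm with random init, used to drive the elbow curve.
    rng = np.random.default_rng(seed)
    c = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = ((X[:, None] - c[None]) ** 2).sum(-1).argmin(1)
        c = np.array([X[labels == j].mean(0) if (labels == j).any() else c[j]
                      for j in range(k)])
    labels = ((X[:, None] - c[None]) ** 2).sum(-1).argmin(1)
    return c, labels

def avg_dist(X, c, labels):
    # Elbow metric: average squared L2 distance of each point to its centroid.
    return float(((X - c[labels]) ** 2).sum(1).mean())

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.2, size=(20, 2)) for m in (0.0, 4.0, 8.0)])
scores = {k: avg_dist(X, *kmeans_fit(X, k)) for k in range(1, 6)}
# A plot of scores[k] vs k should bend sharply (the "elbow") near the true k = 3.
```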

K-means++: smarter initialization

  • Motivation: Random initialization of centroids can lead to poor clustering due to bad starting points

  • K-means++ is a standard approach that improves initialization by spreading centroids apart

  • High-level idea: choose initial centroids sequentially to maximize their separation

  • Simple (simplified) version:
    1) Choose the first centroid randomly from the data
    2) For each candidate x, compute D(x) = distance to the nearest already-chosen centroid
    3) Select the next centroid as the data point with the largest D(x)
    4) Repeat steps 2 and 3 until k centroids are chosen

  • Full (original) version:
    1) For each candidate x, compute D(x) as the distance to the nearest existing centroid
    2) Compute a probability p(x) proportional to D(x)^2 and sample the next centroid from the data points according to p(x)
    3) Repeat until k centroids are chosen

  • D(x) is a score indicating how far x is from existing centroids; larger D(x) means a better candidate for a new centroid
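
Both variants can be sketched in one routine. This is a simplified illustration (the function name `kmeanspp_init` and the `greedy` flag are my own); the original K-means++ paper also squares D(x) before sampling, which is what the `p(x)` line below does.

```python
import numpy as np

def kmeanspp_init(X, k, seed=0, greedy=False):
    """K-means++ style initialization.

    greedy=True:  simplified variant -- always pick the farthest point.
    greedy=False: original variant  -- sample with probability p(x) proportional
                  to D(x)^2.
    """
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]               # first centroid: random
    while len(centroids) < k:
        # D(x)^2: squared distance to the nearest already-chosen centroid.
        d2 = ((X[:, None] - np.array(centroids)[None]) ** 2).sum(-1).min(1)
        if greedy:
            centroids.append(X[int(d2.argmax())])       # farthest point wins
        else:
            centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
init = kmeanspp_init(X, k=2, greedy=True)  # the two centroids span both blobs
```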

  • References and further reading provided in slides

Practical notes on distance choices and metrics

  • K-means commonly uses Euclidean distance (L2 distance) as the default

  • Other distance options include:

    • L1 distance (Manhattan)

    • Cosine distance

    • Other distance functions exist; the choice affects cluster shapes and performance

  • The squared L2 distance is often used in practice because it simplifies computations and preserves order for assignment
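
The three distance options compare as follows on a small example; this is a quick sketch of the definitions, not a recommendation of one over another.

```python
import numpy as np

def l2_sq(x, y):
    return float(((x - y) ** 2).sum())       # squared Euclidean (L2)

def l1(x, y):
    return float(np.abs(x - y).sum())        # Manhattan (L1)

def cosine_dist(x, y):
    # 1 - cosine similarity; sensitive to direction, not magnitude.
    return float(1 - x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

x, y = np.array([1.0, 0.0]), np.array([0.0, 1.0])
# l2_sq(x, y) == 2.0, l1(x, y) == 2.0, cosine_dist(x, y) == 1.0
```

Note that with cosine distance the "mean of the cluster" update is no longer the natural minimizer; variants such as spherical K-means normalize points and centroids, which is one reason the choice of distance affects cluster shapes.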

Pros and cons of K-means

  • Pros:

    • Simple and fast; easy to implement

  • Cons:

    • Requires specifying K (the number of clusters)

    • Sensitive to outliers

    • Assumes clusters are spherical (or convex) and roughly equal in size; performs poorly on data with non-spherical shapes

    • Sensitive to initial centroid placement (initialization matters)

Limitations and visualization of failure cases

  • K-means struggles with:

    • Non-spherical (arbitrary) cluster shapes

    • Data with significant outliers

    • Highly imbalanced cluster sizes

  • Failure case insights shown in slides:

    • Non-spherical shapes can lead to poor clustering results or fragmentation of natural clusters

    • Bad initialization can yield suboptimal clusterings

    • Many outliers can distort cluster boundaries and centroid positions

K-means vs. other clustering approaches (context)

  • K-means is contrasted with hierarchical clustering in basic clustering workflows

  • Hierarchical clustering provides a different paradigm (not elaborated in detail in these slides), often used for dendrogram-based cluster discovery

Clustroid: a variant of K-means

  • Concept: replace the centroid with a clustroid

  • Clustroid definition: the data point within a cluster that minimizes the average distance to all other points in the same cluster

    • Clustroid is an actual data point (not an artificial mean point like the centroid)

  • Comparison:

    • Centroid is the mean of cluster points (may not correspond to an actual data point)

    • Clustroid is an existing point in the cluster that best represents the cluster

  • Use case: when it is desirable to have a representative, real data point rather than a synthetic average
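
A clustroid can be found by scanning the pairwise distances within the cluster. This is a small sketch of the definition above; the function name `clustroid` is my own.

```python
import numpy as np

def clustroid(cluster):
    """The cluster member with the smallest average L2 distance to the
    other members -- an actual data point, unlike the synthetic centroid."""
    d = np.sqrt(((cluster[:, None] - cluster[None]) ** 2).sum(-1))  # pairwise L2
    return cluster[int(d.mean(axis=1).argmin())]

pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [10.0, 10.0]])
# The centroid (mean) of pts is (2.75, 3.0), pulled toward the outlier at
# (10, 10); the clustroid remains an existing point near the bulk of the data.
```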

Clustering optimization perspective (Lloyd’s algorithm)

  • Objective is to minimize within-cluster variance (sum of squared distances to centroids)

  • Original optimization formulation (informal):

    • Given data X, cluster centers C = {c1,…,cK}, and assignments that map each x to a cluster, minimize the total within-cluster squared distance

  • Core components (as per slides):

    • Initialization: choose initial centers C1,…,CK

    • Assignment step: assign each x to the closest center (minimize distance to centers)

    • Centroid update step: update centers to be the mean of the assigned points within each cluster

    • Repeat until convergence

  • Equation of the optimization objective (formal form):

    • \min_{\{C_k\}, \{\mu_k\}} \sum_{k=1}^{K} \sum_{x \in C_k} \|x - \mu_k\|^2

    • Subject to: cluster assignments and centroid locations as defined by the steps above

  • Algorithmic interpretation: this framework is often called Lloyd’s algorithm
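
The objective is easy to evaluate directly, which is useful for checking that each Lloyd iteration never increases it. A minimal sketch (the function name `wcss` is my own):

```python
import numpy as np

def wcss(X, labels, centroids):
    """K-means objective J: within-cluster sum of squared distances,
    J = sum_k sum_{x in C_k} ||x - mu_k||^2."""
    return float(((X - centroids[labels]) ** 2).sum())

X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0]])
labels = np.array([0, 0, 1, 1])
mu = np.array([X[labels == j].mean(axis=0) for j in range(2)])  # [[1,0],[11,0]]
# wcss(X, labels, mu) == 1 + 1 + 1 + 1 == 4.0
```

Both steps of Lloyd's algorithm can only decrease (or leave unchanged) this J: the assignment step picks the nearest centroid for each point, and the mean minimizes the squared distances within each cluster, which is why the algorithm converges to a local optimum.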

Practical implementation notes

  • Hardness of global optimization: Finding the global optimum of the clustering objective is generally intractable

  • Empirical behavior: K-means usually converges after a few iterations in practice

  • Mini-batch K-means for large-scale data: process small random subsets (mini-batches) to update centroids gradually via stochastic optimization

  • Sklearn implementation reference for MiniBatchKMeans and related online tools/methods
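
The mini-batch update can be sketched in pure numpy (cf. sklearn.cluster.MiniBatchKMeans for the production implementation). This is an illustrative version under my own assumptions; the function name and the exact learning-rate schedule are one common choice, not the only one.

```python
import numpy as np

def minibatch_kmeans(X, k, batch_size=32, steps=200, seed=0):
    """Mini-batch K-means sketch.

    Each step assigns one random mini-batch to the current centroids, then
    moves each centroid toward its assigned points with a per-centroid
    learning rate 1 / (number of updates that centroid has received),
    i.e. a running mean over everything ever assigned to it."""
    rng = np.random.default_rng(seed)
    c = X[rng.choice(len(X), k, replace=False)].astype(float)
    counts = np.zeros(k)
    for _ in range(steps):
        batch = X[rng.choice(len(X), batch_size)]
        labels = ((batch[:, None] - c[None]) ** 2).sum(-1).argmin(1)
        for x, j in zip(batch, labels):
            counts[j] += 1
            c[j] += (x - c[j]) / counts[j]   # running-mean style update
    return c

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, (100, 2)), rng.normal(8.0, 0.1, (100, 2))])
c = minibatch_kmeans(X, k=2)  # should land near the two blob centers
```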

K-means in image segmentation and superpixels

  • Concept: treat image pixels as data points in feature space (e.g., intensity, color, or texture features)

  • Goal: segment the image by clustering pixels into a small number of groups

  • SLIC (simple linear iterative clustering) is a widely used superpixel algorithm related to segmentation via clustering

  • Segmentation as clustering: k is the number of segments (e.g., k = 2 for binary segmentation; k = 3 for a three-way split)

  • Practical takeaway: clustering-based approaches can be used to group pixels into superpixels that align with perceptual boundaries

  • References to SLIC: EPFL and related resources cited in slides
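
Segmentation-as-clustering can be demonstrated on a synthetic image: each pixel's RGB value becomes a point, and a k=2 clustering of those points yields a binary label map. This is a toy sketch (not SLIC, which additionally uses spatial coordinates and a local search); the data and initialization here are my own.

```python
import numpy as np

# Hypothetical tiny "image": left half dark, right half bright, plus noise.
rng = np.random.default_rng(0)
img = np.zeros((8, 8, 3))
img[:, 4:] = 1.0
img += rng.normal(0, 0.05, img.shape)

# Treat each pixel's RGB value as a data point and run k=2 K-means on them.
X = img.reshape(-1, 3)
c = X[[0, -1]].copy()                      # init: one pixel from each region
for _ in range(10):
    labels = ((X[:, None] - c[None]) ** 2).sum(-1).argmin(1)
    c = np.array([X[labels == j].mean(0) for j in range(2)])

seg = labels.reshape(8, 8)   # label map: a binary segmentation of the image
```

Appending the (x, y) position of each pixel to its feature vector, with a weight trading off color against location, is the step that takes this toy toward SLIC-style superpixels.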

Summary of practical guidance for exams and applications

  • Always check whether to use squared L2 distance for efficiency and equivalence in ordering during assignment

  • Start with a reasonable k using the elbow method and/or domain knowledge; use K-means++ for better initialization

  • Be aware of the limitations: outliers, non-spherical shapes, and initialization sensitivity

  • For large-scale data, consider Mini-batch K-means as a scalable variant

  • When representing a cluster with a real data point, consider clustroid as an alternative to centroid

  • For image segmentation tasks, clustering ideas extend to segmentation techniques like SLIC to generate superpixels

Quick reference formulas

  • Squared L2 distance: d^2(x, y) = \sum_{i=1}^{n} (x_i - y_i)^2

  • Euclidean distance: d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}

  • Centroid update (mean): \mu_k = \frac{1}{|C_k|} \sum_{x \in C_k} x

  • K-means objective (within-cluster sum of squares): J = \sum_{k=1}^{K} \sum_{x \in C_k} \|x - \mu_k\|^2

  • Assignment rule: for a data point x, assign to cluster i^* = \arg\min_i \|x - c_i\|^2

  • K-means++ ideas:

    • D(x) = distance to nearest existing centroid

    • New centroid selection favors points with large D(x) values

    • Probability version: p(x) proportional to D(x)^2 for selecting the next centroid

References and additional resources mentioned in the slides

  • K-means++ original description and implementations

  • Stanford CS131 notes on the optimization view and Lloyd's algorithm

  • scikit-learn documentation for MiniBatchKMeans and clustering examples

  • Image segmentation references: SLIC superpixels, image clustering tutorials

  • Elbow method and clustering evaluation resources

End of notes