K-means and Clustering
Clustering: Overview
Basic methods covered: K-means, Hierarchical clustering
Extensions and related topics mentioned: K-means++, optimization problem (non-examinable), applications such as image clustering and segmentation
Source context: Lecture slides from Lin Guosheng, Nanyang Technological University
K-means: Core ideas
Goal: Partition data into K clusters by minimizing within-cluster variability
Key components: centroids (cluster centers) and assignment of points to the nearest centroid
Common distance measure: Euclidean distance (L2)
Other distance options mentioned: L1 distance, Cosine distance; there are additional distance functions beyond these
Visualization context: Examples include image clustering and segmentation tasks
Distances and distance-based assignment
Euclidean (L2) distance between x and y: d(x, y) = \sqrt{\sum_{i=1}^n (x_i - y_i)^2}
Squared Euclidean distance (often used in K-means to avoid the square root): d^2(x, y) = \sum_{i=1}^n (x_i - y_i)^2
Important note: In exams, squared L2 distance is typically sufficient for comparisons (often no need to take the square root)
Distances are used for assignment: each point is assigned to the nearest centroid by comparing distances
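The distance-based assignment above can be sketched in numpy; the points and centroids below are made-up illustration data, and the key point is that comparing squared distances gives the same nearest centroid as comparing true distances:

```python
import numpy as np

# Illustrative data points and centroids (made up for this sketch).
points = np.array([[1.0, 0.0], [2.0, 1.0], [-1.0, 1.0]])
centroids = np.array([[1.5, 0.0], [-0.5, 1.0]])

# Squared L2 distances from every point to every centroid: shape (n_points, k).
sq_dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)

# argmin over squared distances equals argmin over true distances,
# because sqrt is monotonically increasing -- no square root needed.
assignments = sq_dists.argmin(axis=1)
```

Here the first two points fall to the first centroid and the third to the second, exactly as the true Euclidean distances would dictate.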
K-means algorithm: step-by-step
Step 1: Specify the number of clusters to form, k
Step 2: Randomly initialize k centroids
Repeat until convergence:
Assignment step: assign each point to the nearest centroid
For a data point x_s, find i^* = \arg\min_i d^2(x_s, c_i)
Center update step: recompute each centroid as the mean of all points assigned to its cluster
Centroid update: \mu_k = \frac{1}{|C_k|} \sum_{x \in C_k} x
Convergence criterion: stop when centroid positions do not change (or assignments stabilize)
Quick note: This process is Lloyd’s algorithm; finding the global optimum of the objective is hard in general, but empirically the method converges within a few iterations
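The steps above can be sketched as a minimal Lloyd's-algorithm loop in numpy; the function name and the toy two-blob dataset are our own illustration, not from the slides:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: initialize centroids as k randomly chosen data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: nearest centroid by squared L2 distance.
        sq = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = sq.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points
        # (an empty cluster keeps its old centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Convergence criterion: stop when centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Two well-separated toy blobs (made-up data).
X = np.array([[0., 0.], [0., 1.], [1., 0.], [10., 10.], [10., 11.], [11., 10.]])
centroids, labels = kmeans(X, k=2)
```

On this data the loop recovers the two blobs and their means regardless of which two points are picked as initial centroids.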
Simple worked example (2 clusters, 6 data points)
Initial data setup (2 clusters, k=2) and initial centroids selected from data points
Iteration 1:
Compute squared L2 distances from each point to each centroid
Assign points to nearest centroid (Cluster 0 or Cluster 1)
Update centroids by taking the mean of points in each cluster
Iteration 2 and beyond:
Recompute assignments and centroids
Convergence:
In the provided example, convergence occurs after a few iterations (the slide notes that the 5th assignment step only confirms convergence, so 4 iterations suffice in practice)
Example results (from slides):
After iteration 1, centroid updates yield new centroids: E = [1.5, 0], F = [-0.5, 1]
Assignment step examples: Cluster 1 includes A and D; Cluster 2 includes B and C
Distances used in example: squared L2 values shown in a table (A, B, C, D, E, F, with distances to centroids)
Key formulas from the example
Assignment criterion for a data point x to a centroid c_i: i^* = \arg\min_i \|x - c_i\|^2
Centroid update after assignment to cluster C_k: \mu_k = \frac{1}{|C_k|} \sum_{x \in C_k} x
In the 2-cluster example, after first update:
E = [1.5, 0], F = [-0.5, 1]
Tabulated squared L2 distances illustrate how the assignments change between iterations
The Elbow method for choosing k
Idea: Evaluate clustering performance (average distance to centroid) for different k values
Procedure:
For each k, compute the average distance to centroids across all data points
Plot average distance to centroid vs k; look for an "elbow" where the rate of decrease sharply changes
Practical interpretation:
A small k yields large reductions in average distance as k increases; after a point, increases in k yield diminishing returns
The elbow point suggests a reasonable trade-off between compactness and model complexity
Example metric described in slides: for k=2, the average distance to centroids was computed as 2.25 in the specific dataset using squared L2 distances
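A minimal sketch of the elbow procedure: run K-means for several values of k and record the average distance from each point to its assigned centroid. The three Gaussian blobs and the tiny restarted K-means helper are illustrative assumptions, not the slides' dataset:

```python
import numpy as np

# Three synthetic Gaussian blobs (made-up data for illustration).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, size=(30, 2)) for c in ([0, 0], [5, 5], [0, 5])])

def avg_dist_for_k(X, k, n_restarts=5, n_iters=50):
    # Best-of-several-restarts K-means, returning the average distance
    # from each point to its assigned centroid.
    best = np.inf
    for r in range(n_restarts):
        init_rng = np.random.default_rng(100 * k + r)
        centroids = X[init_rng.choice(len(X), k, replace=False)]
        for _ in range(n_iters):
            sq = ((X[:, None] - centroids[None]) ** 2).sum(axis=2)
            labels = sq.argmin(axis=1)
            centroids = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                                  else centroids[j] for j in range(k)])
        sq = ((X[:, None] - centroids[None]) ** 2).sum(axis=2)
        best = min(best, np.sqrt(sq.min(axis=1)).mean())
    return best

avg = {k: avg_dist_for_k(X, k) for k in range(1, 6)}
# Plotting avg vs k shows a steep drop up to k = 3 (the true number of blobs)
# and diminishing returns afterwards: the "elbow" suggests k = 3.
```

The curve drops sharply while k is below the true cluster count and flattens beyond it, which is the elbow the procedure looks for.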
K-means++: smarter initialization
Motivation: Random initialization of centroids can lead to poor clustering due to bad starting points
K-means++ is a standard approach that improves initialization by spreading centroids apart
High-level idea: choose initial centroids sequentially to maximize their separation
Simple (simplified) version:
1) Choose the first centroid randomly from the data
2) For each candidate x, compute D(x) = distance to the nearest already-chosen centroid
3) Select the next centroid as the data point with the largest D(x)
4) Repeat steps 2-3 until k centroids are chosen
Full (original) version:
1) For each candidate x, compute D(x) as the distance to the nearest existing centroid
2) Compute a probability p(x) proportional to D(x)^2 and sample the next centroid from the data points according to p(x)
3) Repeat until k centroids are chosen
D(x) is a score indicating how far x is from the existing centroids; a larger D(x) means a better candidate for a new centroid
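The probabilistic version above can be sketched as follows; the function name and the toy dataset are our own, and the squared distances double as the D(x)^2 sampling weights:

```python
import numpy as np

def kmeanspp_init(X, k, rng):
    # 1) First centroid: a uniformly random data point.
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # 2) D(x)^2: squared distance to the nearest already-chosen centroid.
        sq = ((X[:, None] - np.array(centroids)[None]) ** 2).sum(axis=2)
        D2 = sq.min(axis=1)
        # 3) Sample the next centroid with probability proportional to D(x)^2.
        p = D2 / D2.sum()
        centroids.append(X[rng.choice(len(X), p=p)])
    return np.array(centroids)

# Two well-separated toy blobs (made-up data).
X = np.array([[0., 0.], [0., 1.], [1., 0.], [10., 10.], [10., 11.], [11., 10.]])
centroids = kmeanspp_init(X, 2, np.random.default_rng(0))
```

Note that an already-chosen point has D(x) = 0 and thus zero probability of being picked again, so the seeds are always distinct; with high probability the second seed lands in the other blob.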
References and further reading provided in slides
Practical notes on distance choices and metrics
K-means commonly uses Euclidean distance (L2 distance) as the default
Other distance options include:
L1 distance (Manhattan)
Cosine distance
Other distance functions exist; the choice affects cluster shapes and performance
The squared L2 distance is often used in practice because it simplifies computations and preserves order for assignment
Pros and cons of K-means
Pros:
Simple and fast; easy to implement
Cons:
Requires specifying K (the number of clusters)
Sensitive to outliers
Assumes clusters are spherical (or convex) and roughly equal in size; performs poorly on data with non-spherical shapes
Sensitive to initial centroid placement (initialization matters)
Limitations and visualization of failure cases
K-means struggles with:
Non-spherical (arbitrary) cluster shapes
Data with significant outliers
Highly imbalanced cluster sizes
Failure case insights shown in slides:
Non-spherical shapes can lead to poor clustering results or fragmentation of natural clusters
Bad initialization can yield suboptimal clusterings
Many outliers can distort cluster boundaries and centroid positions
K-means vs. other clustering approaches (context)
K-means is contrasted with hierarchical clustering in basic clustering workflows
Hierarchical clustering provides a different paradigm (not elaborated in detail in these slides), often used for dendrogram-based cluster discovery
Clustroid: a variant of K-means
Concept: replace the centroid with a clustroid
Clustroid definition: the data point within a cluster that minimizes the average distance to all other points in the same cluster
Clustroid is an actual data point (not an artificial mean point like the centroid)
Comparison:
Centroid is the mean of cluster points (may not correspond to an actual data point)
Clustroid is an existing point in the cluster that best represents the cluster
Use case: when it is desirable to have a representative, real data point rather than a synthetic average
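Finding a clustroid reduces to a pairwise-distance computation; the four cluster points below are made up for illustration:

```python
import numpy as np

# A small made-up cluster of 2-D points.
cluster = np.array([[0., 0.], [0., 2.], [2., 0.], [1., 1.]])

# Pairwise Euclidean distances within the cluster.
diffs = cluster[:, None, :] - cluster[None, :, :]
pairwise = np.sqrt((diffs ** 2).sum(axis=2))

# Clustroid: the member with the smallest mean distance to the other points.
clustroid = cluster[pairwise.mean(axis=1).argmin()]
```

Here the clustroid is the point [1, 1], an actual cluster member, while the centroid would be the synthetic mean [0.75, 0.75], which is not in the data.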
Clustering optimization perspective (Lloyd’s algorithm)
Objective is to minimize within-cluster variance (sum of squared distances to centroids)
Original optimization formulation (informal):
Given data X, cluster centers C = {c1,…,cK}, and assignments that map each x to a cluster, minimize the total within-cluster squared distance
Core components (as per slides):
Initialization: choose initial centers C1,…,CK
Assignment step: assign each x to the closest center (minimize distance to centers)
Centroid update step: update centers to be the mean of the assigned points within each cluster
Repeat until convergence
Equation of the optimization objective (formal form):
\min_{\{\mu_k\}, \{C_k\}} \sum_{k=1}^K \sum_{x \in C_k} \|x - \mu_k\|^2
Subject to: cluster assignments and centroid locations as defined by the steps above
Algorithmic interpretation: this framework is often called Lloyd’s algorithm
Practical implementation notes
Hardness of global optimization: Finding the global optimum of the clustering objective is generally intractable
Empirical behavior: K-means usually converges after a few iterations in practice
Mini-batch K-means for large-scale data: process small random subsets (mini-batches) to update centroids gradually via stochastic optimization
Reference: scikit-learn's MiniBatchKMeans implementation and related online/stochastic methods
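The mini-batch update can be sketched in standalone numpy (in the spirit of scikit-learn's MiniBatchKMeans, but this simplified version with a per-centroid 1/count learning rate is our own illustration, not the library's code):

```python
import numpy as np

def minibatch_kmeans(X, k, batch_size=10, n_steps=200, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)].astype(float)
    counts = np.zeros(k)  # points seen per centroid, sets the learning rate
    for _ in range(n_steps):
        # Draw a small random mini-batch instead of using the full dataset.
        batch = X[rng.choice(len(X), batch_size)]
        sq = ((batch[:, None] - centroids[None]) ** 2).sum(axis=2)
        labels = sq.argmin(axis=1)
        for x, j in zip(batch, labels):
            counts[j] += 1
            # Gradual stochastic update with per-centroid rate 1/counts[j].
            centroids[j] += (x - centroids[j]) / counts[j]
    return centroids

# Two synthetic blobs (made-up data).
X = np.vstack([np.random.default_rng(1).normal(c, 0.2, (50, 2))
               for c in ([0, 0], [5, 5])])
centroids = minibatch_kmeans(X, k=2)
```

Each centroid converges toward the running mean of the points assigned to it, so on well-separated blobs the result approaches the batch K-means solution at a fraction of the per-step cost.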
K-means in image segmentation and superpixels
Concept: treat image pixels as data points in feature space (e.g., intensity, color, or texture features)
Goal: segment the image by clustering pixels into a small number of groups
SLIC (Simple Linear Iterative Clustering) is a widely used superpixel algorithm related to segmentation via clustering
Segmentation as clustering: k is the number of segments (e.g., k = 2 for binary segmentation; k = 3 for a three-way split)
Practical takeaway: clustering-based approaches can be used to group pixels into superpixels that align with perceptual boundaries
References to SLIC: EPFL and related resources cited in slides
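As a minimal illustration of segmentation-as-clustering (not SLIC itself), pixels of a synthetic noisy image can be clustered by intensity with k = 2 to produce a binary segmentation; the image and initial guesses are made up:

```python
import numpy as np

# Synthetic test image: a bright square on a dark background, plus mild noise.
rng = np.random.default_rng(0)
img = np.zeros((20, 20))
img[5:15, 5:15] = 1.0
img += rng.normal(0, 0.05, img.shape)

pixels = img.reshape(-1, 1)            # each pixel is a 1-D data point
centroids = np.array([[0.2], [0.8]])   # rough initial intensity guesses
for _ in range(20):
    # Assignment by (absolute) distance in intensity space, then mean update.
    labels = np.abs(pixels - centroids.T).argmin(axis=1)
    centroids = np.array([[pixels[labels == j].mean()] for j in (0, 1)])

segmentation = labels.reshape(img.shape)   # 0 = background, 1 = square
```

The two centroids converge near the background and foreground intensities, and the label image recovers the 10x10 bright square; richer features (color, texture, pixel position as in SLIC) extend the same idea to superpixels.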
Summary of practical guidance for exams and applications
Always check whether to use squared L2 distance for efficiency and equivalence in ordering during assignment
Start with a reasonable k using the elbow method and/or domain knowledge; use K-means++ for better initialization
Be aware of the limitations: outliers, non-spherical shapes, and initialization sensitivity
For large-scale data, consider Mini-batch K-means as a scalable variant
When representing a cluster with a real data point, consider clustroid as an alternative to centroid
For image segmentation tasks, clustering ideas extend to segmentation techniques like SLIC to generate superpixels
Quick reference formulas
Squared L2 distance: d^2(x, y) = \sum_{i=1}^n (x_i - y_i)^2
Euclidean distance: d(x, y) = \sqrt{\sum_{i=1}^n (x_i - y_i)^2}
Centroid update (mean): \mu_k = \frac{1}{|C_k|} \sum_{x \in C_k} x
K-means objective (within-cluster sum of squares): J = \sum_{k=1}^K \sum_{x \in C_k} \|x - \mu_k\|^2
Assignment rule: for a data point x, assign to cluster i^* where i^* = \arg\min_i \|x - c_i\|^2
K-means++ ideas:
D(x) = distance to nearest existing centroid
New centroid selection favors points with large D(x) values
Probability version: p(x) proportional to D(x)^2 for selecting the next centroid
References and additional resources mentioned in the slides
K-means++ original description and implementations
Stanford CS131 notes on the optimization view and Lloyd’s algorithm
scikit-learn documentation for MiniBatchKMeans and clustering examples
Image segmentation references: SLIC superpixels, image clustering tutorials
Elbow method and clustering evaluation resources