K-means and Clustering
Clustering: Overview
Basic methods covered: K-means, Hierarchical clustering
Extensions and related topics mentioned: K-means++, optimization problem (non-examinable), applications such as image clustering and segmentation
Source context: Lecture slides from Lin Guosheng, Nanyang Technological University
K-means: Core ideas
Goal: Partition data into K clusters by minimizing within-cluster variability
Key components: centroids (cluster centers) and assignment of points to the nearest centroid
Common distance measure: Euclidean distance (L2)
Other distance options mentioned: L1 distance, Cosine distance; there are additional distance functions beyond these
Visualization context: Examples include image clustering and segmentation tasks
Distances and distance-based assignment
Euclidean (L2) distance between x and y: d(x, y) = \sqrt{\sum_{i=1}^n (x_i - y_i)^2}
Squared Euclidean distance (often used in K-means to avoid the square root): d^2(x, y) = \sum_{i=1}^n (x_i - y_i)^2
Important note: In exams, squared L2 distance is typically sufficient for comparisons (often no need to take the square root)
Distances are used for assignment: each point is assigned to the nearest centroid by comparing distances
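The distance-based assignment above can be sketched in numpy; the points and centroids below are made-up illustration data, and the key point is that comparing squared distances gives the same nearest centroid as comparing true distances:

```python
import numpy as np

# Illustrative data points and centroids (made up for this sketch).
points = np.array([[1.0, 0.0], [2.0, 1.0], [-1.0, 1.0]])
centroids = np.array([[1.5, 0.0], [-0.5, 1.0]])

# Squared L2 distances from every point to every centroid: shape (n_points, k).
sq_dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)

# argmin over squared distances equals argmin over true distances,
# because sqrt is monotonically increasing -- no square root needed.
assignments = sq_dists.argmin(axis=1)
```

Here the first two points fall to the first centroid and the third to the second, exactly as the true Euclidean distances would dictate.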
K-means algorithm: step-by-step
Step 1: Specify the number of clusters to form, k
Step 2: Randomly initialize k centroids
Repeat until convergence:
Assignment step: assign each point to the nearest centroid
For a data point x_s, find i^* = \arg\min_i d^2(x_s, c_i)
Center update step: recompute each centroid as the mean of all points assigned to its cluster
Centroid update: \mu_k = \frac{1}{|C_k|} \sum_{x \in C_k} x
Convergence criterion: stop when centroid positions do not change (or assignments stabilize)
Quick note: This process is Lloyd’s algorithm; finding the global optimum of the objective is hard in general, but empirically the method converges within a few iterations
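The steps above can be sketched as a minimal Lloyd's-algorithm loop in numpy; the function name and the toy two-blob dataset are our own illustration, not from the slides:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: initialize centroids as k randomly chosen data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: nearest centroid by squared L2 distance.
        sq = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = sq.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points
        # (an empty cluster keeps its old centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Convergence criterion: stop when centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Two well-separated toy blobs (made-up data).
X = np.array([[0., 0.], [0., 1.], [1., 0.], [10., 10.], [10., 11.], [11., 10.]])
centroids, labels = kmeans(X, k=2)
```

On this data the loop recovers the two blobs and their means regardless of which two points are picked as initial centroids.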
Simple worked example (2 clusters, 6 data points)
Initial data setup (2 clusters, k=2) and initial centroids selected from data points
Iteration 1:
Compute squared L2 distances from each point to each centroid
Assign points to nearest centroid (Cluster 0 or Cluster 1)
Update centroids by taking the mean of points in each cluster
Iteration 2 and beyond:
Recompute assignments and centroids
Convergence:
In the provided example, convergence occurs after a few iterations (the slide notes that the 5th assignment step only confirms convergence, so 4 iterations suffice in practice)
Example results (from slides):
After iteration 1, centroid updates yield new centroids: E = [1.5, 0], F = [-0.5, 1]
Assignment step examples: Cluster 1 includes A and D; Cluster 2 includes B and C
Distances used in example: squared L2 values shown in a table (A, B, C, D, E, F, with distances to centroids)
Key formulas from the example
Assignment criterion for a data point x to a centroid c_i: i^* = \arg\min_i \|x - c_i\|^2
Centroid update after assignment to cluster C_k: \mu_k = \frac{1}{|C_k|} \sum_{x \in C_k} x
In the 2-cluster example, after first update:
E = [1.5, 0], F = [-0.5, 1]
Tabulated squared L2 distances illustrate how the assignments change between iterations
The Elbow method for choosing k
Idea: Evaluate clustering performance (average distance to centroid) for different k values
Procedure:
For each k, compute the average distance to centroids across all data points
Plot average distance to centroid vs k; look for an "elbow" where the rate of decrease sharply changes
Practical interpretation:
A small k yields large reductions in average distance as k increases; after a point, increases in k yield diminishing returns
The elbow point suggests a reasonable trade-off between compactness and model complexity
Example metric described in slides: for k=2, the average distance to centroids was computed as 2.25 in the specific dataset using squared L2 distances
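A minimal sketch of the elbow procedure: run K-means for several values of k and record the average distance from each point to its assigned centroid. The three Gaussian blobs and the tiny restarted K-means helper are illustrative assumptions, not the slides' dataset:

```python
import numpy as np

# Three synthetic Gaussian blobs (made-up data for illustration).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, size=(30, 2)) for c in ([0, 0], [5, 5], [0, 5])])

def avg_dist_for_k(X, k, n_restarts=5, n_iters=50):
    # Best-of-several-restarts K-means, returning the average distance
    # from each point to its assigned centroid.
    best = np.inf
    for r in range(n_restarts):
        init_rng = np.random.default_rng(100 * k + r)
        centroids = X[init_rng.choice(len(X), k, replace=False)]
        for _ in range(n_iters):
            sq = ((X[:, None] - centroids[None]) ** 2).sum(axis=2)
            labels = sq.argmin(axis=1)
            centroids = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                                  else centroids[j] for j in range(k)])
        sq = ((X[:, None] - centroids[None]) ** 2).sum(axis=2)
        best = min(best, np.sqrt(sq.min(axis=1)).mean())
    return best

avg = {k: avg_dist_for_k(X, k) for k in range(1, 6)}
# Plotting avg vs k shows a steep drop up to k = 3 (the true number of blobs)
# and diminishing returns afterwards: the "elbow" suggests k = 3.
```

The curve drops sharply while k is below the true cluster count and flattens beyond it, which is the elbow the procedure looks for.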
K-means++: smarter initialization
Motivation: Random initialization of centroids can lead to poor clustering due to bad starting points
K-means++ is a standard approach that improves initialization by spreading centroids apart
High-level idea: choose initial centroids sequentially to maximize their separation
Simple (simplified) version:
1) Choose the first centroid randomly from the data
2) For each candidate x, compute D(x) = distance to the nearest already-chosen centroid
3) Select the next centroid as the data point with the largest D(x)
4) Repeat steps 2-3 until k centroids are chosen
Full (original) version:
1) For each candidate x, compute D(x) as the distance to the nearest existing centroid
2) Compute a probability p(x) proportional to D(x)^2 and sample the next centroid from the data points according to p(x)
3) Repeat until k centroids are chosen
D(x) is a score indicating how far x is from the existing centroids; a larger D(x) means a better candidate for a new centroid
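The probabilistic version above can be sketched as follows; the function name and the toy dataset are our own, and the squared distances double as the D(x)^2 sampling weights:

```python
import numpy as np

def kmeanspp_init(X, k, rng):
    # 1) First centroid: a uniformly random data point.
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # 2) D(x)^2: squared distance to the nearest already-chosen centroid.
        sq = ((X[:, None] - np.array(centroids)[None]) ** 2).sum(axis=2)
        D2 = sq.min(axis=1)
        # 3) Sample the next centroid with probability proportional to D(x)^2.
        p = D2 / D2.sum()
        centroids.append(X[rng.choice(len(X), p=p)])
    return np.array(centroids)

# Two well-separated toy blobs (made-up data).
X = np.array([[0., 0.], [0., 1.], [1., 0.], [10., 10.], [10., 11.], [11., 10.]])
centroids = kmeanspp_init(X, 2, np.random.default_rng(0))
```

Note that an already-chosen point has D(x) = 0 and thus zero probability of being picked again, so the seeds are always distinct; with high probability the second seed lands in the other blob.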
References and further reading provided in slides
Practical notes on distance choices and metrics
K-means commonly uses Euclidean distance (L2 distance) as the default
Other distance options include:
L1 distance (Manhattan)
Cosine distance
Other distance functions exist; the choice affects cluster shapes and performance
The squared L2 distance is often used in practice because it simplifies computations and preserves order for assignment
Pros and cons of K-means
Pros:
Simple and fast; easy to implement
Cons:
Requires specifying K (the number of clusters)
Sensitive to outliers
Assumes clusters are spherical (or convex) and roughly equal in size; performs poorly on data with non-spherical shapes
Sensitive to initial centroid placement (initialization matters)
Limitations and visualization of failure cases
K-means struggles with:
Non-spherical (arbitrary) cluster shapes
Data with significant outliers
Highly imbalanced cluster sizes
Failure case insights shown in slides:
Non-spherical shapes can lead to poor clustering results or fragmentation of natural clusters
Bad initialization can yield suboptimal clusterings
Many outliers can distort cluster boundaries and centroid positions
K-means vs. other clustering approaches (context)
K-means is contrasted with hierarchical clustering in basic clustering workflows
Hierarchical clustering provides a different paradigm (not elaborated in detail in these slides), often used for dendrogram-based cluster discovery
Clustroid: a variant of K-means
Concept: replace the centroid with a clustroid
Clustroid definition: the data point within a cluster that minimizes the average distance to all other points in the same cluster
Clustroid is an actual data point (not an artificial mean point like the centroid)
Comparison:
Centroid is the mean of cluster points (may not correspond to an actual data point)
Clustroid is an existing point in the cluster that best represents the cluster
Use case: when it is desirable to have a representative, real data point rather than a synthetic average
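Finding a clustroid reduces to a pairwise-distance computation; the four cluster points below are made up for illustration:

```python
import numpy as np

# A small made-up cluster of 2-D points.
cluster = np.array([[0., 0.], [0., 2.], [2., 0.], [1., 1.]])

# Pairwise Euclidean distances within the cluster.
diffs = cluster[:, None, :] - cluster[None, :, :]
pairwise = np.sqrt((diffs ** 2).sum(axis=2))

# Clustroid: the member with the smallest mean distance to the other points.
clustroid = cluster[pairwise.mean(axis=1).argmin()]
```

Here the clustroid is the point [1, 1], an actual cluster member, while the centroid would be the synthetic mean [0.75, 0.75], which is not in the data.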
Clustering optimization perspective (Lloyd’s algorithm)
Objective is to minimize within-cluster variance (sum of squared distances to centroids)
Original optimization formulation (informal):
Given data X, cluster centers C = {c1,…,cK}, and assignments that map each x to a cluster, minimize the total within-cluster squared distance
Core components (as per slides):
Initialization: choose initial centers C1,…,CK
Assignment step: assign each x to the closest center (minimize distance to centers)
Centroid update step: update centers to be the mean of the assigned points within each cluster
Repeat until convergence
Equation of the optimization objective (formal form):
\min_{\{\mu_k\}, \{C_k\}} \sum_{k=1}^K \sum_{x \in C_k} \|x - \mu_k\|^2
Subject to: cluster assignments and centroid locations as defined by the steps above
Algorithmic interpretation: this framework is often called Lloyd’s algorithm
Practical implementation notes
Hardness of global optimization: Finding the global optimum of the clustering objective is generally intractable
Empirical behavior: K-means usually converges after a few iterations in practice
Mini-batch K-means for large-scale data: process small random subsets (mini-batches) to update centroids gradually via stochastic optimization
Reference: scikit-learn's MiniBatchKMeans implementation and related online/stochastic methods
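The mini-batch update can be sketched in standalone numpy (in the spirit of scikit-learn's MiniBatchKMeans, but this simplified version with a per-centroid 1/count learning rate is our own illustration, not the library's code):

```python
import numpy as np

def minibatch_kmeans(X, k, batch_size=10, n_steps=200, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)].astype(float)
    counts = np.zeros(k)  # points seen per centroid, sets the learning rate
    for _ in range(n_steps):
        # Draw a small random mini-batch instead of using the full dataset.
        batch = X[rng.choice(len(X), batch_size)]
        sq = ((batch[:, None] - centroids[None]) ** 2).sum(axis=2)
        labels = sq.argmin(axis=1)
        for x, j in zip(batch, labels):
            counts[j] += 1
            # Gradual stochastic update with per-centroid rate 1/counts[j].
            centroids[j] += (x - centroids[j]) / counts[j]
    return centroids

# Two synthetic blobs (made-up data).
X = np.vstack([np.random.default_rng(1).normal(c, 0.2, (50, 2))
               for c in ([0, 0], [5, 5])])
centroids = minibatch_kmeans(X, k=2)
```

Each centroid converges toward the running mean of the points assigned to it, so on well-separated blobs the result approaches the batch K-means solution at a fraction of the per-step cost.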
K-means in image segmentation and superpixels
Concept: treat image pixels as data points in feature space (e.g., intensity, color, or texture features)
Goal: segment the image by clustering pixels into a small number of groups
SLIC (Simple Linear Iterative Clustering) is a widely used superpixel algorithm related to segmentation via clustering
Segmentation as clustering: k is the number of segments (e.g., k = 2 for binary segmentation; k = 3 for a three-way split)
Practical takeaway: clustering-based approaches can be used to group pixels into superpixels that align with perceptual boundaries
References to SLIC: EPFL and related resources cited in slides
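As a minimal illustration of segmentation-as-clustering (not SLIC itself), pixels of a synthetic noisy image can be clustered by intensity with k = 2 to produce a binary segmentation; the image and initial guesses are made up:

```python
import numpy as np

# Synthetic test image: a bright square on a dark background, plus mild noise.
rng = np.random.default_rng(0)
img = np.zeros((20, 20))
img[5:15, 5:15] = 1.0
img += rng.normal(0, 0.05, img.shape)

pixels = img.reshape(-1, 1)            # each pixel is a 1-D data point
centroids = np.array([[0.2], [0.8]])   # rough initial intensity guesses
for _ in range(20):
    # Assignment by (absolute) distance in intensity space, then mean update.
    labels = np.abs(pixels - centroids.T).argmin(axis=1)
    centroids = np.array([[pixels[labels == j].mean()] for j in (0, 1)])

segmentation = labels.reshape(img.shape)   # 0 = background, 1 = square
```

The two centroids converge near the background and foreground intensities, and the label image recovers the 10x10 bright square; richer features (color, texture, pixel position as in SLIC) extend the same idea to superpixels.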
Summary of practical guidance for exams and applications
Always check whether to use squared L2 distance for efficiency and equivalence in ordering during assignment
Start with a reasonable k using the elbow method and/or domain knowledge; use K-means++ for better initialization
Be aware of the limitations: outliers, non-spherical shapes, and initialization sensitivity
For large-scale data, consider Mini-batch K-means as a scalable variant
When representing a cluster with a real data point, consider clustroid as an alternative to centroid
For image segmentation tasks, clustering ideas extend to segmentation techniques like SLIC to generate superpixels
Quick reference formulas
Squared L2 distance: d^2(x, y) = \sum_{i=1}^n (x_i - y_i)^2
Euclidean distance: d(x, y) = \sqrt{\sum_{i=1}^n (x_i - y_i)^2}
Centroid update (mean): \mu_k = \frac{1}{|C_k|} \sum_{x \in C_k} x
K-means objective (within-cluster sum of squares): J = \sum_{k=1}^K \sum_{x \in C_k} \|x - \mu_k\|^2
Assignment rule: for a data point x, assign to cluster i^* where i^* = \arg\min_i \|x - c_i\|^2
K-means++ ideas:
D(x) = distance to nearest existing centroid
New centroid selection favors points with large D(x) values
Probability version: p(x) proportional to D(x)^2 for selecting the next centroid
References and additional resources mentioned in the slides
K-means++ original description and implementations
Stanford CS131 notes on the optimization view and Lloyd’s algorithm
scikit-learn documentation for MiniBatchKMeans and clustering examples
Image segmentation references: SLIC superpixels, image clustering tutorials
Elbow method and clustering evaluation resources