Hierarchical Clustering

Agglomerative Hierarchical Clustering

  • A bottom-up approach to hierarchical clustering.
  • Each data point starts as its own cluster; clusters are iteratively merged until a stopping condition is met.
  • Alternative: Divisive (top-down) approach starts with all points in one cluster and recursively splits into smaller clusters.

Algorithm Overview

  • Initialization: Each data point is its own cluster (N clusters for N points).
  • Repeat until stopping criterion:
    • Find the two most similar (nearest) clusters.
    • Merge them into a new (parent) cluster.
  • Stopping criteria:
    • Stop when a predefined number of clusters k is reached.
    • Or stop when there is only one cluster.
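The merge loop above can be sketched in plain Python (a minimal illustration, not an optimized implementation). Single linkage with Euclidean distance is assumed here as the cluster-similarity measure; the linkage criteria themselves are defined later in these notes.

```python
import math

def agglomerative(points, k):
    """Merge the two nearest clusters until only k clusters remain.

    Single linkage with Euclidean distance is assumed as the
    cluster-similarity measure (linkage criteria are defined later).
    """
    # Initialization: each point index is its own cluster.
    clusters = [[i] for i in range(len(points))]

    def cluster_dist(c1, c2):
        # Single linkage: minimum pairwise distance across the two clusters.
        return min(math.dist(points[x], points[y]) for x in c1 for y in c2)

    while len(clusters) > k:
        # Find the two most similar (nearest) clusters ...
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda p: cluster_dist(clusters[p[0]], clusters[p[1]]),
        )
        # ... and merge them into one new (parent) cluster.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# The eight example points used below in these notes:
pts = [(2.25, 6.48), (5.49, 3.32), (4.45, 7.65), (7.66, 1.73),
       (3.83, 2.89), (0.94, 4.41), (4.62, 4.28), (1.06, 4.03)]
print(agglomerative(pts, 2))  # two clusters: {0, 2, 5, 7} and {1, 3, 4, 6}
```

With k = 2, this reproduces the two-cluster split shown in the step-wise progression below.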

Data Example (8 points)

  • Data points (index: [x, y]):

    • 0: [2.25, 6.48]
    • 1: [5.49, 3.32]
    • 2: [4.45, 7.65]
    • 3: [7.66, 1.73]
    • 4: [3.83, 2.89]
    • 5: [0.94, 4.41]
    • 6: [4.62, 4.28]
    • 7: [1.06, 4.03]
  • Initial state (each point its own cluster):

    • Cluster 1: [0]
    • Cluster 2: [1]
    • Cluster 3: [2]
    • Cluster 4: [3]
    • Cluster 5: [4]
    • Cluster 6: [5]
    • Cluster 7: [6]
    • Cluster 8: [7]
  • First merge: two nearest clusters merged into a new cluster

    • Example after first merge: Cluster 6: [5, 7]
    • Remaining clusters: [0], [1], [2], [3], [4], [5,7], [6]
    • Dendrogram is used to describe the clustering process as a tree
  • Second merge: next nearest clusters merged

    • Example after second merge: Cluster 2: [1, 6]
  • Third merge: continue merging nearest clusters

    • Example after third merge: Cluster 2 becomes [1, 4, 6]
  • Step-wise clustering progression (illustrative sequence):

    • Step 5 (4 clusters): Cluster 1: [0, 5, 7], Cluster 2: [1, 4, 6], Cluster 3: [2], Cluster 4: [3]
    • Step 6 (3 clusters): Cluster 1: [0, 2, 5, 7], Cluster 2: [1, 4, 6], Cluster 3: [3]
    • Step 7 (2 clusters): Cluster 1: [0, 2, 5, 7], Cluster 2: [1, 3, 4, 6]
    • Step 8 (1 cluster): Cluster 1: [0, 1, 2, 3, 4, 5, 6, 7]
  • Dendrogram

    • A tree-like diagram showing hierarchical relationships between data points and clusters.
    • Used to describe the clustering process in a convenient visual form.
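As a sketch of how this progression and its dendrogram can be reproduced with a standard library (SciPy here is my choice, not part of the original notes), `scipy.cluster.hierarchy.linkage` builds the full merge history in one call:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# The eight example points from above.
X = np.array([[2.25, 6.48], [5.49, 3.32], [4.45, 7.65], [7.66, 1.73],
              [3.83, 2.89], [0.94, 4.41], [4.62, 4.28], [1.06, 4.03]])

# Each row of Z records one merge: [cluster_a, cluster_b, distance, size].
# Newly formed clusters get ids 8, 9, ... in merge order.
Z = linkage(X, method="single")  # single linkage, Euclidean by default
print(Z[0])  # the first merge joins points 5 and 7, matching the notes

# To draw the tree (requires matplotlib):
# from scipy.cluster.hierarchy import dendrogram; dendrogram(Z)
```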

Question: How do we define similarity between clusters?

  • To run the algorithm, a point-to-point distance is not enough: we must also define the similarity (distance) between two clusters.
  • In hierarchical clustering, cluster similarity is defined via a linkage criterion. Common options are:
    • Single linkage
    • Complete linkage
    • Average linkage
    • Centroid method

Linkage methods (cluster distance definitions)

  • Single linkage

    • Definition: D(C_1, C_2) = \min_{x \in C_1,\, y \in C_2} D(x, y)
    • Interpretation: distance between clusters is the minimum distance between any pair of points, one from each cluster.
  • Complete linkage

    • Definition: D(C_1, C_2) = \max_{x \in C_1,\, y \in C_2} D(x, y)
    • Interpretation: distance between clusters is the maximum distance between any pair of points, one from each cluster.
  • Average linkage

    • Definition: D(C_1, C_2) = \frac{1}{|C_1|\,|C_2|} \sum_{x \in C_1} \sum_{y \in C_2} D(x, y)
    • Interpretation: average distance over all cross-cluster point pairs.
  • Centroid method

    • Concept: combine clusters based on the distance between their centroids.
    • Definition (distance between centroids): D(C_1, C_2) = \left\| \frac{1}{|C_1|} \sum_{x \in C_1} x - \frac{1}{|C_2|} \sum_{y \in C_2} y \right\|
    • Interpretation: distance between cluster centroids, using the Euclidean norm; clusters merged by closest centroids.
  • Note: All of the above require a base distance measure D(x, y) between points (often Euclidean distance).
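The four criteria can be written directly from the definitions above. In this minimal sketch, a cluster is represented as a list of point coordinates, and the base distance D(x, y) is Euclidean:

```python
import math

def d(x, y):
    """Base point-to-point distance D(x, y): Euclidean."""
    return math.dist(x, y)

def single_linkage(c1, c2):
    # Minimum distance over all cross-cluster pairs.
    return min(d(x, y) for x in c1 for y in c2)

def complete_linkage(c1, c2):
    # Maximum distance over all cross-cluster pairs.
    return max(d(x, y) for x in c1 for y in c2)

def average_linkage(c1, c2):
    # Mean distance over all |C1| * |C2| cross-cluster pairs.
    return sum(d(x, y) for x in c1 for y in c2) / (len(c1) * len(c2))

def centroid_distance(c1, c2):
    # Euclidean distance between the two cluster centroids (means).
    m1 = [sum(coords) / len(c1) for coords in zip(*c1)]
    m2 = [sum(coords) / len(c2) for coords in zip(*c2)]
    return math.dist(m1, m2)
```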

Worked example: Single linkage distance calculation (Illustrative Iteration 4)

  • Data points (index: [x, y]):

    • 0: [2.25, 6.48], 1: [5.49, 3.32], 2: [4.45, 7.65], 3: [7.66, 1.73], 4: [3.83, 2.89], 5: [0.94, 4.41], 6: [4.62, 4.28], 7: [1.06, 4.03]
  • Iteration 4: distance between Cluster [1, 4, 6] and Cluster [5, 7]

    • Distances between all pairs (indices shown):
    • dist(1, 5) ≈ 4.679
    • dist(1, 7) ≈ 4.487
    • dist(6, 5) ≈ 3.682
    • dist(6, 7) ≈ 3.569
    • dist(4, 5) ≈ 3.265
    • dist(4, 7) ≈ 2.995
    • Single linkage chooses the minimum of these: cluster distance = dist(4, 7) ≈ 2.995
  • The base distance D(x, y) is the point-to-point distance used in the calculation above.
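The calculation above can be checked numerically. This small sketch uses the example coordinates and Euclidean distance as D(x, y):

```python
import math

points = {0: (2.25, 6.48), 1: (5.49, 3.32), 2: (4.45, 7.65), 3: (7.66, 1.73),
          4: (3.83, 2.89), 5: (0.94, 4.41), 6: (4.62, 4.28), 7: (1.06, 4.03)}

c1, c2 = [1, 4, 6], [5, 7]  # the two clusters compared in iteration 4

# All cross-cluster pair distances, then the single-linkage minimum.
pairs = {(i, j): math.dist(points[i], points[j]) for i in c1 for j in c2}
closest = min(pairs, key=pairs.get)
print(closest, round(pairs[closest], 3))  # (4, 7) is the closest pair, ~2.995
```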

Stopping criteria

  • We stop merging when we have a predefined number of clusters k.
  • Alternatively, we may continue merging until there is only one cluster.
  • Practical note: In many analyses, a dendrogram is inspected to choose a reasonable k by cutting the tree at a chosen level.
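Both stopping rules correspond to cutting the full tree, which SciPy (an assumed library choice, not part of the original notes) exposes via `fcluster`. A sketch using the example points from above:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

X = np.array([[2.25, 6.48], [5.49, 3.32], [4.45, 7.65], [7.66, 1.73],
              [3.83, 2.89], [0.94, 4.41], [4.62, 4.28], [1.06, 4.03]])
Z = linkage(X, method="single")  # full merge history, down to one cluster

# Criterion 1: stop at a predefined number of clusters k.
labels_k = fcluster(Z, t=2, criterion="maxclust")

# Criterion 2 (dendrogram cut): keep only merges below a distance threshold.
labels_d = fcluster(Z, t=2.8, criterion="distance")

print(labels_k, labels_d)
```

With this data, both cuts yield the same two clusters, {0, 2, 5, 7} and {1, 3, 4, 6}.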

Pros and Cons

  • Pros

    • No need to fix the number of clusters k in advance: the full tree can be built first and cut at any level afterwards
    • No assumption about cluster shape (can capture irregular shapes)
    • Simple and intuitive
  • Cons

    • The algorithm cannot undo a merge; early mistakes persist
    • Different distance metrics can produce very different results
    • Single linkage is sensitive to noise and outliers (can lead to chaining)
    • Complete linkage tends to break large clusters (prefers compact clusters)
  • Notes on linkage behavior (illustrative effects)

    • Single linkage: may create long chain-like clusters by linking through near neighbors; outliers can bridge clusters
    • Complete linkage: tends to produce compact clusters and can fail to join large loosely connected groups
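These effects can be seen on a tiny 1-D example (the coordinates are made up for illustration; SciPy is an assumed library choice): six points whose gaps grow slowly.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Six 1-D points with slowly growing gaps: 1.0, 1.1, 1.2, 1.3, 1.4.
X = np.array([[0.0], [1.0], [2.1], [3.3], [4.6], [6.0]])

# Single linkage chains through near neighbors: cut at two clusters,
# it leaves the last point alone against one long 5-point chain.
single = fcluster(linkage(X, method="single"), t=2, criterion="maxclust")

# Complete linkage prefers compact groups and splits nearer the middle,
# giving clusters of sizes 4 and 2 on the same data.
complete = fcluster(linkage(X, method="complete"), t=2, criterion="maxclust")

print(single, complete)
```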

Additional contextual points

  • Notation used in the example:
    • Clusters C1, C2, … represent sets of data point indices
    • Distances D(x, y) are pairwise distances between points x and y
  • Real-world relevance:
    • Hierarchical clustering provides a multi-scale view of data structure via the dendrogram
    • Useful when the number of clusters is not known a priori; allows exploration at different levels of granularity

Summary of key points from the transcript

  • Dendrogram as a tree-structured representation of hierarchical clustering
  • Stepwise cluster evolution (8 data points) illustrating how clusters merge over iterations
  • Distance-based linkage criteria (Single, Complete, Average, Centroid)
  • Practical issues: stopping criteria, pros/cons, and sensitivity to noise/outliers