Clustering Notes

Clustering

Clustering groups similar objects into classes. Documents within a cluster should be similar, while documents from different clusters should be dissimilar. It is a form of unsupervised learning, using raw data without a priori classifications.

Applications of Clustering

Clustering has various applications, such as:

  1. Optimizing the placement of telephone towers.
  2. Strategic placement of patrol vans.
  3. Locating emergency-care wards.

Search Engine Uses

Search engines use clustering to:

  1. Improve search recall by grouping similar documents.
  2. Speed up vector space retrieval via a priori clustering.
  3. Create cleaner user interfaces.
  4. Automatically generate a thesaurus by clustering related terms.

Good Clustering

A good clustering exhibits:

  • High intra-class similarity.
  • Low inter-class similarity.
  • Stability under growth and with small errors.
  • Independence from initial object ordering.

Clustering vs. Classification

Clustering differs from classification:

  • Clustering groups objects to find relationships.
  • Classification assigns new objects to predefined classes.
  • Clustering often precedes classification.

Clustering Steps

  1. Cluster the documents.
  2. Label the clusters.
  3. Determine decision boundaries for new documents.

Key Considerations for Clustering

  • Document representation (usually vector space).
  • Similarity/distance computation (e.g., cosine similarity).
  • Determining the number of clusters.

Hard vs. Soft Clustering

  • Hard clustering: Each document belongs to one cluster.
  • Soft clustering: Documents can belong to multiple clusters.
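
A minimal sketch of the contrast, assuming scikit-learn and using a Gaussian mixture as the soft-clustering stand-in (the notes do not name a specific soft algorithm; data and parameters are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# Toy 2-D data: two loose groups of points.
X = np.array([[0.1, 0.2], [0.0, 0.1], [0.2, 0.0],
              [1.0, 1.1], [1.1, 0.9], [0.9, 1.0]])

# Hard clustering: each point gets exactly one cluster label.
hard = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Soft clustering: each point gets a membership probability per cluster.
soft = GaussianMixture(n_components=2, random_state=0).fit(X)
memberships = soft.predict_proba(X)

print(hard)         # e.g. [0 0 0 1 1 1]
print(memberships)  # one row of probabilities per document
```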

Similarity/Distance Measures

  • Cosine similarity: Measures the cosine of the angle between vectors; ranges from 0 to 1 for non-negative document vectors (and from -1 to 1 in general).
  • Euclidean distance: A popular alternative.

Clustering Algorithms

Two main methodologies:

  • Partitioning-based.
  • Hierarchical.

K-Means Clustering

  • Choose k random data items as means.
  • Iteratively refine:
    • Associate each item to the nearest cluster.
    • Recalculate cluster means.

Optimal K-Means

  • Minimize intra-class variance.
  • NP-hard problem; approximation algorithms are used.

K-Means Algorithm Formulation

  1. Assignment Step: Assign each observation to the cluster whose mean yields the least within-cluster sum of squares.
  2. Update step: Calculate the new means to be the centroids of the observations in the new clusters.
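
Written out (the S_i^{(t)} / m_i^{(t)} notation is added here for concreteness and is not from the notes):

Assignment step: S_i^{(t)} = \{ x_p : \|x_p - m_i^{(t)}\|^2 \le \|x_p - m_j^{(t)}\|^2 \ \forall j,\ 1 \le j \le k \}

Update step: m_i^{(t+1)} = \frac{1}{|S_i^{(t)}|} \sum_{x_p \in S_i^{(t)}} x_p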

Approximation Algorithm

  1. Select K points as initial centroids.
  2. Repeat:
    • Form K clusters by assigning each point to its closest centroid.
    • Re-compute the centroid of each cluster.
  3. Until centroids do not change or M iterations are reached.
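
A minimal NumPy sketch of this loop (function and parameter names are illustrative, not from the notes):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Lloyd's algorithm: assign points to the nearest centroid, then re-average."""
    rng = np.random.default_rng(seed)
    # 1. Select K points as initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # 2a. Form K clusters by assigning each point to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 2b. Re-compute the centroid of each cluster (keep old centroid if empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 3. Stop when centroids do not change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Example: labels, centroids = kmeans(np.random.rand(100, 2), k=3)
```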

Centroids

If x ranges over the vectors in a cluster c, then the centroid \mu(c) is the mean of those vectors:

\mu(c) = \frac{1}{|c|} \sum_{x \in c} x
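
With document vectors stored as rows of a NumPy array, the centroid is a one-line mean (a small sketch; the numbers are invented):

```python
import numpy as np

cluster = np.array([[1.0, 0.0, 2.0],
                    [3.0, 2.0, 0.0]])   # vectors x in cluster c, one per row
centroid = cluster.mean(axis=0)          # mu(c) = (1/|c|) * sum of vectors
print(centroid)                          # [2. 1. 1.]
```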

Distance Metrics

  • Euclidean distance (L2 norm): L_2(x, y) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}
  • L1 norm: L_1(x, y) = \sum_{i=1}^{m} |x_i - y_i|
  • Cosine distance: 1 - \frac{x \cdot y}{||x|| \cdot ||y||} (one minus the cosine similarity)
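
A sketch of the three measures in NumPy (function names are illustrative):

```python
import numpy as np

def l2(x, y):
    """Euclidean (L2) distance."""
    return np.sqrt(np.sum((x - y) ** 2))

def l1(x, y):
    """Manhattan (L1) distance."""
    return np.sum(np.abs(x - y))

def cosine_distance(x, y):
    """1 minus cosine similarity; 0 means identical direction."""
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

x, y = np.array([1.0, 0.0, 1.0]), np.array([0.0, 1.0, 1.0])
print(l2(x, y), l1(x, y), cosine_distance(x, y))
```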

Adjustments to K-Means

  • Multiple runs with different initial centroids.
  • k-means++ algorithm (select distant points as initial centers).
  • Termination conditions: fixed iterations, unchanged partition, or stable centroids.
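
With scikit-learn these adjustments map onto KMeans constructor parameters, roughly as sketched below (parameter values are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(200, 5)   # placeholder document vectors

km = KMeans(
    n_clusters=4,
    init="k-means++",   # spread-out initial centers
    n_init=10,          # multiple runs with different initial centroids
    max_iter=300,       # fixed iteration cap
    tol=1e-4,           # stop once centroid movement is negligible
    random_state=0,
).fit(X)

print(km.labels_[:10], km.inertia_)
```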

Time Complexity of K-Means

O(iknm)

  • m = vector size
  • n = number of vectors
  • k = number of clusters
  • i = iterations

Evaluation Metrics for K-Means

  • Inertia: Sum of squared distances of each point to the centroid of its cluster.
  • Dunn Index: Ratio of minimum inter-cluster distance to maximum intra-cluster distance.
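
Inertia is reported directly by scikit-learn's KMeans; the Dunn index is not built in, so the sketch below computes it from the definition above (helper name and data are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(100, 3)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("inertia:", km.inertia_)   # within-cluster sum of squared distances

def dunn_index(X, labels):
    """Minimum inter-cluster distance divided by maximum intra-cluster distance."""
    clusters = [X[labels == c] for c in np.unique(labels)]
    # Maximum distance between two points of the same cluster (cluster diameter).
    max_intra = max(
        np.linalg.norm(a - b) for c in clusters for a in c for b in c
    )
    # Minimum distance between points of two different clusters.
    min_inter = min(
        np.linalg.norm(a - b)
        for i in range(len(clusters)) for j in range(i + 1, len(clusters))
        for a in clusters[i] for b in clusters[j]
    )
    return min_inter / max_intra

print("Dunn index:", dunn_index(X, km.labels_))
```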

K-Means Difficulties

  • Varying cluster sizes and densities.
  • Non-globular shapes.
  • Outliers.

Choosing the Right Number of Clusters

Use the elbow method: plot inertia vs. number of clusters.
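
A sketch of the elbow plot, assuming scikit-learn and matplotlib (data and the range of k are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.rand(300, 4)   # placeholder document vectors

ks = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in ks]

plt.plot(list(ks), inertias, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("inertia")
plt.title("Elbow method: pick k where the curve bends")
plt.show()
```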

Evaluating K-Means Clusters

Sum of Squared Error (SSE):

SSE = \sum_{i=1}^{k} \sum_{x \in C_i} \mathrm{dist}(x, m_i)^2

Hierarchical Clustering

  • Agglomerative (bottom-up): merge closest clusters.
  • Divisive (top-down): split a cluster.

Agglomerative Clustering Algorithm

  1. Compute the distance matrix.
  2. Let each data point be a cluster.
  3. Repeat:
    • Merge the two closest clusters.
    • Update the distance matrix.
  4. Until a single cluster remains.
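
A naive sketch of these steps with single-link distances (the notes do not fix a particular cluster distance here; names are illustrative):

```python
import numpy as np

def agglomerative(X, n_clusters=1):
    """Naive agglomerative clustering using single-link distances."""
    # Steps 1-2: distance matrix; each point starts as its own cluster.
    clusters = [[i] for i in range(len(X))]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    while len(clusters) > n_clusters:
        # Step 3a: find the two closest clusters (single link: min point distance).
        best = (np.inf, None, None)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = D[np.ix_(clusters[a], clusters[b])].min()
                if d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Step 3b: merge the pair; distances are re-derived from D next round.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

X = np.random.rand(10, 2)
print(agglomerative(X, n_clusters=3))
```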

Distance Between Two Clusters

  1. Center of Gravity: Distance between centroids.
  2. Average Link: Average distance between all pairs of points.
  3. Single Link: Distance between the two closest points.
  4. Complete Link: Distance between the furthest points.
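
These four criteria correspond to the method argument of SciPy's linkage, roughly as sketched (mapping assumed from SciPy's naming):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.random.rand(20, 5)   # placeholder document vectors

Z_centroid = linkage(X, method="centroid")   # 1. center of gravity
Z_average  = linkage(X, method="average")    # 2. average link
Z_single   = linkage(X, method="single")     # 3. single link
Z_complete = linkage(X, method="complete")   # 4. complete link
```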

Dendrogram

A tree diagram illustrating the arrangement of clusters.
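
A sketch of drawing one with SciPy and matplotlib (data are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.random.rand(15, 4)            # placeholder document vectors
Z = linkage(X, method="average")     # records every merge, bottom-up

dendrogram(Z)                        # leaves = documents, merges drawn upward
plt.xlabel("document index")
plt.ylabel("merge distance")
plt.show()
```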

Hierarchical Agglomerative Clustering (HAC)

  • Starts with unclustered data and forms larger clusters.
  • Results in a hierarchy of clusters (dendrogram).

HAC Procedure

  1. Place each document in its own cluster.
  2. Compute pairwise document similarity.
  3. Form a new cluster by combining the most similar pair.
  4. Update the similarity matrix.
  5. Repeat until one cluster is left.

General HAC Algorithm and Complexity

  1. Compute similarity between all pairs of documents: O(N^2).
  2. Do N – 1 times:
    • Find closest pair of documents/clusters to merge.
    • Update similarity of all documents/clusters to the new cluster.

Naive running time is therefore O(N^3); maintaining a priority queue per row of the similarity matrix reduces it to O(N^2 log N).

Divisive Clustering Algorithm

  1. Start with all documents in one cluster.
  2. Split the cluster using a partitioning algorithm (e.g., k-means).
  3. Apply recursively until each document is in its own cluster.
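
A recursive sketch using 2-means as the partitioning step (a bisecting variant; names and the stopping rule are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

def divisive(X, indices=None, min_size=1):
    """Recursively split clusters with 2-means until each piece is small enough."""
    if indices is None:
        indices = np.arange(len(X))
    if len(indices) <= min_size:
        return [indices.tolist()]
    # Split the current cluster using a partitioning algorithm (k-means, k=2).
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[indices])
    left, right = indices[labels == 0], indices[labels == 1]
    if len(left) == 0 or len(right) == 0:   # degenerate split: stop recursing
        return [indices.tolist()]
    return divisive(X, left, min_size) + divisive(X, right, min_size)

X = np.random.rand(8, 3)
print(divisive(X))   # with min_size=1, every document ends up in its own cluster
```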

How to Label Clusters

  1. Show titles of typical documents.
  2. Show words/phrases prominent in the cluster.
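
A sketch of the second strategy, picking the highest-weighted terms from each cluster's tf-idf centroid (documents are invented; assumes scikit-learn):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "stock market trading prices",
    "market shares stock exchange",
    "football match goal score",
    "goal keeper football league",
]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

terms = vec.get_feature_names_out()
for c, centroid in enumerate(km.cluster_centers_):
    top = centroid.argsort()[::-1][:3]   # highest-weighted terms in the centroid
    print(f"cluster {c}:", [terms[i] for i in top])
```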