Clustering
Clustering groups similar objects into classes. Documents within a cluster should be similar, while documents from different clusters should be dissimilar. It is a form of unsupervised learning, using raw data without a priori classifications.
Applications of Clustering
Clustering has various applications, such as:
- Optimizing the placement of telephone towers.
- Strategic placement of patrol vans.
- Locating emergency-care wards.
Search Engine Uses
Search engines use clustering to:
- Improve search recall by grouping similar documents.
- Speed up vector space retrieval via a priori clustering.
- Create cleaner user interfaces.
- Automatically generate a thesaurus by clustering related terms.
Good Clustering
A good clustering exhibits:
- High intra-class similarity.
- Low inter-class similarity.
- Stability as the collection grows and robustness to small errors in the data.
- Independence from initial object ordering.
Clustering vs. Classification
Clustering differs from classification:
- Clustering groups objects to find relationships.
- Classification assigns new objects to predefined classes.
- Clustering often precedes classification.
Clustering Steps
- Cluster the documents.
- Label the clusters.
- Determine decision boundaries for new documents.
Key Considerations for Clustering
- Document representation (usually vector space).
- Similarity/distance computation (e.g., cosine similarity).
- Determining the number of clusters.
Hard vs. Soft Clustering
- Hard clustering: Each document belongs to one cluster.
- Soft clustering: Documents can belong to multiple clusters.
Similarity/Distance Measures
- Cosine similarity: Measures the cosine of the angle between vectors (0 to 1).
- Euclidean distance: A popular alternative.
Clustering Algorithms
Two main methodologies:
- Partitioning-based.
- Hierarchical.
K-Means Clustering
- Choose k random data items as means.
- Iteratively refine:
- Associate each item to the nearest cluster.
- Recalculate cluster means.
Optimal K-Means
- Minimize intra-class variance.
- NP-hard problem; approximation algorithms are used.
- Assignment step: Assign each observation to the cluster whose mean yields the least within-cluster sum of squares.
- Update step: Calculate the new means to be the centroids of the observations in the new clusters.
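In symbols, k-means searches for the partition that minimizes the within-cluster sum of squares. This formulation is standard; the symbols S_1, …, S_k (the clusters) and \mu_i (their means) are introduced here for illustration:

\underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2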
Approximation Algorithm
- Select K points as initial centroids.
- Repeat:
- Form K clusters by assigning each point to its closest centroid.
- Re-compute the centroid of each cluster.
- Until centroids do not change or M iterations reached.
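A minimal sketch of this loop in Python/NumPy (the function name, random initialization, and convergence test are illustrative, not from the source):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Basic k-means: X is an (n, m) array of n vectors of size m."""
    rng = np.random.default_rng(seed)
    # Select K points as initial centroids (chosen at random here).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: attach each point to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: re-compute the centroid of each cluster.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop when centroids do not change (or after max_iter iterations).
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```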
Centroids
If x ranges over the vectors in a cluster c, the centroid \mu(c) is their mean:
\mu(c) = \frac{1}{|c|} \sum_{x \in c} x
Distance Metrics
- Euclidean distance (L2 norm): L_2(x, y) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}
- L1 norm (Manhattan distance): L_1(x, y) = \sum_{i=1}^{m} |x_i - y_i|
- Cosine distance: 1 - \frac{x \cdot y}{\|x\| \, \|y\|} (one minus the cosine similarity).
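For reference, all three measures can be computed directly with NumPy; a small illustrative snippet (the sample vectors are arbitrary):

```python
import numpy as np

x, y = np.array([1.0, 2.0, 0.0]), np.array([2.0, 1.0, 1.0])

l2 = np.linalg.norm(x - y)          # Euclidean (L2) distance
l1 = np.abs(x - y).sum()            # L1 (Manhattan) distance
cos_sim = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
cos_dist = 1 - cos_sim              # cosine distance
```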
Adjustments to K-Means
- Multiple runs with different initial centroids.
- k-means++ algorithm (select distant points as initial centers).
- Termination conditions: fixed iterations, unchanged partition, or stable centroids.
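A sketch of the k-means++ seeding idea: the first center is a uniformly random data point, and each further center is drawn with probability proportional to its squared distance from the nearest center already chosen. The function name is assumed for illustration:

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # First center: a uniformly random data point.
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance of every point to its nearest chosen center.
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        # Sample the next center with probability proportional to d2.
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)
```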
Time Complexity of K-Means
O(iknm)
- m = vector size
- n = number of vectors
- k = number of clusters
- i = iterations
Evaluation Metrics for K-Means
- Inertia: Sum of squared distances of each point to its own cluster centroid, summed over all clusters.
- Dunn Index: Ratio of minimum inter-cluster distance to maximum intra-cluster distance.
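A rough sketch of both metrics. Note that the Dunn index has several variants; here inter-cluster distance is measured between centroids and intra-cluster distance as cluster diameter, which is one possible reading rather than the only one:

```python
import numpy as np
from itertools import combinations

def inertia(X, labels, centroids):
    # Sum of squared distances of points to their own cluster centroid.
    return sum(np.sum((X[labels == j] - c) ** 2) for j, c in enumerate(centroids))

def dunn_index(X, labels, centroids):
    k = len(centroids)
    # Minimum distance between any two cluster centroids (inter-cluster).
    min_inter = min(np.linalg.norm(centroids[a] - centroids[b])
                    for a, b in combinations(range(k), 2))
    # Maximum intra-cluster distance (cluster diameter).
    max_intra = max(
        max((np.linalg.norm(p - q) for p, q in combinations(X[labels == j], 2)),
            default=0.0)
        for j in range(k)
    )
    return min_inter / max_intra
```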
K-Means Difficulties
- Varying cluster sizes and densities.
- Non-globular shapes.
- Outliers.
Choosing the Right Number of Clusters
Use the elbow method: plot inertia against the number of clusters and choose the value at the "elbow", beyond which adding further clusters yields only diminishing reductions in inertia.
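A common way to produce the elbow plot, sketched here with scikit-learn (assumed to be available; `inertia_` is the attribute scikit-learn exposes for the within-cluster sum of squares):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def elbow_plot(X, max_k=10):
    inertias = []
    for k in range(1, max_k + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        inertias.append(km.inertia_)
    plt.plot(range(1, max_k + 1), inertias, marker="o")
    plt.xlabel("number of clusters k")
    plt.ylabel("inertia")
    plt.show()
```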
Evaluating K-Means Clusters
Sum of Squared Error (SSE):
SSE = \sum_{i=1}^{k} \sum_{x \in C_i} \mathrm{dist}(x, m_i)^2, where m_i is the centroid of cluster C_i.
Limitations of K-Means
- Problems with varying sizes, densities and non-globular shapes.
- Outliers.
Hierarchical Clustering
- Agglomerative (bottom-up): merge closest clusters.
- Divisive (top-down): split a cluster.
Agglomerative Clustering Algorithm
- Compute the distance matrix.
- Let each data point be a cluster.
- Repeat:
- Merge the two closest clusters.
- Update the distance matrix.
- Until a single cluster remains.
Distance Between Two Clusters
- Center of Gravity: Distance between centroids.
- Average Link: Average distance between all pairs of points.
- Single Link: Distance between the two closest points.
- Complete Link: Distance between the furthest points.
Dendrogram
A tree diagram illustrating the arrangement of clusters: each merge appears as a join, and cutting the tree at a chosen level yields a flat clustering.
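A short sketch using SciPy (assumed to be available) that builds an agglomerative clustering and draws the dendrogram; the `method` argument selects the linkage criterion listed above:

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

X = np.random.rand(10, 2)       # toy data: 10 points in 2-D

# 'single' = single link, 'complete' = complete link,
# 'average' = average link, 'centroid' = center of gravity.
Z = linkage(X, method="average")

dendrogram(Z)                   # tree diagram of the merge order
plt.show()
```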
Hierarchical Agglomerative Clustering (HAC)
- Starts with unclustered data and forms larger clusters.
- Results in a hierarchy of clusters (dendrogram).
HAC Procedure
- Place each document in its own cluster.
- Compute pairwise document similarity.
- Form a new cluster by combining the most similar pair.
- Update the similarity matrix.
- Repeat until one cluster is left.
General HAC Algorithm and Complexity
- Compute similarity between all pairs of documents.
- Do N – 1 times
- Find closest pair of documents/clusters to merge.
- Update similarity of all documents/clusters to new cluster.
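A naive implementation of this loop makes the cost visible: the initial distance matrix is O(N²), and each of the N − 1 merges rescans it, giving roughly O(N³) overall (about O(N² log N) with a priority queue). The sketch below is single link and purely illustrative; names and the matrix-update trick are not from the source:

```python
import numpy as np

def hac_single_link(X):
    """Naive single-link HAC over an (n, m) data matrix; returns merge order."""
    n = len(X)
    # Compute the distance matrix between all documents.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)
    active = set(range(n))          # ids of clusters still unmerged
    merges = []
    # Do N - 1 merges.
    for _ in range(n - 1):
        # Find the closest pair of active clusters.
        ids = sorted(active)
        sub = D[np.ix_(ids, ids)]
        i, j = np.unravel_index(np.argmin(sub), sub.shape)
        a, b = ids[i], ids[j]
        merges.append((a, b, D[a, b]))
        # Single-link update: distance to the merged cluster is the minimum
        # of the distances to its two parts (kept under id a).
        D[a, :] = np.minimum(D[a, :], D[b, :])
        D[:, a] = D[a, :]
        D[a, a] = np.inf
        # Retire cluster b.
        active.discard(b)
        D[b, :] = np.inf
        D[:, b] = np.inf
    return merges
```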
Divisive Clustering Algorithm
- Start with all documents in one cluster.
- Split the cluster using a partitioning algorithm (e.g., k-means).
- Apply recursively until each document is in its own cluster.
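One common realisation of this idea is bisecting k-means: repeatedly split the largest remaining cluster with 2-means. A rough sketch (using scikit-learn's KMeans, assumed available; it stops at a target number of clusters rather than going all the way down to singletons):

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, num_clusters):
    """Top-down clustering: repeatedly split the largest cluster with 2-means."""
    clusters = [np.arange(len(X))]              # start with one big cluster
    while len(clusters) < num_clusters:
        # Take the largest remaining cluster ...
        biggest = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        idx = clusters.pop(biggest)
        # ... and split it in two with k-means (k = 2).
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[idx])
        clusters.append(idx[labels == 0])
        clusters.append(idx[labels == 1])
    return clusters                              # list of index arrays
```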
How to Label Clusters
- Show titles of typical documents.
- Show words/phrases prominent in the cluster.
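A common heuristic for the second option is to take the highest-weighted terms in each cluster centroid of a tf-idf representation. A sketch with scikit-learn (assumed available; the toy documents are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["stocks fall on rate fears", "central bank raises rates",
        "team wins league title", "striker scores twice in final"]

vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

terms = vec.get_feature_names_out()
for j, centroid in enumerate(km.cluster_centers_):
    top = centroid.argsort()[::-1][:3]           # 3 most prominent terms
    print(f"cluster {j}: " + ", ".join(terms[t] for t in top))
```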