Clustering
Clustering groups similar objects into classes. Documents within a cluster should be similar, while documents from different clusters should be dissimilar. It is a form of unsupervised learning, using raw data without a priori classifications.
Applications of Clustering
Clustering has various applications, such as:
- Optimizing the placement of telephone towers.
- Strategic placement of patrol vans.
- Locating emergency-care wards.
Search Engine Uses
Search engines use clustering to:
- Improve search recall by grouping similar documents.
- Speed up vector space retrieval via a priori clustering.
- Create cleaner user interfaces.
- Automatically generate a thesaurus by clustering related terms.
Good Clustering
A good clustering exhibits:
- High intra-class similarity.
- Low inter-class similarity.
- Stability as the collection grows and robustness to small errors in the data.
- Independence from initial object ordering.
Clustering vs. Classification
Clustering differs from classification:
- Clustering groups objects to find relationships.
- Classification assigns new objects to predefined classes.
- Clustering often precedes classification.
Clustering Steps
- Cluster the documents.
- Label the clusters.
- Determine decision boundaries for new documents.
Key Considerations for Clustering
- Document representation (usually vector space).
- Similarity/distance computation (e.g., cosine similarity).
- Determining the number of clusters.
Hard vs. Soft Clustering
- Hard clustering: Each document belongs to one cluster.
- Soft clustering: Documents can belong to multiple clusters.
Similarity/Distance Measures
- Cosine similarity: Measures the cosine of the angle between vectors (0 to 1).
- Euclidean distance: A popular alternative.
Clustering Algorithms
Two main methodologies:
- Partitioning-based.
- Hierarchical.
K-Means Clustering
- Choose k random data items as means.
- Iteratively refine:
- Associate each item to the nearest cluster.
- Recalculate cluster means.
Optimal K-Means
- Minimize intra-class variance.
- NP-hard problem; approximation algorithms are used.
- Assignment step: Assign each observation to the cluster whose mean yields the least within-cluster sum of squares.
- Update step: Calculate the new means to be the centroids of the observations in the new clusters.
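In symbols, k-means searches for the partition that minimizes the within-cluster sum of squares. This formulation is standard; the symbols S_1, …, S_k (the clusters) and \mu_i (their means) are introduced here for illustration:

\underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2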
Approximation Algorithm
- Select K points as initial centroids.
- Repeat:
- Form K clusters by assigning each point to its closest centroid.
- Re-compute the centroid of each cluster.
- Until centroids do not change or M iterations reached.
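A minimal sketch of this loop in Python/NumPy (the function name, random initialization, and convergence test are illustrative, not from the source):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Basic k-means: X is an (n, m) array of n vectors of size m."""
    rng = np.random.default_rng(seed)
    # Select K points as initial centroids (chosen at random here).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: attach each point to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: re-compute the centroid of each cluster.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop when centroids do not change (or after max_iter iterations).
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```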
Centroids
If x ranges over the vectors in a cluster c, the centroid \mu(c) is their mean:
\mu(c) = \frac{1}{|c|} \sum_{x \in c} x
Distance Metrics
- Euclidean distance (L2 norm): L_2(x, y) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}
- L1 norm (Manhattan distance): L_1(x, y) = \sum_{i=1}^{m} |x_i - y_i|
- Cosine distance: 1 - \frac{x \cdot y}{\|x\| \, \|y\|} (one minus the cosine similarity).
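For reference, all three measures can be computed directly with NumPy; a small illustrative snippet (the sample vectors are arbitrary):

```python
import numpy as np

x, y = np.array([1.0, 2.0, 0.0]), np.array([2.0, 1.0, 1.0])

l2 = np.linalg.norm(x - y)          # Euclidean (L2) distance
l1 = np.abs(x - y).sum()            # L1 (Manhattan) distance
cos_sim = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
cos_dist = 1 - cos_sim              # cosine distance
```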
Adjustments to K-Means
- Multiple runs with different initial centroids.
- k-means++ algorithm (select distant points as initial centers).
- Termination conditions: fixed iterations, unchanged partition, or stable centroids.
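A sketch of the k-means++ seeding idea: the first center is a uniformly random data point, and each further center is drawn with probability proportional to its squared distance from the nearest center already chosen. The function name is assumed for illustration:

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # First center: a uniformly random data point.
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance of every point to its nearest chosen center.
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        # Sample the next center with probability proportional to d2.
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)
```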
Time Complexity of K-Means
O(iknm)
- m = vector size
- n = number of vectors
- k = number of clusters
- i = iterations
Evaluation Metrics for K-Means
- Inertia: Sum of squared distances of each point to its own cluster centroid, summed over all clusters.
- Dunn Index: Ratio of minimum inter-cluster distance to maximum intra-cluster distance.
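A rough sketch of both metrics. Note that the Dunn index has several variants; here inter-cluster distance is measured between centroids and intra-cluster distance as cluster diameter, which is one possible reading rather than the only one:

```python
import numpy as np
from itertools import combinations

def inertia(X, labels, centroids):
    # Sum of squared distances of points to their own cluster centroid.
    return sum(np.sum((X[labels == j] - c) ** 2) for j, c in enumerate(centroids))

def dunn_index(X, labels, centroids):
    k = len(centroids)
    # Minimum distance between any two cluster centroids (inter-cluster).
    min_inter = min(np.linalg.norm(centroids[a] - centroids[b])
                    for a, b in combinations(range(k), 2))
    # Maximum intra-cluster distance (cluster diameter).
    max_intra = max(
        max((np.linalg.norm(p - q) for p, q in combinations(X[labels == j], 2)),
            default=0.0)
        for j in range(k)
    )
    return min_inter / max_intra
```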
K-Means Difficulties
- Varying cluster sizes and densities.
- Non-globular shapes.
- Outliers.
Choosing the Right Number of Clusters
Use the elbow method: plot inertia against the number of clusters and choose the value at the "elbow", beyond which adding further clusters yields only diminishing reductions in inertia.
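A common way to produce the elbow plot, sketched here with scikit-learn (assumed to be available; `inertia_` is the attribute scikit-learn exposes for the within-cluster sum of squares):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def elbow_plot(X, max_k=10):
    inertias = []
    for k in range(1, max_k + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        inertias.append(km.inertia_)
    plt.plot(range(1, max_k + 1), inertias, marker="o")
    plt.xlabel("number of clusters k")
    plt.ylabel("inertia")
    plt.show()
```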
Evaluating K-Means Clusters
Sum of Squared Error (SSE):
SSE = \sum_{i=1}^{k} \sum_{x \in C_i} \mathrm{dist}(x, m_i)^2, where m_i is the centroid of cluster C_i.
Limitations of K-Means
- Problems with varying sizes, densities and non-globular shapes.
- Outliers.
Hierarchical Clustering
- Agglomerative (bottom-up): merge closest clusters.
- Divisive (top-down): split a cluster.
Agglomerative Clustering Algorithm
- Compute the distance matrix.
- Let each data point be a cluster.
- Repeat:
- Merge the two closest clusters.
- Update the distance matrix.
- Until a single cluster remains.
Distance Between Two Clusters
- Center of Gravity: Distance between centroids.
- Average Link: Average distance between all pairs of points.
- Single Link: Distance between the two closest points.
- Complete Link: Distance between the furthest points.
Dendrogram
A tree diagram illustrating the arrangement of clusters: each merge appears as a join, and cutting the tree at a chosen level yields a flat clustering.
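A short sketch using SciPy (assumed to be available) that builds an agglomerative clustering and draws the dendrogram; the `method` argument selects the linkage criterion listed above:

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

X = np.random.rand(10, 2)       # toy data: 10 points in 2-D

# 'single' = single link, 'complete' = complete link,
# 'average' = average link, 'centroid' = center of gravity.
Z = linkage(X, method="average")

dendrogram(Z)                   # tree diagram of the merge order
plt.show()
```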
Hierarchical Agglomerative Clustering (HAC)
- Starts with unclustered data and forms larger clusters.
- Results in a hierarchy of clusters (dendrogram).
HAC Procedure
- Place each document in its own cluster.
- Compute pairwise document similarity.
- Form a new cluster by combining the most similar pair.
- Update the similarity matrix.
- Repeat until one cluster is left.
General HAC Algorithm and Complexity
- Compute similarity between all pairs of documents.
- Do N – 1 times
- Find closest pair of documents/clusters to merge.
- Update similarity of all documents/clusters to new cluster.
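A naive implementation of this loop makes the cost visible: the initial distance matrix is O(N²), and each of the N − 1 merges rescans it, giving roughly O(N³) overall (about O(N² log N) with a priority queue). The sketch below is single link and purely illustrative; names and the matrix-update trick are not from the source:

```python
import numpy as np

def hac_single_link(X):
    """Naive single-link HAC over an (n, m) data matrix; returns merge order."""
    n = len(X)
    # Compute the distance matrix between all documents.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)
    active = set(range(n))          # ids of clusters still unmerged
    merges = []
    # Do N - 1 merges.
    for _ in range(n - 1):
        # Find the closest pair of active clusters.
        ids = sorted(active)
        sub = D[np.ix_(ids, ids)]
        i, j = np.unravel_index(np.argmin(sub), sub.shape)
        a, b = ids[i], ids[j]
        merges.append((a, b, D[a, b]))
        # Single-link update: distance to the merged cluster is the minimum
        # of the distances to its two parts (kept under id a).
        D[a, :] = np.minimum(D[a, :], D[b, :])
        D[:, a] = D[a, :]
        D[a, a] = np.inf
        # Retire cluster b.
        active.discard(b)
        D[b, :] = np.inf
        D[:, b] = np.inf
    return merges
```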
Divisive Clustering Algorithm
- Start with all documents in one cluster.
- Split the cluster using a partitioning algorithm (e.g., k-means).
- Apply recursively until each document is in its own cluster.
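One common realisation of this idea is bisecting k-means: repeatedly split the largest remaining cluster with 2-means. A rough sketch (using scikit-learn's KMeans, assumed available; it stops at a target number of clusters rather than going all the way down to singletons):

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, num_clusters):
    """Top-down clustering: repeatedly split the largest cluster with 2-means."""
    clusters = [np.arange(len(X))]              # start with one big cluster
    while len(clusters) < num_clusters:
        # Take the largest remaining cluster ...
        biggest = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        idx = clusters.pop(biggest)
        # ... and split it in two with k-means (k = 2).
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[idx])
        clusters.append(idx[labels == 0])
        clusters.append(idx[labels == 1])
    return clusters                              # list of index arrays
```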
How to Label Clusters
- Show titles of typical documents.
- Show words/phrases prominent in the cluster.
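A common heuristic for the second option is to take the highest-weighted terms in each cluster centroid of a tf-idf representation. A sketch with scikit-learn (assumed available; the toy documents are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["stocks fall on rate fears", "central bank raises rates",
        "team wins league title", "striker scores twice in final"]

vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

terms = vec.get_feature_names_out()
for j, centroid in enumerate(km.cluster_centers_):
    top = centroid.argsort()[::-1][:3]           # 3 most prominent terms
    print(f"cluster {j}: " + ", ".join(terms[t] for t in top))
```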