Data Mining - Clustering Basics and Partitioning Clustering

Introduction to Clustering

  • Clustering is a widely used technique in exploratory data analysis (EDA), alongside regression and classification.
  • It has been a subfield of statistics since the 1950s.
  • Its popularity has increased with the rise of big data.
  • It is now used in computer science and data science, not just statistics.

The Subjectivity of Clustering

  • Clustering is not a well-defined problem with a single solution.
  • Definitions of clustering vary; for example:
    • Finding groups where objects within a group are similar and different from objects in other groups (Tan et al., 2006).
    • Partitioning observations into distinct groups based on similarity within groups and dissimilarity between groups (James et al., 2013).
  • The terms "similar" and "related" are subjective, making the problem complex.
  • This subjectivity leads to many different clustering approaches and algorithms.
  • To solve a clustering task, the problem needs to be defined objectively.

Mathematical Perspectives on Clustering

  • Heterogeneity (separability): Inter-cluster distances are maximized.
  • Homogeneity (internal cohesion): Intra-cluster distances are minimized.
  • Another perspective:
    • A cluster is a set of points in space where the distance between any two points in the cluster is less than the distance between a point in the cluster and a point outside it (Everitt, 1974).

Clustering in Practice

  • Clustering is a powerful tool for data pre-processing and exploration.
  • It is used in initial data analysis to understand data structures before applying other techniques.
  • It allows for data organization, visualization, and summarization.
  • Clustering can also be the final goal of analysis in various applications.

Applications of Clustering

  • Clustering is used in many fields, including:
    • Biology, Health, and Medicine
    • Psychology
    • Business and Finance
    • Computing Science
    • Engineering

Application Examples

  • Marketing: Customer segmentation.
  • Bioinformatics: Grouping of genes, tissue types, or cell types.
  • Astronomy: Grouping of stars or galaxies.
  • Image Processing: Image or pixel segmentation.
  • Text Mining: Automatic document categorization.

Cluster Analysis Process

  • Data -> Data Representation -> Choice, Configuration, and Application of Clustering Tools -> Clustering Evaluation and Validation -> Cluster Interpretation -> Meaningful and Useful Findings (Domain Expert).
  • Note:
    • This process is iterative (EDA).
    • Clustering is subjective; attempts to make it objective rely on assumptions that may need revisiting.
    • Each clustering algorithm makes the problem objective based on its own assumptions.

References (Part 1)

  • Jain, A. K. and Dubes, R. C., Algorithms for Clustering Data, Prentice Hall, 1988.
  • Everitt, B. S., Cluster Analysis, John Wiley and Sons, 1974.
  • Everitt, B. S., Landau, S., and Leese, M., Cluster Analysis, Arnold, 4th Edition, 2001.
  • Tan, P.-N., Steinbach, M., and Kumar, V., Introduction to Data Mining, Addison-Wesley, 2006.
  • Tan, P.-N., Steinbach, M., and Kumar, V., Introduction to Data Mining, 2nd Edition, Pearson, 2018.
  • James, G., Witten, D., Hastie, T., and Tibshirani, R., An Introduction to Statistical Learning with Applications in R, Springer, 2013.
  • Wu, X. and Kumar, V. (Editors), The Top Ten Algorithms in Data Mining, CRC Press, 2009.

Partitioning Clustering and k-Means

  • Part 2 focuses on partitioning clustering and the k-Means algorithm.

Data Partition

  • Given a dataset $X = \{x_1, x_2, \dots, x_N\}$ consisting of $N$ observations, a hard (non-overlapping) clustering partition of the data is a collection $C = \{C_1, C_2, \dots, C_k\}$ of $k$ disjoint subsets, called clusters, such that:
    • $C_1 \cup C_2 \cup \dots \cup C_k = X$
    • $C_i \neq \emptyset \ \forall i$
    • $C_i \cap C_j = \emptyset \ \forall i \neq j$
  • For example: $P = \{ \{x_1\}, \{x_3, x_4, x_6\}, \{x_2, x_5\} \}$
  • One way to represent a clustering partition is a partition matrix.

Partition Matrix

  • A matrix with $k$ rows (number of clusters) and $N$ columns (number of observations), where $\mu_{ij}$ represents the membership degree of the $j^{th}$ observation ($x_j$) in the $i^{th}$ cluster ($C_i$).
  • In hard partitioning, each observation belongs to exactly one cluster:
    • $\mu_{ij} \in \{0, 1\}$
    • $\sum_{i=1}^{k} \mu_{ij} = 1 \ \forall j$
  • The partition matrix is represented as:


U_X = \begin{bmatrix}
\mu_{11} & \mu_{12} & \dots & \mu_{1N} \\
\mu_{21} & \mu_{22} & \dots & \mu_{2N} \\
\vdots & \vdots & \ddots & \vdots \\
\mu_{k1} & \mu_{k2} & \dots & \mu_{kN}
\end{bmatrix}

  • Example:
    • $P = \{ \{x_1\}, \{x_3, x_4, x_6\}, \{x_2, x_5\} \}$


U_X = \begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 1 & 0 & 1 \\
0 & 1 & 0 & 0 & 1 & 0
\end{bmatrix}
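
To make the representation concrete, here is a minimal NumPy sketch that builds the partition matrix for the example above (the helper name `partition_matrix` is ours, not from the references):

```python
import numpy as np

def partition_matrix(labels: np.ndarray, k: int) -> np.ndarray:
    """Build the k x N hard partition matrix U_X from a label vector,
    where labels[j] is the index of the cluster x_j belongs to."""
    N = labels.shape[0]
    U = np.zeros((k, N), dtype=int)
    U[labels, np.arange(N)] = 1  # mu_ij = 1 iff x_j belongs to C_i
    return U

# P = { {x_1}, {x_3, x_4, x_6}, {x_2, x_5} }: cluster indices of x_1..x_6
labels = np.array([0, 2, 1, 1, 2, 1])
print(partition_matrix(labels, k=3))  # reproduces the matrix above
```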

Non-Overlapping Partitioning Clustering

  • Hard (non-overlapping) partitioning clustering refers to methods that seek a hard partition of a dataset $X$, viewed as a collection of objects/observations to be clustered.
  • Finding a hard partition matrix $U(X)$ is equivalent to subdividing the set $X = \{x_1, x_2, \dots, x_N\}$ of $N$ objects into a collection $C = \{C_1, C_2, \dots, C_k\}$ of $k$ disjoint clusters $C_i$ such that $C_1 \cup C_2 \cup \dots \cup C_k = X$, $C_i \neq \emptyset$, and $C_i \cap C_j = \emptyset \ \forall i \neq j$.

K-Means

  • K-Means is a widely known clustering algorithm.
  • It minimizes within-cluster distances and, equivalently (since the total scatter is fixed), maximizes between-cluster separation.
  • The classic version assumes quantitative observations and uses squared Euclidean distance.
  • Clusters are represented by their multivariate mean or centroid (spatial center).
  • K-Means is a prototype-based clustering algorithm.
  • It finds a non-overlapping partition of data into disjoint subsets (clusters).

Historical Context of K-Means

  • MacQueen (1967) is often cited as the original discoverer.
  • However, K-means was independently discovered by Steinhaus (1956), Lloyd (1957, published in 1982), Ball & Hall (1965), and MacQueen (1967) in different fields.

Practical Use of K-Means

  • Despite being developed in the 1950s/60s, it is still widely used because:
    • It is simple and intuitive.
    • It is computationally fast and can be optimized.
    • It serves as a basis for more sophisticated algorithms.

Basic K-Means Algorithm

  1. Choose $k$ initial cluster prototypes (e.g., $k$ observations drawn from the data, or $k$ random points).
  2. Assign each observation to the closest prototype (using squared Euclidean distance).
  3. Update each cluster prototype as the centroid of the corresponding cluster.
  4. Iterate steps 2 and 3 until no changes occur or until a stopping criterion is met (e.g., maximum number of iterations).
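
A minimal NumPy sketch of these four steps (our own naming; initialization uses $k$ observations drawn from the data, and an empty cluster simply keeps its previous prototype, one common convention):

```python
import numpy as np

def kmeans(X, k, max_iter=100, rng=None):
    """Basic k-Means on an (N, n) data array X; returns (centroids, labels)."""
    if rng is None:
        rng = np.random.default_rng(0)
    # Step 1: choose k observations from the data as initial prototypes.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # Step 2: assign each observation to its closest prototype
        # (squared Euclidean distances, shape (N, k)).
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Step 3: update prototypes as centroids of their clusters.
        new_centroids = centroids.copy()
        for c in range(k):
            members = X[labels == c]
            if len(members) > 0:  # empty cluster keeps its old prototype
                new_centroids[c] = members.mean(axis=0)
        # Step 4: stop when the prototypes no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```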

K-Means as an Optimization Problem

  • Objective function to be minimized:

SSE = Sum of Squared Errors (the sum of within-cluster squared distances):

J = \sum_{c=1}^{k} \sum_{x_j \in C_c} d^2(x_j, \bar{x}_c)

where

  • $d$ is the Euclidean distance, and
  • $\bar{x}_c$ is the prototype (centroid) of cluster $C_c$, whose cardinality is $|C_c|$.
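
Given centroids and cluster labels, $J$ is essentially a one-liner (a sketch; the name `sse` is ours):

```python
import numpy as np

def sse(X, centroids, labels):
    """Sum of squared Euclidean errors J of a hard partition."""
    return float(((X - centroids[labels]) ** 2).sum())
```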

K-Means Optimization Steps

  • The SSE can be rewritten using a partition matrix as:

J = \sum_{c=1}^{k} \sum_{j=1}^{N} \mu_{cj} \, \|x_j - x_c\|^2

where

\mu_{cj} \in \{0, 1\}, \quad \sum_{c=1}^{k} \mu_{cj} = 1 \ \forall j

  • We want to minimize $J$ with respect to the prototypes $\{x_c\}$ and the memberships $\{\mu_{cj}\}$.
  • Steps (alternating optimization):
    • Fix $\{x_c\}$ and minimize $J$ with respect to $\{\mu_{cj}\}$ (E-step: Expectation).
    • Fix $\{\mu_{cj}\}$ and minimize $J$ with respect to $\{x_c\}$ (M-step: Maximization).

E-step

  • Fix $\{x_c\}$ and minimize $J$ w.r.t. $\{\mu_{cj}\}$.
  • For fixed prototypes, $J$ decomposes over the observations, so each term can be minimized over $j$ independently.
  • Set $\mu_{cj} = 1$ for the $c$ that yields the smallest squared error $\|x_j - x_c\|^2$, and $\mu_{cj} = 0$ otherwise.
  • In other words, assign each observation to the closest cluster prototype.

M-step

  • Fix $\{\mu_{cj}\}$ and minimize $J$ with respect to $\{x_c\}$.
  • Take the gradient of $J$ with respect to each $x_c$ and set it to zero to solve for $x_c$:

\nabla_{x_c} J = -2 \sum_{j=1}^{N} \mu_{cj} (x_j - x_c) = 0,

therefore,

x_c = \frac{\sum_{j=1}^{N} \mu_{cj} x_j}{\sum_{j=1}^{N} \mu_{cj}} = \frac{\sum_{x_j \in C_c} x_j}{|C_c|}

  • $x_c$ is the centroid of cluster $C_c$.
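
Both steps translate almost directly into NumPy in the partition-matrix notation (a sketch under our own naming; empty clusters are merely guarded against division by zero):

```python
import numpy as np

def em_iteration(X, centroids):
    """One E-step and M-step of k-Means using the membership matrix U."""
    N, k = len(X), len(centroids)
    # E-step: mu_cj = 1 for the closest prototype c, 0 otherwise.
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)  # (N, k)
    U = np.zeros((k, N))
    U[d2.argmin(axis=1), np.arange(N)] = 1.0
    # M-step: x_c = sum_j mu_cj x_j / sum_j mu_cj, i.e. the cluster centroid.
    sizes = U.sum(axis=1, keepdims=True)          # |C_c| per cluster
    centroids = (U @ X) / np.maximum(sizes, 1.0)  # avoid division by zero
    return U, centroids
```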

K-Means Limitations

  • K-Means is guaranteed to converge, but only to a locally optimal solution, not a globally optimal one.
  • The final solution depends on initial prototypes.

Initialization Strategies

  • Multiple Random Initializations:
    • Run k-Means multiple times with different random initial prototypes.
    • Select the best k-Means result according to a criterion (e.g. SSE for fixed k).
    • Works well in many problems.
    • May require many runs for complex problems.
  • Improved Versions of Random Initializations:
    • k-Means++: the first center is chosen uniformly at random from the data; each subsequent center is then sampled from the data with probability proportional to its squared distance to the closest already-chosen center.
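
A sketch of the k-Means++ seeding rule just described (D² sampling; the helper name is ours):

```python
import numpy as np

def kmeans_pp_init(X, k, rng=None):
    """k-Means++ seeding: sample initial centers by D^2 weighting."""
    if rng is None:
        rng = np.random.default_rng(0)
    centers = [X[rng.integers(len(X))]]  # first center: uniformly at random
    while len(centers) < k:
        # squared distance of each point to its closest chosen center
        d2 = (((X[:, None, :] - np.array(centers)[None, :, :]) ** 2)
              .sum(axis=2)
              .min(axis=1))
        probs = d2 / d2.sum()  # proportional to squared distance
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)
```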

Efficient Implementations for Big Data

  • Computational performance can be improved using:
    • Specialized Data Structures (e.g. kd-trees).
    • Smart Algorithms (e.g. Triangle Inequality).
    • Recursive Update of Centroids:
      • New centroids depend on previous values, objects that changed clusters, and cluster sizes.
      • They don’t need to be recomputed from scratch each iteration (see the sketch after this list).
    • Parallelization (next).
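
For instance, when a single observation moves between clusters, the two affected centroids can be updated in $O(n)$ from their previous values and the cluster sizes (a sketch with our own names; assumes the source cluster keeps at least one member):

```python
import numpy as np

def move_point(x, centroids, sizes, src, dst):
    """Incrementally update centroids when observation x leaves cluster
    src and joins cluster dst; no sums are recomputed from scratch."""
    centroids[src] = (sizes[src] * centroids[src] - x) / (sizes[src] - 1)
    centroids[dst] = (sizes[dst] * centroids[dst] + x) / (sizes[dst] + 1)
    sizes[src] -= 1
    sizes[dst] += 1
```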

Parallel / Distributed K-Means

  • Data are distributed across multiple data sites or processors.
  • Algorithm:
    1. Distribute the same initial prototypes to every data site.
    2. Each site runs one iteration of k-Means on its share of data in parallel.
    3. Local prototypes and cluster sizes are communicated to a central node.
    4. Global prototypes are computed (exactly) and retransmitted to all data sites.
    5. Repeat for multiple k-Means iterations.
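
Step 4 reduces to a size-weighted average: because each local prototype is the mean of that site's members, the exact global centroid can be recovered from local prototypes and cluster sizes alone (a sketch, our names):

```python
import numpy as np

def global_prototypes(local_protos, local_sizes):
    """Combine per-site k-Means results at the central node.
    local_protos: list of (k, n) arrays, one per data site.
    local_sizes:  list of (k,) arrays of local cluster sizes."""
    weighted = sum(n[:, None] * p for p, n in zip(local_protos, local_sizes))
    total = sum(local_sizes)  # (k,) global cluster sizes
    return weighted / np.maximum(total[:, None], 1)  # exact global centroids
```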

K-Means Summary

Pros:
  • Simple and Intuitive.
  • Linear computational complexity: $O(N \cdot n \cdot k)$ per iteration, for $N$ observations, $n$ variables, and $k$ clusters.
  • Effective in many scenarios, producing interpretable results.
  • Well-known and widely used.
Cons:
  • Choosing $k$ is difficult.
  • Sensitive to prototype initialization (local, suboptimal solutions).
  • Works best with compact, globular clusters.
  • Applicable to real-valued variables only.
  • Sensitive to outliers.

Hyper-Parameter Selection in k-Means

  • Multiple Runs within a Range of $k$:
    • Run k-Means repeatedly for different values of $k \in [k_{min}, k_{max}]$ and different random initial prototypes.
    • Select the best result (partition) according to some goodness-of-fit criterion.
    • Pros: it can estimate $k$ while being less sensitive to local minima.
    • Cons: Increased computational cost.

SSE as a Goodness-of-Fit Criterion

  • Can the algorithm’s objective function (SSE) be used to choose the best k-Means solution?
    • Yes, IF they all have the same number of clusters (i.e., $k$ is fixed).
    • However, if the number of clusters is unknown, the SSE tends to decrease monotonically as $k$ increases.

How to Determine k?

  • For a fixed $k$, choose the solution with the minimum SSE from multiple runs.
  • For variable $k$, the SSE decreases as $k$ increases, so the minimum-SSE rule no longer applies.
  • Instead, plot the SSE for each $k$ and look for a prominent "elbow" in the plot.
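
Putting the pieces together, an elbow plot might be produced as follows (a sketch reusing the hypothetical `kmeans` and `sse` helpers from earlier; for each $k$ we keep the best of several random restarts):

```python
import numpy as np
import matplotlib.pyplot as plt

X = np.random.default_rng(42).normal(size=(300, 2))  # placeholder (N, n) data

ks = range(1, 11)
best_sse = []
for k in ks:
    # multiple random initializations for each k; keep the minimum SSE
    runs = [kmeans(X, k, rng=np.random.default_rng(seed)) for seed in range(10)]
    best_sse.append(min(sse(X, c, l) for c, l in runs))

plt.plot(list(ks), best_sse, marker="o")
plt.xlabel("k")
plt.ylabel("SSE")
plt.show()  # look for a prominent elbow
```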

Clustering Validation

  • The procedure above works when clusters are well separated.
  • In practice, clustering is often more complex.
  • Clustering validation involves quantitative evaluation, model selection, and statistical validation of clustering results.

Silhouette Width Criterion (SWC)

  • SWC = Average of the Silhouette of Individual Observations:

Silhouette of the $i^{th}$ observation:

s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}

  • $a(i)$: average distance from the $i^{th}$ observation to the other observations in its cluster.
  • $b(i)$: average distance from the $i^{th}$ observation to the observations in the nearest neighboring cluster.

Simplified Version:

  • $a(i)$ is the distance from the $i^{th}$ observation to its cluster centroid.
  • $b(i)$ is the distance from the $i^{th}$ observation to the centroid of the nearest cluster (other than the one the $i^{th}$ observation belongs to).

SWC Range:

SWC \in [-1, +1], \qquad SWC = \frac{1}{N} \sum_{i=1}^{N} s(i) \quad (s(i) := 0 \text{ for singletons})

Simplified Silhouette Criterion (Illustration)

  • [Figure: worked example of $s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}$ with the centroid-based $a(i)$ and $b(i)$ defined above.]
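
A sketch of the simplified version (centroid-based distances; assumes $k \geq 2$, omits the singleton convention of the full criterion, and uses a small epsilon to guard the degenerate $a(i) = b(i) = 0$ case):

```python
import numpy as np

def simplified_swc(X, centroids, labels):
    """Simplified Silhouette Width Criterion: a(i), b(i) are distances
    to centroids instead of average distances to cluster members."""
    d = np.sqrt(((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2))
    idx = np.arange(len(X))
    a = d[idx, labels]             # distance to own centroid
    d_other = d.copy()
    d_other[idx, labels] = np.inf  # mask the observation's own cluster
    b = d_other.min(axis=1)        # distance to nearest other centroid
    s = (b - a) / np.maximum(np.maximum(a, b), 1e-12)
    return float(s.mean())         # SWC in [-1, +1]
```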

References (Part 2)

  • Steinley, D., K-Means Clustering: A Half-Century Synthesis, British Journal of Mathematical and Statistical Psychology, Vol. 59, 2006.
  • Jain, A. K., Data Clustering: 50 Years Beyond K-Means, Pattern Recognition Letters, 2010.
  • Tan, P.-N., Steinbach, M., and Kumar, V., Introduction to Data Mining, 2nd Edition, Pearson, 2018.
  • James, G., Witten, D., Hastie, T., and Tibshirani, R., An Introduction to Statistical Learning with Applications in R, Springer, 2013.
  • Liu, C. L., Introduction to Combinatorial Mathematics, McGraw-Hill, 1968.