Data Mining - Clustering Basics and Partitioning Clustering
Introduction to Clustering
- Clustering is a widely used technique in exploratory data analysis (EDA), alongside regression and classification.
- It has been a subfield of statistics since the 1950s.
- Its popularity has increased with the rise of big data.
- It is now used in computer science and data science, not just statistics.
The Subjectivity of Clustering
- Clustering is not a well-defined problem with a single solution.
- Definitions of clustering vary; for example:
- Finding groups where objects within a group are similar and different from objects in other groups (Tan et al., 2006).
- Partitioning observations into distinct groups based on similarity within groups and dissimilarity between groups (James et al., 2013).
- The terms "similar" and "different" are subjective, making the problem ill-defined.
- This subjectivity leads to many different clustering approaches and algorithms.
- To solve a clustering task, the problem needs to be defined objectively.
Mathematical Perspectives on Clustering
- Heterogeneity (separability): Inter-cluster distances are maximized.
- Homogeneity (internal cohesion): Intra-cluster distances are minimized.
- Another perspective:
- A cluster is a set of points in space where the distance between any two points in the cluster is less than the distance between a point in the cluster and a point outside it (Everitt, 1974).
Clustering in Practice
- Clustering is a powerful tool for data pre-processing and exploration.
- It is used in initial data analysis to understand data structures before applying other techniques.
- It allows for data organization, visualization, and summarization.
- Clustering can also be the final goal of analysis in various applications.
Applications of Clustering
- Clustering is used in many fields, including:
- Biology, Health, and Medicine
- Psychology
- Business and Finance
- Computing Science
- Engineering
Application Examples
- Marketing: Customer segmentation.
- Bioinformatics: Grouping of genes, tissue types, or cell types.
- Astronomy: Grouping of stars or galaxies.
- Image Processing: Image or pixel segmentation.
- Text Mining: Automatic document categorization.
Cluster Analysis Process
- Data -> Data Representation -> Choice, Configuration, and Application of Clustering Tools -> Clustering Evaluation and Validation -> Cluster Interpretation -> Meaningful and Useful Findings (Domain Expert).
- Note:
- This process is iterative (EDA).
- Clustering is subjective; attempts to make it objective rely on assumptions that may need revisiting.
- Each clustering algorithm makes the problem objective based on its own assumptions.
References (Part 1)
- Jain, A. K. and Dubes, R. C., Algorithms for Clustering Data, Prentice Hall, 1988.
- Everitt, B. S., Cluster Analysis, John Wiley and Sons, 1974.
- Everitt, B. S., Landau, S., and Leese, M., Cluster Analysis, 4th Edition, Arnold, 2001.
- Tan, P.-N., Steinbach, M., and Kumar, V., Introduction to Data Mining, Addison-Wesley, 2006.
- Tan, P.-N., Steinbach, M., and Kumar, V., Introduction to Data Mining, 2nd Edition, Pearson, 2018.
- James, G., Witten, D., Hastie, T., and Tibshirani, R., An Introduction to Statistical Learning with Applications in R, Springer, 2013.
- Wu, X. and Kumar, V. (Editors), The Top Ten Algorithms in Data Mining, CRC Press, 2009.
Partitioning Clustering and k-Means
- Part 2 focuses on partitioning clustering and the k-Means algorithm.
Data Partition
- Given a dataset $X = \{x_1, x_2, \dots, x_N\}$ of $N$ observations, a hard (non-overlapping) clustering partition of the data is a collection of $k$ disjoint, non-empty subsets $C_1, C_2, \dots, C_k$, called clusters, such that $\bigcup_{i=1}^{k} C_i = X$ and $C_i \cap C_j = \emptyset$ for $i \neq j$.
- For example, $X = \{x_1, \dots, x_6\}$ can be partitioned into $C_1 = \{x_1\}$, $C_2 = \{x_2, x_5\}$, and $C_3 = \{x_3, x_4, x_6\}$.
- One way to represent a clustering partition is a partition matrix.
Partition Matrix
- A $k \times N$ matrix $U_X = [\mu_{ij}]$, with $k$ rows (number of clusters) and $N$ columns (number of observations), where $\mu_{ij}$ represents the membership degree of observation $x_j$ ($j = 1, \dots, N$) to cluster $C_i$ ($i = 1, \dots, k$).
- In hard partitioning, each observation belongs to exactly one cluster: $\mu_{ij} \in \{0, 1\}$ and $\sum_{i=1}^{k} \mu_{ij} = 1$ for every $j$.
- The partition matrix is represented as:
U_X = \begin{bmatrix}
\mu_{11} & \mu_{12} & \dots & \mu_{1N} \\
\mu_{21} & \mu_{22} & \dots & \mu_{2N} \\
\vdots & \vdots & \ddots & \vdots \\
\mu_{k1} & \mu_{k2} & \dots & \mu_{kN}
\end{bmatrix}
- Example:
U_X = \begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 1 & 0 \\
0 & 0 & 1 & 1 & 0 & 1
\end{bmatrix}
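For illustration, the example above can be reproduced in code. A minimal NumPy sketch (the function name and the label encoding are ours, not from the source):

```python
import numpy as np

def hard_partition_matrix(labels, k):
    """Build the k x N hard partition matrix U from a vector of cluster labels."""
    N = len(labels)
    U = np.zeros((k, N), dtype=int)
    U[labels, np.arange(N)] = 1  # mu_ij = 1 iff observation j belongs to cluster i
    return U

# Labels matching the example above: x1 in C1; x2 and x5 in C2; x3, x4, x6 in C3
labels = np.array([0, 1, 2, 2, 1, 2])
print(hard_partition_matrix(labels, k=3))
```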
Non-Overlapping Partitioning Clustering
- Hard or non-overlapping partitioning clustering refers to methods that seek a hard partition of a dataset $X$, viewed as a collection of objects/observations to be clustered.
- Finding a hard partition matrix $U_X$ is equivalent to subdividing the set $X$ with $N$ objects into a collection of $k$ disjoint clusters $C_1, \dots, C_k$ such that $\bigcup_{i=1}^{k} C_i = X$, $C_i \neq \emptyset$ for all $i$, and $C_i \cap C_j = \emptyset$ for $i \neq j$.
K-Means
- K-Means is a widely known clustering algorithm.
- It seeks compact, well-separated clusters: within-cluster distances are minimized, which in turn tends to maximize between-cluster separation.
- The classic version assumes quantitative observations and uses squared Euclidean distance.
- Clusters are represented by their multivariate mean or centroid (spatial center).
- K-Means is a prototype-based clustering algorithm.
- It finds a non-overlapping partition of data into disjoint subsets (clusters).
Historical Context of K-Means
- MacQueen (1967) is often cited as the original discoverer.
- However, K-means was independently discovered by Steinhaus (1956), Lloyd (1957, published in 1982), Ball & Hall (1965), and MacQueen (1967) in different fields.
Practical Use of K-Means
- Despite being developed in the 1950s/60s, it is still widely used because:
- It is simple and intuitive.
- It is computationally fast and can be optimized.
- It serves as a basis for more sophisticated algorithms.
Basic K-Means Algorithm
1. Choose initial cluster prototypes (using observations from the data or random points).
2. Assign each observation to the closest prototype (using the squared Euclidean distance).
3. Update the cluster prototypes as the centroids of the corresponding clusters.
4. Iterate steps 2 and 3 until no changes occur or until a stopping criterion is met (e.g. a maximum number of iterations).
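The four steps translate directly into code. Below is a minimal NumPy sketch under the classic assumptions (quantitative data, squared Euclidean distance); the function and variable names are illustrative, not a reference implementation:

```python
import numpy as np

def kmeans(X, k, max_iter=100, rng=None):
    """Basic k-Means on an (N, d) data array; returns (labels, centroids)."""
    rng = np.random.default_rng(rng)
    # Step 1: choose initial prototypes as k distinct observations from the data
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each observation to the closest prototype
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Step 3: update prototypes as centroids of the corresponding clusters
        new_centroids = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        # Step 4: stop when the prototypes no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```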
K-Means as an Optimization Problem
- Objective function to be minimized: the SSE (Sum of Squared Errors, i.e. the total within-cluster variation):
\mathrm{SSE} = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - c_i \rVert^2
where
- $\lVert \cdot \rVert$ is the Euclidean distance (norm), and
- $c_i$ is the prototype of cluster $C_i$ (with cardinality $N_i = |C_i|$).
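Given assignments and prototypes, the objective is a one-liner. A NumPy sketch consistent with the k-Means sketch above (names are ours):

```python
import numpy as np

def sse(X, labels, centroids):
    """Sum of squared errors: squared distance of each observation to its cluster centroid."""
    return float(((X - centroids[labels]) ** 2).sum())
```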
K-Means Optimization Steps
- The SSE can be rewritten using a partition matrix as:
\mathrm{SSE}(U, C) = \sum_{i=1}^{k} \sum_{j=1}^{N} \mu_{ij}\, \lVert x_j - c_i \rVert^2
where $U = [\mu_{ij}]$ is the hard partition matrix and $C = \{c_1, \dots, c_k\}$ is the set of cluster prototypes.
- We want to minimize $\mathrm{SSE}(U, C)$ with respect to both $U$ and $C$.
- Steps:
- Fix $C$ and minimize with respect to $U$ (E-step: Expectation).
- Fix $U$ and minimize with respect to $C$ (M-step: Maximization).
E-step
- Fix $C$ and minimize $\mathrm{SSE}(U, C)$ w.r.t. $U$.
- Minimize each summation component over $j$ independently.
- Set $\mu_{ij} = 1$ for the $i$ that provides the smallest squared error $\lVert x_j - c_i \rVert^2$, and $\mu_{ij} = 0$ otherwise.
- That is, assign each observation to the cluster whose prototype is closest.
M-step
- Minimize $\mathrm{SSE}(U, C)$ with respect to $C$, fixing $U$.
- Take the gradient of the SSE with respect to each $c_i$ and set it to zero to solve for $c_i$:
\frac{\partial\, \mathrm{SSE}}{\partial c_i} = -2 \sum_{j=1}^{N} \mu_{ij} (x_j - c_i) = 0
therefore
c_i = \frac{\sum_{j=1}^{N} \mu_{ij}\, x_j}{\sum_{j=1}^{N} \mu_{ij}} = \frac{1}{N_i} \sum_{x_j \in C_i} x_j
- $c_i$ is the centroid of cluster $C_i$.
K-Means Limitations
- K-Means is only guaranteed to converge to a locally optimal solution, not a global one.
- The final solution depends on initial prototypes.
Initialization Strategies
- Multiple Random Initializations:
- Run k-Means multiple times with different random initial prototypes.
- Select the best k-Means result according to a criterion (e.g. SSE for fixed k).
- Works well in many problems.
- May require many runs for complex problems.
- Improved Versions of Random Initializations:
- KMeans++: The first center is chosen randomly, then subsequent centers are chosen with probability proportional to the squared distance from the point's closest existing center.
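A minimal sketch of this seeding rule, following the NumPy conventions of the k-Means sketch above (names are illustrative):

```python
import numpy as np

def kmeanspp_init(X, k, rng=None):
    """k-means++ seeding: new centers are sampled with probability
    proportional to the squared distance to the closest existing center."""
    rng = np.random.default_rng(rng)
    centers = [X[rng.integers(len(X))]]  # first center: uniformly at random
    for _ in range(k - 1):
        C = np.array(centers)
        # squared distance from each point to its closest existing center
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])  # D^2 weighting
    return np.array(centers)
```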
Efficient Implementations for Big Data
- Computational performance can be improved using:
- Specialized Data Structures (e.g. kd-trees).
- Smart Algorithms (e.g. Triangle Inequality).
- Recursive Update of Centroids:
- New centroids depend on previous values, objects that changed clusters, and cluster sizes.
- They don’t need to be recomputed from scratch each iteration (see the update formulas after this list).
- Parallelization (next).
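As a concrete instance of the recursive update (a standard identity, stated in the notation of the M-step above): if a single object $x$ moves from cluster $i$ to cluster $j$, only the two affected centroids need updating:
c_i \leftarrow \frac{N_i\, c_i - x}{N_i - 1}, \qquad c_j \leftarrow \frac{N_j\, c_j + x}{N_j + 1}
where $N_i$ and $N_j$ are the cluster sizes before the move.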
Parallel / Distributed K-Means
- Data are distributed across multiple data sites or processors.
- Algorithm:
- Distribute the same initial prototypes to every data site.
- Each site runs one iteration of k-Means on its share of data in parallel.
- Local prototypes and cluster sizes are communicated to a central node.
- Global prototypes are computed (exactly) and retransmitted to all data sites.
- Repeat for multiple k-Means iterations.
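The combination step is exact because a local prototype times its local cluster size recovers the local cluster sum. A minimal sketch of the central node's computation, assuming each site reports its local prototypes and cluster sizes (all names are illustrative):

```python
import numpy as np

def combine_local_prototypes(local_protos, local_sizes):
    """Exact global centroids as the size-weighted mean of local prototypes
    (equivalent to pooling all the data, since prototype * size = cluster sum)."""
    P = np.asarray(local_protos)  # (sites, k, d) local prototypes
    S = np.asarray(local_sizes)   # (sites, k)    local cluster sizes
    weighted_sum = (P * S[:, :, None]).sum(axis=0)  # (k, d) global cluster sums
    return weighted_sum / S.sum(axis=0)[:, None]    # (k, d) global centroids
```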
K-Means Summary
Pros:
- Simple and Intuitive.
- Linear computational complexity in the number of observations: roughly $O(t\, k\, N\, d)$, where $t$ is the number of iterations, $k$ the number of clusters, $N$ the number of observations, and $d$ the number of attributes.
- Effective in many scenarios, producing interpretable results.
- Well-known and widely used.
Cons:
- Choosing the number of clusters $k$ is difficult.
- Sensitive to prototype initialization (local, suboptimal solutions).
- Works best with volumetric/globular clusters.
- Applicable to real-valued variables only.
- Sensitive to outliers.
Hyper-Parameter Selection in k-Means
- Multiple Runs within a Range of $k$:
- Run k-Means repeatedly for different values of $k$ and different random initial prototypes.
- Select the best result (partition) according to some goodness-of-fit criterion.
- Pros: it can estimate $k$ while being less sensitive to local minima.
- Cons: increased computational cost.
SSE as a Goodness-of-Fit Criterion
- Can the algorithm’s objective function (SSE) be used to choose the best k-Means solution?
- Yes, IF the candidate solutions all have the same number of clusters (i.e., $k$ is fixed).
- However, if the number of clusters is unknown, the SSE tends to decrease monotonically as $k$ increases, so it cannot be compared directly across different values of $k$.
How to Determine k?
- For a fixed $k$, choose the solution with the minimum SSE from multiple runs.
- For variable $k$, the SSE decreases as $k$ increases.
- Plot the SSE for each $k$ and look for a prominent "elbow" in the plot, as in the sketch below.
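A sketch of this procedure, reusing the hypothetical kmeans() and sse() sketches above and assuming X is an (N, d) NumPy array:

```python
# Run k-Means with multiple restarts for each candidate k and record the best SSE.
sse_per_k = {}
for k in range(1, 11):
    runs = [kmeans(X, k, rng=seed) for seed in range(10)]  # multiple random restarts
    sse_per_k[k] = min(sse(X, labels, cents) for labels, cents in runs)
# Plot sse_per_k against k and look for a prominent "elbow".
```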
Clustering Validation
- The procedure above works well when clusters are well separated from each other.
- In practice, however, clustering is often not so trivial.
- There is a broad subarea of clustering, called clustering validation, concerned with the quantitative evaluation, model selection, and statistical validation of clustering results.
- One basic criterion commonly used in practice is illustrated next.
Silhouette Width Criterion (SWC)
- SWC = average of the silhouettes of the individual observations:
\mathrm{SWC} = \frac{1}{N} \sum_{j=1}^{N} s_j
- Silhouette of observation $x_j$:
s_j = \frac{b_j - a_j}{\max(a_j, b_j)}
where
- $a_j$: average distance from observation $x_j$ to the other observations in its cluster.
- $b_j$: average distance from observation $x_j$ to the observations in the nearest neighboring cluster.
Simplified Version:
- $a_j$ is the distance from observation $x_j$ to its cluster centroid.
- $b_j$ is the distance from observation $x_j$ to the centroid of the nearest cluster (other than the one $x_j$ belongs to).
SWC Range: $-1 \le \mathrm{SWC} \le 1$; larger values indicate better-defined clusters.
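A minimal NumPy sketch of the simplified version (names are ours; it assumes at least two clusters):

```python
import numpy as np

def simplified_silhouette(X, labels, centroids):
    """Simplified SWC: centroid distances replace the average pairwise distances."""
    N = len(X)
    d = np.sqrt(((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2))  # (N, k)
    a = d[np.arange(N), labels]          # distance to own centroid
    d_other = d.copy()
    d_other[np.arange(N), labels] = np.inf
    b = d_other.min(axis=1)              # distance to nearest other centroid
    return float(((b - a) / np.maximum(a, b)).mean())  # higher is better
```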
References (Part 2)
- Steinley, D., K-Means Clustering: A Half-Century Synthesis, British Journal of Mathematical and Statistical Psychology, Vol. 59, 2006.
- Jain, A. K., Data Clustering: 50 Years Beyond K-Means, Pattern Recognition Letters, 2010.
- Tan, P.-N., Steinbach, M., and Kumar, V., Introduction to Data Mining, 2nd Edition, Pearson, 2018.
- James, G., Witten, D., Hastie, T., and Tibshirani, R., An Introduction to Statistical Learning with Applications in R, Springer, 2013.
- Liu, G. L., Introduction to Combinatorial Mathematics, McGraw-Hill, 1968.