Data Mining - Clustering Basics and Partitioning Clustering
Introduction to Clustering
- Clustering is a widely used technique in exploratory data analysis (EDA), alongside regression and classification.
- It has been a subfield of statistics since the 1950s.
- Its popularity has increased with the rise of big data.
- It is now used in computer science and data science, not just statistics.
The Subjectivity of Clustering
- Clustering is not a well-defined problem with a single solution.
- Definitions of clustering vary; for example:
- Finding groups where objects within a group are similar and different from objects in other groups (Tan et al., 2006).
- Partitioning observations into distinct groups based on similarity within groups and dissimilarity between groups (James et al., 2013).
- The terms "similar" and "different" are subjective, making the problem ill-defined.
- This subjectivity leads to many different clustering approaches and algorithms.
- To solve a clustering task, the problem needs to be defined objectively.
Mathematical Perspectives on Clustering
- Heterogeneity (separability): Inter-cluster distances are maximized.
- Homogeneity (internal cohesion): Intra-cluster distances are minimized.
- Another perspective:
- A cluster is a set of points in space where the distance between any two points in the cluster is less than the distance between a point in the cluster and a point outside it (Everitt, 1974).
Clustering in Practice
- Clustering is a powerful tool for data pre-processing and exploration.
- It is used in initial data analysis to understand data structures before applying other techniques.
- It allows for data organization, visualization, and summarization.
- Clustering can also be the final goal of analysis in various applications.
Applications of Clustering
- Clustering is used in many fields, including:
- Biology, Health, and Medicine
- Psychology
- Business and Finance
- Computing Science
- Engineering
Application Examples
- Marketing: Customer segmentation.
- Bioinformatics: Grouping of genes, tissue types, or cell types.
- Astronomy: Grouping of stars or galaxies.
- Image Processing: Image or pixel segmentation.
- Text Mining: Automatic document categorization.
Cluster Analysis Process
- Data -> Data Representation -> Choice, Configuration, and Application of Clustering Tools -> Clustering Evaluation and Validation -> Cluster Interpretation -> Meaningful and Useful Findings (Domain Expert).
- Note:
- This process is iterative (EDA).
- Clustering is subjective; attempts to make it objective rely on assumptions that may need revisiting.
- Each clustering algorithm makes the problem objective based on its own assumptions.
References (Part 1)
- Jain, A. K. and Dubes, R. C., Algorithms for Clustering Data, Prentice Hall, 1988.
- Everitt, B. S., Cluster Analysis, John Wiley and Sons, 1974.
- Everitt, B. S., Landau, S., and Leese, M., Cluster Analysis, 4th Edition, Arnold, 2001.
- Tan, P.-N., Steinbach, M., and Kumar, V., Introduction to Data Mining, Addison-Wesley, 2006.
- Tan, P.-N., Steinbach, M., and Kumar, V., Introduction to Data Mining, 2nd Edition, Pearson, 2018.
- James, G., Witten, D., Hastie, T., and Tibshirani, R., An Introduction to Statistical Learning with Applications in R, Springer, 2013.
- Wu, X. and Kumar, V. (Editors), The Top Ten Algorithms in Data Mining, CRC Press, 2009.
Partitioning Clustering and k-Means
- Part 2 focuses on partitioning clustering and the k-Means algorithm.
Data Partition
- Given a dataset $X = \{x_1, x_2, \dots, x_N\}$ of $N$ observations, a hard (non-overlapping) clustering partition of the data is a collection of $k$ disjoint, non-empty subsets $C_1, C_2, \dots, C_k$, called clusters, such that $\bigcup_{i=1}^{k} C_i = X$ and $C_i \cap C_j = \emptyset$ for $i \neq j$.
- For example, $X = \{x_1, \dots, x_6\}$ can be partitioned into $C_1 = \{x_1\}$, $C_2 = \{x_2, x_5\}$, and $C_3 = \{x_3, x_4, x_6\}$.
- One way to represent a clustering partition is a partition matrix.
Partition Matrix
- A $k \times N$ matrix $U_X = [\mu_{ij}]$, with $k$ rows (number of clusters) and $N$ columns (number of observations), where $\mu_{ij}$ represents the membership degree of observation $x_j$ ($j = 1, \dots, N$) to cluster $C_i$ ($i = 1, \dots, k$).
- In hard partitioning, each observation belongs to exactly one cluster: $\mu_{ij} \in \{0, 1\}$ and $\sum_{i=1}^{k} \mu_{ij} = 1$ for every $j$.
- The partition matrix is represented as:
U_X = \begin{bmatrix}
\mu_{11} & \mu_{12} & \dots & \mu_{1N} \\
\mu_{21} & \mu_{22} & \dots & \mu_{2N} \\
\vdots & \vdots & \ddots & \vdots \\
\mu_{k1} & \mu_{k2} & \dots & \mu_{kN}
\end{bmatrix}
- Example:
U_X = \begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 1 & 0 \\
0 & 0 & 1 & 1 & 0 & 1
\end{bmatrix}
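For illustration, the example above can be reproduced in code. A minimal NumPy sketch (the function name and the label encoding are ours, not from the source):

```python
import numpy as np

def hard_partition_matrix(labels, k):
    """Build the k x N hard partition matrix U from a vector of cluster labels."""
    N = len(labels)
    U = np.zeros((k, N), dtype=int)
    U[labels, np.arange(N)] = 1  # mu_ij = 1 iff observation j belongs to cluster i
    return U

# Labels matching the example above: x1 in C1; x2 and x5 in C2; x3, x4, x6 in C3
labels = np.array([0, 1, 2, 2, 1, 2])
print(hard_partition_matrix(labels, k=3))
```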
Non-Overlapping Partitioning Clustering
- Hard or non-overlapping partitioning clustering refers to methods that seek a hard partition of a dataset $X$, viewed as a collection of objects/observations to be clustered.
- Finding a hard partition matrix $U_X$ is equivalent to subdividing the set $X$ with $N$ objects into a collection of $k$ disjoint clusters $C_1, \dots, C_k$ such that $\bigcup_{i=1}^{k} C_i = X$, $C_i \neq \emptyset$ for all $i$, and $C_i \cap C_j = \emptyset$ for $i \neq j$.
K-Means
- K-Means is a widely known clustering algorithm.
- It seeks compact, well-separated clusters: within-cluster distances are minimized, which in turn tends to maximize between-cluster separation.
- The classic version assumes quantitative observations and uses squared Euclidean distance.
- Clusters are represented by their multivariate mean or centroid (spatial center).
- K-Means is a prototype-based clustering algorithm.
- It finds a non-overlapping partition of data into disjoint subsets (clusters).
Historical Context of K-Means
- MacQueen (1967) is often cited as the original discoverer.
- However, K-means was independently discovered by Steinhaus (1956), Lloyd (1957, published in 1982), Ball & Hall (1965), and MacQueen (1967) in different fields.
Practical Use of K-Means
- Despite being developed in the 1950s/60s, it is still widely used because:
- It is simple and intuitive.
- It is computationally fast and can be optimized.
- It serves as a basis for more sophisticated algorithms.
Basic K-Means Algorithm
1. Choose initial cluster prototypes (using observations from the data or random points).
2. Assign each observation to the closest prototype (using the squared Euclidean distance).
3. Update the cluster prototypes as the centroids of the corresponding clusters.
4. Iterate steps 2 and 3 until no changes occur or until a stopping criterion is met (e.g. a maximum number of iterations).
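The four steps translate directly into code. Below is a minimal NumPy sketch under the classic assumptions (quantitative data, squared Euclidean distance); the function and variable names are illustrative, not a reference implementation:

```python
import numpy as np

def kmeans(X, k, max_iter=100, rng=None):
    """Basic k-Means on an (N, d) data array; returns (labels, centroids)."""
    rng = np.random.default_rng(rng)
    # Step 1: choose initial prototypes as k distinct observations from the data
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each observation to the closest prototype
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Step 3: update prototypes as centroids of the corresponding clusters
        new_centroids = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        # Step 4: stop when the prototypes no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```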
K-Means as an Optimization Problem
- Objective function to be minimized: the SSE (Sum of Squared Errors, i.e. the total within-cluster variation):
\mathrm{SSE} = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - c_i \rVert^2
where
- $\lVert \cdot \rVert$ is the Euclidean distance (norm), and
- $c_i$ is the prototype of cluster $C_i$ (with cardinality $N_i = |C_i|$).
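Given assignments and prototypes, the objective is a one-liner. A NumPy sketch consistent with the k-Means sketch above (names are ours):

```python
import numpy as np

def sse(X, labels, centroids):
    """Sum of squared errors: squared distance of each observation to its cluster centroid."""
    return float(((X - centroids[labels]) ** 2).sum())
```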
K-Means Optimization Steps
- The SSE can be rewritten using a partition matrix as:
\mathrm{SSE}(U, C) = \sum_{i=1}^{k} \sum_{j=1}^{N} \mu_{ij}\, \lVert x_j - c_i \rVert^2
where $U = [\mu_{ij}]$ is the hard partition matrix and $C = \{c_1, \dots, c_k\}$ is the set of cluster prototypes.
- We want to minimize $\mathrm{SSE}(U, C)$ with respect to both $U$ and $C$.
- Steps:
- Fix $C$ and minimize with respect to $U$ (E-step: Expectation).
- Fix $U$ and minimize with respect to $C$ (M-step: Maximization).
E-step
- Fix $C$ and minimize $\mathrm{SSE}(U, C)$ w.r.t. $U$.
- Minimize each summation component over $j$ independently.
- Set $\mu_{ij} = 1$ for the $i$ that provides the smallest squared error $\lVert x_j - c_i \rVert^2$, and $\mu_{ij} = 0$ otherwise.
- That is, assign each observation to the cluster whose prototype is closest.
M-step
- Minimize $\mathrm{SSE}(U, C)$ with respect to $C$, fixing $U$.
- Take the gradient of the SSE with respect to each $c_i$ and set it to zero to solve for $c_i$:
\frac{\partial\, \mathrm{SSE}}{\partial c_i} = -2 \sum_{j=1}^{N} \mu_{ij} (x_j - c_i) = 0
therefore
c_i = \frac{\sum_{j=1}^{N} \mu_{ij}\, x_j}{\sum_{j=1}^{N} \mu_{ij}} = \frac{1}{N_i} \sum_{x_j \in C_i} x_j
- $c_i$ is the centroid of cluster $C_i$.
K-Means Limitations
- K-Means is only guaranteed to converge to a locally optimal solution, not a global one.
- The final solution depends on initial prototypes.
Initialization Strategies
- Multiple Random Initializations:
- Run k-Means multiple times with different random initial prototypes.
- Select the best k-Means result according to a criterion (e.g. SSE for fixed k).
- Works well in many problems.
- May require many runs for complex problems.
- Improved Versions of Random Initializations:
- KMeans++: The first center is chosen randomly, then subsequent centers are chosen with probability proportional to the squared distance from the point's closest existing center.
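A minimal sketch of this seeding rule, following the NumPy conventions of the k-Means sketch above (names are illustrative):

```python
import numpy as np

def kmeanspp_init(X, k, rng=None):
    """k-means++ seeding: new centers are sampled with probability
    proportional to the squared distance to the closest existing center."""
    rng = np.random.default_rng(rng)
    centers = [X[rng.integers(len(X))]]  # first center: uniformly at random
    for _ in range(k - 1):
        C = np.array(centers)
        # squared distance from each point to its closest existing center
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])  # D^2 weighting
    return np.array(centers)
```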
Efficient Implementations for Big Data
- Computational performance can be improved using:
- Specialized Data Structures (e.g. kd-trees).
- Smart Algorithms (e.g. Triangle Inequality).
- Recursive Update of Centroids:
- New centroids depend on previous values, objects that changed clusters, and cluster sizes.
- They don’t need to be recomputed from scratch each iteration (see the update formulas after this list).
- Parallelization (next).
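As a concrete instance of the recursive update (a standard identity, stated in the notation of the M-step above): if a single object $x$ moves from cluster $i$ to cluster $j$, only the two affected centroids need updating:
c_i \leftarrow \frac{N_i\, c_i - x}{N_i - 1}, \qquad c_j \leftarrow \frac{N_j\, c_j + x}{N_j + 1}
where $N_i$ and $N_j$ are the cluster sizes before the move.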
Parallel / Distributed K-Means
- Data are distributed across multiple data sites or processors.
- Algorithm:
- Distribute the same initial prototypes to every data site.
- Each site runs one iteration of k-Means on its share of data in parallel.
- Local prototypes and cluster sizes are communicated to a central node.
- Global prototypes are computed (exactly) and retransmitted to all data sites.
- Repeat for multiple k-Means iterations.
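The combination step is exact because a local prototype times its local cluster size recovers the local cluster sum. A minimal sketch of the central node's computation, assuming each site reports its local prototypes and cluster sizes (all names are illustrative):

```python
import numpy as np

def combine_local_prototypes(local_protos, local_sizes):
    """Exact global centroids as the size-weighted mean of local prototypes
    (equivalent to pooling all the data, since prototype * size = cluster sum)."""
    P = np.asarray(local_protos)  # (sites, k, d) local prototypes
    S = np.asarray(local_sizes)   # (sites, k)    local cluster sizes
    weighted_sum = (P * S[:, :, None]).sum(axis=0)  # (k, d) global cluster sums
    return weighted_sum / S.sum(axis=0)[:, None]    # (k, d) global centroids
```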
K-Means Summary
Pros:
- Simple and Intuitive.
- Linear computational complexity in the number of observations: roughly $O(t\, k\, N\, d)$, where $t$ is the number of iterations, $k$ the number of clusters, $N$ the number of observations, and $d$ the number of attributes.
- Effective in many scenarios, producing interpretable results.
- Well-known and widely used.
Cons:
- Choosing the number of clusters $k$ is difficult.
- Sensitive to prototype initialization (local, suboptimal solutions).
- Works best with volumetric/globular clusters.
- Applicable to real-valued variables only.
- Sensitive to outliers.
Hyper-Parameter Selection in k-Means
- Multiple Runs within a Range of $k$:
- Run k-Means repeatedly for different values of $k$ and different random initial prototypes.
- Select the best result (partition) according to some goodness-of-fit criterion.
- Pros: it can estimate $k$ while being less sensitive to local minima.
- Cons: increased computational cost.
SSE as a Goodness-of-Fit Criterion
- Can the algorithm’s objective function (SSE) be used to choose the best k-Means solution?
- Yes, IF the candidate solutions all have the same number of clusters (i.e., $k$ is fixed).
- However, if the number of clusters is unknown, the SSE tends to decrease monotonically as $k$ increases, so it cannot be compared directly across different values of $k$.
How to Determine k?
- For a fixed $k$, choose the solution with the minimum SSE from multiple runs.
- For variable $k$, the SSE decreases as $k$ increases.
- Plot the SSE for each $k$ and look for a prominent "elbow" in the plot, as in the sketch below.
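A sketch of this procedure, reusing the hypothetical kmeans() and sse() sketches above and assuming X is an (N, d) NumPy array:

```python
# Run k-Means with multiple restarts for each candidate k and record the best SSE.
sse_per_k = {}
for k in range(1, 11):
    runs = [kmeans(X, k, rng=seed) for seed in range(10)]  # multiple random restarts
    sse_per_k[k] = min(sse(X, labels, cents) for labels, cents in runs)
# Plot sse_per_k against k and look for a prominent "elbow".
```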
Clustering Validation
- The procedure above works well when clusters are well separated from each other.
- In practice, however, clustering is often not so trivial.
- There is a broad subarea of clustering, called clustering validation, concerned with the quantitative evaluation, model selection, and statistical validation of clustering results.
- One basic criterion commonly used in practice is illustrated next.
Silhouette Width Criterion (SWC)
- SWC = average of the silhouettes of the individual observations:
\mathrm{SWC} = \frac{1}{N} \sum_{j=1}^{N} s_j
- Silhouette of observation $x_j$:
s_j = \frac{b_j - a_j}{\max(a_j, b_j)}
where
- $a_j$: average distance from observation $x_j$ to the other observations in its cluster.
- $b_j$: average distance from observation $x_j$ to the observations in the nearest neighboring cluster.
Simplified Version:
- $a_j$ is the distance from observation $x_j$ to its cluster centroid.
- $b_j$ is the distance from observation $x_j$ to the centroid of the nearest cluster (other than the one $x_j$ belongs to).
SWC Range: $-1 \le \mathrm{SWC} \le 1$; larger values indicate better-defined clusters.
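A minimal NumPy sketch of the simplified version (names are ours; it assumes at least two clusters):

```python
import numpy as np

def simplified_silhouette(X, labels, centroids):
    """Simplified SWC: centroid distances replace the average pairwise distances."""
    N = len(X)
    d = np.sqrt(((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2))  # (N, k)
    a = d[np.arange(N), labels]          # distance to own centroid
    d_other = d.copy()
    d_other[np.arange(N), labels] = np.inf
    b = d_other.min(axis=1)              # distance to nearest other centroid
    return float(((b - a) / np.maximum(a, b)).mean())  # higher is better
```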
References (Part 2)
- Steinley, D., K-Means Clustering: A Half-Century Synthesis, British Journal of Mathematical and Statistical Psychology, Vol. 59, 2006.
- Jain, A. K., Data Clustering: 50 Years Beyond K-Means, Pattern Recognition Letters, 2010.
- Tan, P.-N., Steinbach, M., and Kumar, V., Introduction to Data Mining, 2nd Edition, Pearson, 2018.
- James, G., Witten, D., Hastie, T., and Tibshirani, R., An Introduction to Statistical Learning with Applications in R, Springer, 2013.
- Liu, G. L., Introduction to Combinatorial Mathematics, McGraw-Hill, 1968.