Clustering Techniques
Clustering
Overview of clustering as a data analysis technique.
K-Means Clustering
Hierarchical Clustering
Course Information: ADM3308, Telfer School of Management, University of Ottawa
Outline ADM3308: Business Data Mining
Introduction and Definitions
Applications of Clustering
K-Means Clustering
Hierarchical Clustering
Evaluating Clusters
Classification vs. Clustering
Classification
Definition:
Supervised learning technique that uses labeled data.
Directed learning for predictive modeling.
Pre-defined Classes: Classifications are made based on predefined labels.
Predictive Model: Learns method to predict class of new samples using pre-labeled instances.
Visual Representation: Data points grouped into two pre-defined classes.
Clustering
Definition:
Unsupervised learning method utilizing un-labeled data.
Undirected learning approach aimed at descriptive modeling.
Groups records based on similarity of attributes, finding “natural” groupings within the data.
Characteristics: Records within a cluster are more similar to each other than to those in other clusters.
Visual Representation: Data points can be grouped into three clusters.
Example of Clustering Animals
Features Essential for Clustering: Proximity according to some defined distance or similarity measures.
Creating Clusters Depend on Factors:
Number of target clusters.
Common traits within the clusters.
Features Considered:
Number of legs, skin color, type of food, habitats, etc.
Example Clusters:
Cluster 1: Four-legged animals (cat, dog, horse, cow, fox, wolf, tiger, lion, zebra).
Cluster 2: Two-legged animals (dove, owl, hawk, eagle, hen, duck, goose).
Additional Clusters:
Cluster 3: Domestic animals used for food (hen, duck, goose, cow).
Cluster 4: Pets (cat, dog, horse).
Cluster 5: Wild birds (dove, owl, hawk, eagle).
Cluster 6: Wild mammals (fox, wolf, tiger, lion, zebra).
Introduction to Clustering
Characteristics of Clustering:
Clustering is unsupervised (un-directed).
No target variable exists, similarity of data defining the clusters.
Assist in understanding large datasets through:
Organizing data into smaller, more homogeneous pieces.
Reducing records by identifying prototypical examples.
Interpretation of clusters necessitates human insight.
Clustering Problem Definition
Problem Setup:
Given a database of tuples, $D = {t1, t2, …, t_n}$, and an integer $k$, define a mapping function $f: D \rightarrow {1, .., k}$.
Each tuple $ti$ is assigned to one cluster $Kj$, where $1 \leq j \leq k$.
Definition of a cluster $K_j$: Contains all tuples assigned to it.
Contrast to classification: Clusters are not known a priori.
For cluster $Ki = {t{i1}, t{i2}, …, t{im}}$, the cluster mean is given by:
Applications of Clustering
Customer Segmentation:
Identify groups of customers for targeted marketing strategies.
Example Applications:
CRM: Analyze customer data to define segments without specific questions guiding the analysis.
Clustering Products:
PRIZM™ system by Claritas Corporation for demographic segmentation.
Geographic examples can be analyzed (e.g., zip codes).
Public Utilities Example:
Data for 22 public utilities based on 8 measurements including costs and performance metrics to find similar utilities.
Variables Include:
Fixed-charge covering ratio, rate of return on capital, cost per kilowatt capacity, annual load factor, growth in peak demand, and nuclear fuel costs.
Clustering Example: Data from Public Utilities
Display of various utilities with corresponding metrics demonstrating clustering features:
Company Data includes:Arizona: Fixed charge = 1.06, RoR = 9.2, Cost = 151, Load = 54.4, Demand growth = 1.6, Sales = 9077, Nuclear = 0, Fuel Cost = 0.628.
Additional Data from Companies such as Boston, Central, Commonwealth, and various others showcasing diverse metrics.
Visual representation of clustering based on sales and fuel costs.
Additional Applications of Clustering
Other notable applications:
Grouping securities in investment portfolios.
Analyzing economic structures through company groupings.
Clustering user queries for uniformity.
Segmenting voters.
Web Mining: Identifying usage patterns on the internet.
Genomics: Grouping genes based on expression similarities.
Classifying species within biological studies.
Organizing the periodic table of elements.
Clustering Example: Astronomy
Specifically, the Hertzsprung-Russell diagram which clusters stars based on temperature and luminosity measurements.
Stars are categorized into groups: red giants, main sequence, and white dwarfs according to absolute magnitude and spectral class.
K-Means Clustering
One of the most commonly utilized clustering algorithms.
Characteristics of K-means:
Iterative process that assumes a geometric structure of data.
Records represented as points in n-dimensional space.
Requires specification of K (number of clusters).
Process begins with K clusters initialized arbitrarily and iteratively refines them.
Steps in K-Means Clustering
Step 1: Initialization
Randomly select k data points to act as initial centroids (seeds).
Step 2: Assignment
Assign each data record to the nearest centroid, creating initial clusters.
Step 3: Centroid Calculation
Compute new centroids for the clusters based on the mean of records within each cluster.
Repeat steps until centroids stabilize (no longer change).
Centroid Calculation Formula:
where $C$ is the cluster.
K-Means Example
Step 1: Choosing Initial Centroids
Example with k = 3, picking initial centroids based on income and age data.
Step 2: Initial Cluster Assignment
Assign each point to the closest centroid following initial selections.
Step 3: Moving Centroids
Centroids are recalculated based on newly assigned clusters.
Reassign data points to corresponding clusters based on proximity to new centroids.
Process continues iteratively until centroids stop moving.
Interpreting Clusters
Unsupervised modeling necessitates a thorough understanding of each cluster’s characteristics and usefulness through evaluation criteria and summary statistics.
Assign meaningful labels to clusters for better comprehension.
Example of evaluating clusters in demographic strategy with PRIZM.
Evaluating Clusters
Types of Evaluation
Extrinsic Evaluation (Ground Truth Evaluation)
Homogeneity: Clusters should contain points primarily from one category.
Completeness: Must measure how many data points from the same category form the same cluster.
Intrinsic Evaluation
Evaluation done without relying on external ground truths.
Variance measurement
Dunn Index
Silhouette Score
Cluster Similarity Measurement
Measure within-cluster similarity using Sum of Squared Differences (SSD):
Clusters with lower SSD are more desirable.
Variance also assessed relative to cluster size.
Dunn Index and Silhouette Score
Dunn Index computed as:
Measure of the compactness and separation of clusters.
Silhouette Score evaluates the dissimilarity score and measures how well each point is clustered.
Silhouette Score Formula:
Silhouette scores range from 0 to 1 (occasionally from -1 to 1), with higher values indicating better clustering.
Hierarchical Clustering
Forms clusters in levels, yielding a visual representation of different sizes and shapes of clusters.
Two approaches:
Agglomerative Clustering
Begins with each item as a separate cluster, merging them iteratively based on similarity.
Divisive Clustering
Starts with one cluster then divides it into smaller clusters recursively.
Agglomerative Clustering Process
Create a Similarity Matrix.
Merge the two closest clusters based on the most similar elements.
Update the similarity matrix after each merge operation.
Continue until a single cluster remains.
Each step generates potential clusterings.
Measuring Distance Between Clusters
Three distance measurement methods:
Centroid method.
Single linkage method.
Complete linkage method.
Dendrograms
Definition: A tree-like structure visualizing the results of hierarchical clustering.
Different levels exhibit the clusters formed at each stage.
Leaf nodes represent individual clusters; the root presents as a collective cluster.
Evaluating Clusters Summarized
Cluster evaluation is essential for determining their overall effectiveness.
Evaluation methods include both extrinsic and intrinsic measures to ensure robustness in findings.
References
Linoff, G. & Berry, M. (2011). Data Mining Techniques for Marketing, Sales, and Customer Relationship Management. John Wiley, 3rd Edition.
Shmueli, G. et al. (2023). Machine Learning for Business Analytics. John Wiley.