Clustering Techniques

Clustering

Overview of clustering as a data analysis technique.
K-Means Clustering
Hierarchical Clustering
Course Information: ADM3308, Telfer School of Management, University of Ottawa

Outline ADM3308: Business Data Mining

Introduction and Definitions
Applications of Clustering
K-Means Clustering
Hierarchical Clustering
Evaluating Clusters

Classification vs. Clustering

Classification

Definition:
- Supervised learning technique that uses labeled data.
- Directed learning for predictive modeling.
- Pre-defined Classes: Classifications are made based on predefined labels.
- Predictive Model: Learns method to predict class of new samples using pre-labeled instances.
Visual Representation: Data points grouped into two pre-defined classes.

Clustering

Definition:
- Unsupervised learning method utilizing un-labeled data.
- Undirected learning approach aimed at descriptive modeling.
- Groups records based on similarity of attributes, finding “natural” groupings within the data.
- Characteristics: Records within a cluster are more similar to each other than to those in other clusters.
Visual Representation: Data points can be grouped into three clusters.

Example of Clustering Animals

Features Essential for Clustering: Proximity according to some defined distance or similarity measures.
Creating Clusters Depend on Factors:
- Number of target clusters.
- Common traits within the clusters.
Features Considered:
- Number of legs, skin color, type of food, habitats, etc.
Example Clusters:
- Cluster 1: Four-legged animals (cat, dog, horse, cow, fox, wolf, tiger, lion, zebra).
- Cluster 2: Two-legged animals (dove, owl, hawk, eagle, hen, duck, goose).
- Additional Clusters:
  - Cluster 3: Domestic animals used for food (hen, duck, goose, cow).
  - Cluster 4: Pets (cat, dog, horse).
  - Cluster 5: Wild birds (dove, owl, hawk, eagle).
  - Cluster 6: Wild mammals (fox, wolf, tiger, lion, zebra).

Introduction to Clustering

Characteristics of Clustering:
- Clustering is unsupervised (un-directed).
- No target variable exists, similarity of data defining the clusters.
- Assist in understanding large datasets through:
  - Organizing data into smaller, more homogeneous pieces.
  - Reducing records by identifying prototypical examples.
  - Interpretation of clusters necessitates human insight.

Clustering Problem Definition

Problem Setup:
- Given a database of tuples, $D = {t1, t2, …, t_n}$, and an integer $k$, define a mapping function $f: D \rightarrow {1, .., k}$.
- Each tuple $ti$ is assigned to one cluster $Kj$, where $1 \leq j \leq k$.
- Definition of a cluster $K_j$: Contains all tuples assigned to it.
- Contrast to classification: Clusters are not known a priori.
- For cluster $Ki = {t{i1}, t{i2}, …, t{im}}$, the cluster mean is given by:
 $mi = \frac{1}{m} (t{i1} + \ldots + t_{im})$

Applications of Clustering

Customer Segmentation:
- Identify groups of customers for targeted marketing strategies.
- Example Applications:
  - CRM: Analyze customer data to define segments without specific questions guiding the analysis.
Clustering Products:
- PRIZM™ system by Claritas Corporation for demographic segmentation.
- Geographic examples can be analyzed (e.g., zip codes).
Public Utilities Example:
- Data for 22 public utilities based on 8 measurements including costs and performance metrics to find similar utilities.
- Variables Include:
  - Fixed-charge covering ratio, rate of return on capital, cost per kilowatt capacity, annual load factor, growth in peak demand, and nuclear fuel costs.

Clustering Example: Data from Public Utilities

Display of various utilities with corresponding metrics demonstrating clustering features:
Company Data includes:
Arizona: Fixed charge = 1.06, RoR = 9.2, Cost = 151, Load = 54.4, Demand growth = 1.6, Sales = 9077, Nuclear = 0, Fuel Cost = 0.628.
Additional Data from Companies such as Boston, Central, Commonwealth, and various others showcasing diverse metrics.
Visual representation of clustering based on sales and fuel costs.

Additional Applications of Clustering

Other notable applications:
- Grouping securities in investment portfolios.
- Analyzing economic structures through company groupings.
- Clustering user queries for uniformity.
- Segmenting voters.
- Web Mining: Identifying usage patterns on the internet.
- Genomics: Grouping genes based on expression similarities.
- Classifying species within biological studies.
- Organizing the periodic table of elements.

Clustering Example: Astronomy

Specifically, the Hertzsprung-Russell diagram which clusters stars based on temperature and luminosity measurements.
Stars are categorized into groups: red giants, main sequence, and white dwarfs according to absolute magnitude and spectral class.

K-Means Clustering

One of the most commonly utilized clustering algorithms.
Characteristics of K-means:
- Iterative process that assumes a geometric structure of data.
- Records represented as points in n-dimensional space.
- Requires specification of K (number of clusters).
- Process begins with K clusters initialized arbitrarily and iteratively refines them.

Steps in K-Means Clustering

Step 1: Initialization

Randomly select k data points to act as initial centroids (seeds).

Step 2: Assignment

Assign each data record to the nearest centroid, creating initial clusters.

Step 3: Centroid Calculation

Compute new centroids for the clusters based on the mean of records within each cluster.
Repeat steps until centroids stabilize (no longer change).
Centroid Calculation Formula:
$m = \frac{1}{|C|} \sum{xi \in C} x_i$ where $C$ is the cluster.

K-Means Example

Step 1: Choosing Initial Centroids

Example with k = 3, picking initial centroids based on income and age data.

Step 2: Initial Cluster Assignment

Assign each point to the closest centroid following initial selections.

Step 3: Moving Centroids

Centroids are recalculated based on newly assigned clusters.
Reassign data points to corresponding clusters based on proximity to new centroids.
Process continues iteratively until centroids stop moving.

Interpreting Clusters

Unsupervised modeling necessitates a thorough understanding of each cluster’s characteristics and usefulness through evaluation criteria and summary statistics.
Assign meaningful labels to clusters for better comprehension.
Example of evaluating clusters in demographic strategy with PRIZM.

Evaluating Clusters

Types of Evaluation

Extrinsic Evaluation (Ground Truth Evaluation)

Homogeneity: Clusters should contain points primarily from one category.
Completeness: Must measure how many data points from the same category form the same cluster.

Intrinsic Evaluation

Evaluation done without relying on external ground truths.
- Variance measurement
- Dunn Index
- Silhouette Score

Cluster Similarity Measurement

Measure within-cluster similarity using Sum of Squared Differences (SSD): $SSD = \sum{xi \in C}(x_i - m)^2$
- Clusters with lower SSD are more desirable.
- Variance also assessed relative to cluster size.

Dunn Index and Silhouette Score

Dunn Index computed as: $Dunn = \frac{MIN}{MAX}$
- Measure of the compactness and separation of clusters.
Silhouette Score evaluates the dissimilarity score and measures how well each point is clustered.
- Silhouette Score Formula:
  $Silht = \frac{crossdissA - dissA}{crossdissA}$
Silhouette scores range from 0 to 1 (occasionally from -1 to 1), with higher values indicating better clustering.

Hierarchical Clustering

Forms clusters in levels, yielding a visual representation of different sizes and shapes of clusters.
Two approaches:

Agglomerative Clustering

Begins with each item as a separate cluster, merging them iteratively based on similarity.

Divisive Clustering

Starts with one cluster then divides it into smaller clusters recursively.

Agglomerative Clustering Process

Create a Similarity Matrix.
Merge the two closest clusters based on the most similar elements.
Update the similarity matrix after each merge operation.
Continue until a single cluster remains.
Each step generates potential clusterings.

Measuring Distance Between Clusters

Three distance measurement methods:
- Centroid method.
- Single linkage method.
- Complete linkage method.

Dendrograms

Definition: A tree-like structure visualizing the results of hierarchical clustering.
Different levels exhibit the clusters formed at each stage.
Leaf nodes represent individual clusters; the root presents as a collective cluster.

Evaluating Clusters Summarized

Cluster evaluation is essential for determining their overall effectiveness.
Evaluation methods include both extrinsic and intrinsic measures to ensure robustness in findings.

References

Linoff, G. & Berry, M. (2011). Data Mining Techniques for Marketing, Sales, and Customer Relationship Management. John Wiley, 3rd Edition.
Shmueli, G. et al. (2023). Machine Learning for Business Analytics. John Wiley.