Clustering Techniques

Clustering

  • Overview of clustering as a data analysis technique.

  • K-Means Clustering

  • Hierarchical Clustering

  • Course Information: ADM3308, Telfer School of Management, University of Ottawa

Outline ADM3308: Business Data Mining

  • Introduction and Definitions

  • Applications of Clustering

  • K-Means Clustering

  • Hierarchical Clustering

  • Evaluating Clusters

Classification vs. Clustering

Classification

  • Definition:

    • Supervised learning technique that uses labeled data.

    • Directed learning for predictive modeling.

    • Pre-defined Classes: Classifications are made based on predefined labels.

    • Predictive Model: Learns method to predict class of new samples using pre-labeled instances.

  • Visual Representation: Data points grouped into two pre-defined classes.

Clustering

  • Definition:

    • Unsupervised learning method utilizing un-labeled data.

    • Undirected learning approach aimed at descriptive modeling.

    • Groups records based on similarity of attributes, finding “natural” groupings within the data.

    • Characteristics: Records within a cluster are more similar to each other than to those in other clusters.

  • Visual Representation: Data points can be grouped into three clusters.

Example of Clustering Animals

  • Features Essential for Clustering: Proximity according to some defined distance or similarity measures.

  • Creating Clusters Depend on Factors:

    • Number of target clusters.

    • Common traits within the clusters.

  • Features Considered:

    • Number of legs, skin color, type of food, habitats, etc.

  • Example Clusters:

    • Cluster 1: Four-legged animals (cat, dog, horse, cow, fox, wolf, tiger, lion, zebra).

    • Cluster 2: Two-legged animals (dove, owl, hawk, eagle, hen, duck, goose).

    • Additional Clusters:

      • Cluster 3: Domestic animals used for food (hen, duck, goose, cow).

      • Cluster 4: Pets (cat, dog, horse).

      • Cluster 5: Wild birds (dove, owl, hawk, eagle).

      • Cluster 6: Wild mammals (fox, wolf, tiger, lion, zebra).

Introduction to Clustering

  • Characteristics of Clustering:

    • Clustering is unsupervised (un-directed).

    • No target variable exists, similarity of data defining the clusters.

    • Assist in understanding large datasets through:

      • Organizing data into smaller, more homogeneous pieces.

      • Reducing records by identifying prototypical examples.

      • Interpretation of clusters necessitates human insight.

Clustering Problem Definition

  • Problem Setup:

    • Given a database of tuples, $D = {t1, t2, …, t_n}$, and an integer $k$, define a mapping function $f: D \rightarrow {1, .., k}$.

    • Each tuple $ti$ is assigned to one cluster $Kj$, where $1 \leq j \leq k$.

    • Definition of a cluster $K_j$: Contains all tuples assigned to it.

    • Contrast to classification: Clusters are not known a priori.

    • For cluster $Ki = {t{i1}, t{i2}, …, t{im}}$, the cluster mean is given by:
      m<em>i=1m(t</em>i1++tim)m<em>i = \frac{1}{m} (t</em>{i1} + \ldots + t_{im})

Applications of Clustering

  • Customer Segmentation:

    • Identify groups of customers for targeted marketing strategies.

    • Example Applications:

      • CRM: Analyze customer data to define segments without specific questions guiding the analysis.

  • Clustering Products:

    • PRIZM system by Claritas Corporation for demographic segmentation.

    • Geographic examples can be analyzed (e.g., zip codes).

  • Public Utilities Example:

    • Data for 22 public utilities based on 8 measurements including costs and performance metrics to find similar utilities.

    • Variables Include:

      • Fixed-charge covering ratio, rate of return on capital, cost per kilowatt capacity, annual load factor, growth in peak demand, and nuclear fuel costs.

Clustering Example: Data from Public Utilities

  • Display of various utilities with corresponding metrics demonstrating clustering features:
    Company Data includes:

  • Arizona: Fixed charge = 1.06, RoR = 9.2, Cost = 151, Load = 54.4, Demand growth = 1.6, Sales = 9077, Nuclear = 0, Fuel Cost = 0.628.

  • Additional Data from Companies such as Boston, Central, Commonwealth, and various others showcasing diverse metrics.

  • Visual representation of clustering based on sales and fuel costs.

Additional Applications of Clustering

  • Other notable applications:

    • Grouping securities in investment portfolios.

    • Analyzing economic structures through company groupings.

    • Clustering user queries for uniformity.

    • Segmenting voters.

    • Web Mining: Identifying usage patterns on the internet.

    • Genomics: Grouping genes based on expression similarities.

    • Classifying species within biological studies.

    • Organizing the periodic table of elements.

Clustering Example: Astronomy

  • Specifically, the Hertzsprung-Russell diagram which clusters stars based on temperature and luminosity measurements.

  • Stars are categorized into groups: red giants, main sequence, and white dwarfs according to absolute magnitude and spectral class.

K-Means Clustering

  • One of the most commonly utilized clustering algorithms.

  • Characteristics of K-means:

    • Iterative process that assumes a geometric structure of data.

    • Records represented as points in n-dimensional space.

    • Requires specification of K (number of clusters).

    • Process begins with K clusters initialized arbitrarily and iteratively refines them.

Steps in K-Means Clustering

Step 1: Initialization

  • Randomly select k data points to act as initial centroids (seeds).

Step 2: Assignment

  • Assign each data record to the nearest centroid, creating initial clusters.

Step 3: Centroid Calculation

  • Compute new centroids for the clusters based on the mean of records within each cluster.

  • Repeat steps until centroids stabilize (no longer change).

  • Centroid Calculation Formula:
    m=1C<em>x</em>iCxim = \frac{1}{|C|} \sum<em>{x</em>i \in C} x_i where $C$ is the cluster.

K-Means Example

Step 1: Choosing Initial Centroids

  • Example with k = 3, picking initial centroids based on income and age data.

Step 2: Initial Cluster Assignment

  • Assign each point to the closest centroid following initial selections.

Step 3: Moving Centroids

  • Centroids are recalculated based on newly assigned clusters.

  • Reassign data points to corresponding clusters based on proximity to new centroids.

  • Process continues iteratively until centroids stop moving.

Interpreting Clusters

  • Unsupervised modeling necessitates a thorough understanding of each cluster’s characteristics and usefulness through evaluation criteria and summary statistics.

  • Assign meaningful labels to clusters for better comprehension.

  • Example of evaluating clusters in demographic strategy with PRIZM.

Evaluating Clusters

Types of Evaluation

Extrinsic Evaluation (Ground Truth Evaluation)
  • Homogeneity: Clusters should contain points primarily from one category.

  • Completeness: Must measure how many data points from the same category form the same cluster.

Intrinsic Evaluation
  • Evaluation done without relying on external ground truths.

    • Variance measurement

    • Dunn Index

    • Silhouette Score

Cluster Similarity Measurement

  • Measure within-cluster similarity using Sum of Squared Differences (SSD): SSD=<em>x</em>iC(xim)2SSD = \sum<em>{x</em>i \in C}(x_i - m)^2

    • Clusters with lower SSD are more desirable.

    • Variance also assessed relative to cluster size.

Dunn Index and Silhouette Score

  • Dunn Index computed as: Dunn=MINMAXDunn = \frac{MIN}{MAX}

    • Measure of the compactness and separation of clusters.

  • Silhouette Score evaluates the dissimilarity score and measures how well each point is clustered.

    • Silhouette Score Formula:
      Silht=crossdissAdissAcrossdissASilht = \frac{crossdissA - dissA}{crossdissA}

  • Silhouette scores range from 0 to 1 (occasionally from -1 to 1), with higher values indicating better clustering.

Hierarchical Clustering

  • Forms clusters in levels, yielding a visual representation of different sizes and shapes of clusters.

  • Two approaches:

Agglomerative Clustering
  • Begins with each item as a separate cluster, merging them iteratively based on similarity.

Divisive Clustering
  • Starts with one cluster then divides it into smaller clusters recursively.

Agglomerative Clustering Process

  1. Create a Similarity Matrix.

  2. Merge the two closest clusters based on the most similar elements.

  3. Update the similarity matrix after each merge operation.

  4. Continue until a single cluster remains.

  5. Each step generates potential clusterings.

Measuring Distance Between Clusters

  • Three distance measurement methods:

    • Centroid method.

    • Single linkage method.

    • Complete linkage method.

Dendrograms

  • Definition: A tree-like structure visualizing the results of hierarchical clustering.

  • Different levels exhibit the clusters formed at each stage.

  • Leaf nodes represent individual clusters; the root presents as a collective cluster.

Evaluating Clusters Summarized

  • Cluster evaluation is essential for determining their overall effectiveness.

  • Evaluation methods include both extrinsic and intrinsic measures to ensure robustness in findings.

References

  • Linoff, G. & Berry, M. (2011). Data Mining Techniques for Marketing, Sales, and Customer Relationship Management. John Wiley, 3rd Edition.

  • Shmueli, G. et al. (2023). Machine Learning for Business Analytics. John Wiley.