Clustering Methods

Clustering

  • Clustering aims to find natural groupings within data.
  • Samples within a group are more similar to each other than samples from different groups.
  • Hierarchical results are usually displayed as dendrograms (tree diagrams).
  • Clustering imposes structure, which may not always be present in the data.

Applications of Clustering

  • Classification of species (taxonomy).
  • Classification of vegetation communities.
  • Classification of soil types.
  • Classification of areas into strata for sampling.
  • Less suitable for ecological communities that grade into one another, where many samples are intermediate cases.

Clustering vs. Discriminant Function Analysis

  • Clustering:
    • Identifies groups without predefined categories.
    • An unsupervised method, using the data to define the groups.
  • Discriminant Function Analysis:
    • Starts from predefined groups and determines what distinguishes them.
    • A supervised method.
    • Not covered here, but similar to a principal components analysis carried out with predefined groups.

Steps in Clustering Analysis

  1. Generate distance or dissimilarity matrices.
  2. Choose a clustering approach.
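
Step 1 can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: it assumes the Bray-Curtis dissimilarity (the measure used in the UPGMA example later in these notes), and the function names and the small abundance table are hypothetical.

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two abundance vectors."""
    total = sum(a) + sum(b)
    if total == 0:
        return 0.0  # two empty samples are treated as identical (an assumption)
    return sum(abs(x - y) for x, y in zip(a, b)) / total


def dissimilarity_matrix(samples, measure=bray_curtis):
    """Return a symmetric matrix of pairwise dissimilarities."""
    n = len(samples)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = measure(samples[i], samples[j])
    return matrix


# Hypothetical abundance table: rows are samples, columns are species.
samples = [
    [10, 0, 5],
    [8, 2, 5],
    [0, 9, 1],
]
D = dissimilarity_matrix(samples)
for row in D:
    print([round(value, 3) for value in row])
```

The resulting matrix D is the input to whichever clustering approach is chosen in step 2; any other dissimilarity function with the same two-argument shape could be passed as `measure`.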

Types of Clustering Approaches

  • Agglomerative vs. Divisive.
    • Agglomerative: Starts with individual samples and builds up clusters by merging them.
    • Divisive: Starts with one big group and divides it.
  • Hierarchical vs. Non-Hierarchical.
    • Hierarchical: Once a sample is in a group, it stays there.
    • Non-Hierarchical: Samples can switch groups based on an iterative measure.
  • Weighted vs. Unweighted: Weighted methods give some samples or groups more influence than others when clusters are compared.

Hierarchical Agglomerative Cluster Analysis

  • Starts with a pairwise similarity or dissimilarity matrix.
  • The most similar pair of samples joins first, and clusters are built up from there.
  • Groups are combined until there is one large group.
  • Represented in a dendrogram.

Types of Linkage in Hierarchical Clustering

  • Single Linkage:
    • Uses the smallest dissimilarity between any member of one group and any member of the other.
    • Also known as the nearest neighbor method.
    • Tends to produce chained clusters and elongated dendrograms.
  • Complete Linkage:
    • Uses the largest dissimilarity between any member of one group and any member of the other.
    • Sensitive to outliers.
  • Average Linkage:
    • Uses the mean dissimilarity over all pairs of members drawn from the two groups.
    • The most common form is UPGMA (Unweighted Pair Group Method with Arithmetic Averages).
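
The three linkage rules differ only in how they summarise the pairwise dissimilarities between two groups. The sketch below assumes a small made-up dissimilarity matrix and a hypothetical helper, `linkage_distance`; it is an illustration rather than a library API.

```python
def linkage_distance(D, group_a, group_b, method):
    """Dissimilarity between two groups of sample indices under one linkage rule."""
    pairs = [D[i][j] for i in group_a for j in group_b]
    if method == "single":    # nearest neighbor
        return min(pairs)
    if method == "complete":  # farthest neighbor
        return max(pairs)
    if method == "average":   # mean over all between-group pairs
        return sum(pairs) / len(pairs)
    raise ValueError(f"unknown linkage method: {method}")


# Hypothetical dissimilarities among four samples (indices 0 to 3).
D = [
    [0.0, 0.2, 0.6, 0.9],
    [0.2, 0.0, 0.5, 0.8],
    [0.6, 0.5, 0.0, 0.3],
    [0.9, 0.8, 0.3, 0.0],
]
group_a, group_b = [0, 1], [2, 3]
results = {
    method: linkage_distance(D, group_a, group_b, method)
    for method in ("single", "complete", "average")
}
print(results)
```

For the same two groups, single linkage reports the smallest between-group value, complete linkage the largest, and average linkage a value in between, which is why the three methods can produce different dendrograms from one matrix.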

UPGMA (Unweighted Pair Group Method with Arithmetic Averages)

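  • Joins clusters on the average of all pairwise dissimilarities between their members.
  • Easiest to understand by working through the steps by hand.
  • Weighted pair group methods (WPGMA) can also be used; they give the two merged groups equal weight regardless of how many samples each contains.
  • The unweighted pair group method using centroids (UPGMC) joins clusters on the distance between their centroids.
  • Ward's minimum variance: Forms clusters by minimizing the within-cluster sum of squares; tends to give similarly sized clusters.
    • Requires Euclidean distance.
    • Single, complete, and average linkage can use any dissimilarity measure.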
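
Written as update rules, the difference between these methods may be easier to see. The notation below is not in the original notes and is introduced only for illustration: clusters i and j, containing n_i and n_j samples, are merged; d_{ik} is the dissimilarity between cluster i and any other cluster k; and \bar{x}_i is the centroid of cluster i.

```latex
% UPGMA: the new dissimilarity is weighted by cluster size,
% so every original sample counts equally.
\[ d_{(ij)k} = \frac{n_i\, d_{ik} + n_j\, d_{jk}}{n_i + n_j} \]

% WPGMA: the two merged clusters count equally, whatever their sizes.
\[ d_{(ij)k} = \frac{d_{ik} + d_{jk}}{2} \]

% Ward: merge the pair giving the smallest increase in the
% within-cluster sum of squares (Euclidean distance).
\[ \Delta SS_{ij} = \frac{n_i\, n_j}{n_i + n_j}\, \lVert \bar{x}_i - \bar{x}_j \rVert^2 \]
```

When n_i = n_j the UPGMA and WPGMA rules give the same value; they diverge only once clusters of unequal size are merged.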

Example of UPGMA

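  • Start from a matrix of samples by species.
  • Calculate Bray-Curtis similarities between all pairs of samples.
  • Identify the highest similarity to form the first cluster.
  • Calculate the similarity of the new cluster to each remaining sample by averaging the similarities of its members.
  • At later steps, weight the averages by cluster size (proportional averaging) and repeat until all samples are joined.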
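
The procedure above can be sketched in plain Python. This is a minimal illustration under stated assumptions: it works on dissimilarities (1 minus similarity) rather than similarities, the matrix values and sample labels are made up, and `upgma` is a hypothetical function name. It recomputes each average from the original pairwise values, which gives the same result as the size-weighted (proportional) averaging described in the steps.

```python
def upgma(D, labels):
    """Agglomerate clusters by average linkage; return the merges in order."""
    # Each active cluster id maps to the original sample indices it contains.
    clusters = {i: [i] for i in range(len(labels))}
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest average dissimilarity.
        best = None
        ids = sorted(clusters)
        for pos, a in enumerate(ids):
            for b in ids[pos + 1:]:
                pairs = [D[i][j] for i in clusters[a] for j in clusters[b]]
                dist = sum(pairs) / len(pairs)
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        dist, a, b = best
        merges.append((
            sorted(labels[i] for i in clusters[a]),
            sorted(labels[i] for i in clusters[b]),
            dist,
        ))
        # Merge cluster b into cluster a.
        clusters[a] = clusters[a] + clusters.pop(b)
    return merges


# Hypothetical dissimilarities among four samples.
labels = ["A", "B", "C", "D"]
D = [
    [0.0, 0.2, 0.4, 0.9],
    [0.2, 0.0, 0.5, 0.8],
    [0.4, 0.5, 0.0, 0.6],
    [0.9, 0.8, 0.6, 0.0],
]
merges = upgma(D, labels)
for left, right, height in merges:
    print(left, "+", right, "at", round(height, 4))
```

In this made-up example A and B join first, C then joins the A-B pair, and D joins last. The final height is the mean of three pairwise values, which equals the proportional average (2 × 0.85 + 1 × 0.6) / 3. The merge heights are the levels at which branches would meet in a dendrogram.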

Non-Hierarchical Clustering

  • Uses iterative processes and randomization.
  • Samples are reassigned among clusters until the grouping stops improving.
  • K-means clustering is a common method, where K is the number of clusters, chosen in advance.
  • Cluster membership is evaluated against a defined criterion, such as distance to the cluster centroid.
  • K-means assumes metric data and Euclidean distance.
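
A bare-bones K-means can make the iterative reassignment concrete. The sketch below assumes the standard Lloyd's algorithm with random starting centroids drawn from the data; the function names, the optional `init` argument (a convenience for reproducible runs), and the example points are all hypothetical.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def kmeans(points, k, init=None, iterations=100, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid-update steps."""
    rng = random.Random(seed)
    # Random starting centroids unless explicit ones are supplied.
    centroids = list(init) if init is not None else rng.sample(points, k)
    labels = None
    for _ in range(iterations):
        # Assignment step: each sample joins its nearest centroid.
        new_labels = []
        for p in points:
            distances = [sq_dist(p, c) for c in centroids]
            new_labels.append(distances.index(min(distances)))
        if new_labels == labels:
            break  # no sample switched group, so the solution is stable
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


# Hypothetical two-dimensional data with two visible groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels, centroids = kmeans(points, k=2, seed=0)
print(labels)
print(centroids)
```

Because the starting centroids are random, different seeds can converge to different solutions, so in practice the algorithm is usually run from several starts and the result with the lowest within-cluster sum of squares is kept.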

Determining the Number of Clusters (K)

  • Elbow Plot (Scree Plot):
    • Plots the within-cluster sum of squares against the number of clusters.
    • Look for the elbow, where adding clusters stops reducing the sum of squares sharply, to choose K.
  • Calinski-Harabasz Criterion:
    • CH = \frac{SS_B / (K - 1)}{SS_W / (n - K)}, where SS_B and SS_W are the between-cluster and within-cluster sums of squares, n is the number of samples, and K is the number of clusters.
    • This is the ratio of between-cluster variance to within-cluster variance.
    • Higher values indicate distinct, well-separated clusters.
    • The optimal K is the peak on the Calinski-Harabasz plot.
  • Silhouette Width:
    • Compares how similar each object is to its own cluster with how similar it is to the nearest other cluster; values near 1 indicate a good fit.
  • Gap Statistic:
    • Compares the total within-cluster variation for each K with the value expected under a null reference distribution that has no cluster structure.
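
Three of these criteria can be computed directly from a set of points and their cluster labels. The sketch below is illustrative only: the function names and data are hypothetical, the silhouette uses Euclidean distance, and at least two clusters are assumed. The gap statistic is omitted because it needs simulated reference data.

```python
def centroid(points):
    """Mean position of a set of points."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def within_ss(points, labels):
    """Total within-cluster sum of squares (the quantity on an elbow plot)."""
    total = 0.0
    for c in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == c]
        centre = centroid(members)
        total += sum(sq_dist(p, centre) for p in members)
    return total


def calinski_harabasz(points, labels):
    """(Between SS / (K - 1)) / (Within SS / (n - K)); needs 2 <= K < n."""
    n, k = len(points), len(set(labels))
    grand = centroid(points)
    between = 0.0
    for c in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == c]
        between += len(members) * sq_dist(centroid(members), grand)
    return (between / (k - 1)) / (within_ss(points, labels) / (n - k))


def mean_silhouette(points, labels):
    """Average of s = (b - a) / max(a, b) over all samples; needs K >= 2."""
    scores = []
    for i, p in enumerate(points):
        dists = {}  # cluster label -> distances from p to that cluster's members
        for j, q in enumerate(points):
            if i != j:
                dists.setdefault(labels[j], []).append(sq_dist(p, q) ** 0.5)
        own = dists.pop(labels[i], None)
        if not own:
            scores.append(0.0)  # singleton cluster: silhouette is taken as 0
            continue
        a = sum(own) / len(own)                             # mean distance within own cluster
        b = min(sum(d) / len(d) for d in dists.values())    # nearest other cluster
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Hypothetical data with one sensible and one poor two-cluster labelling.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
good = [0, 0, 0, 1, 1, 1]
poor = [0, 1, 0, 1, 0, 1]
for name, labels in (("good", good), ("poor", poor)):
    print(name, within_ss(points, labels),
          calinski_harabasz(points, labels), mean_silhouette(points, labels))
```

To choose K, the same functions would be evaluated on the clusterings obtained for a range of K values and the results plotted: the elbow in the within-cluster sum of squares, or the peak in the Calinski-Harabasz or silhouette curve, suggests the number of clusters.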

Applications and Limitations of Cluster Analysis

  • Useful for classifying things like soil samples, species, vegetation communities.
  • Often less useful in environmental science and ecology, because it forces every sample into a discrete group.
  • Ordination methods are better suited to clinal data and gradients.