Clustering Methods
Clustering
- Clustering aims to find natural groupings within data.
- Samples within a group are more similar to each other than samples from different groups.
- Results are commonly shown as dendrograms (tree diagrams).
- Clustering imposes structure, which may not always be present in the data.
Applications of Clustering
- Classification of species (taxonomy).
- Classification of vegetation communities.
- Classification of soil types.
- Classification of areas for sampling.
- Less suitable for ecological communities that grade into one another through intermediate cases.
Clustering vs. Discriminant Function Analysis
- Clustering:
  - Identifies groups without predefined categories.
  - An unsupervised method: the data themselves define the groups.
- Discriminant Function Analysis:
  - Starts from predefined groups and determines what differentiates them.
  - A supervised method.
  - Not covered here, but similar to principal components analysis applied with predefined groups.
Steps in Clustering Analysis
- Generate distance or dissimilarity matrices.
- Choose a clustering approach.
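
The first step can be sketched in a few lines. The example below is a minimal illustration, assuming Bray-Curtis as the dissimilarity measure (the measure used in the UPGMA example later in these notes); the abundance values and the function names `bray_curtis` and `dissimilarity_matrix` are hypothetical.

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two abundance vectors (0 = identical)."""
    num = sum(abs(x - y) for x, y in zip(a, b))
    den = sum(x + y for x, y in zip(a, b))
    return num / den if den else 0.0


def dissimilarity_matrix(samples):
    """Pairwise dissimilarities between all samples, as a square list of lists."""
    n = len(samples)
    return [[bray_curtis(samples[i], samples[j]) for j in range(n)]
            for i in range(n)]


# Hypothetical abundance data: rows are samples, columns are species.
samples = [
    [10, 0, 5],
    [8, 2, 5],
    [0, 9, 1],
]

D = dissimilarity_matrix(samples)
for row in D:
    print([round(d, 3) for d in row])
```

The resulting matrix is symmetric with zeros on the diagonal, and it is the input that the clustering approaches below work from.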
Types of Clustering Approaches
- Agglomerative vs. Divisive.
  - Agglomerative: Builds clusters up by progressively joining samples.
  - Divisive: Starts with one big group and divides it.
- Hierarchical vs. Non-Hierarchical.
  - Hierarchical: Once a sample is in a group, it stays there.
  - Non-Hierarchical: Samples can switch groups during an iterative procedure.
- Weighted vs. Unweighted.
  - Weighted: Emphasizes certain groupings when clusters are combined; unweighted methods treat every sample equally.
Hierarchical Agglomerative Cluster Analysis
- Starts with a pairwise similarity or dissimilarity matrix.
- The most similar samples join first; further samples and groups are then added in order of decreasing similarity.
- Groups are combined until there is one large group.
- Represented in a dendrogram.
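
The procedure above can be sketched as a short loop. This is a minimal illustration, not an optimized implementation: the dissimilarity matrix is hypothetical, and the `linkage` argument is an assumed convenience that reduces the pairwise dissimilarities between two groups to a single number (`min` here, which corresponds to single linkage).

```python
def agglomerate(D, linkage=min):
    """Merge clusters until one remains and return the merge history.

    D is a symmetric dissimilarity matrix (list of lists).
    Each history entry is (left_cluster, right_cluster, merge_height).
    """
    clusters = [(i,) for i in range(len(D))]  # every sample starts alone
    history = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters with the smallest linkage dissimilarity.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage([D[i][j] for i in clusters[a] for j in clusters[b]])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        history.append((clusters[a], clusters[b], d))
        merged = tuple(sorted(clusters[a] + clusters[b]))
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return history


# Hypothetical dissimilarities between four samples (0-3).
D = [
    [0.0, 0.2, 0.6, 0.9],
    [0.2, 0.0, 0.5, 0.7],
    [0.6, 0.5, 0.0, 0.3],
    [0.9, 0.7, 0.3, 0.0],
]

history = agglomerate(D)
for left, right, height in history:
    print(left, "+", right, "at", height)
```

The merge heights in `history` are the levels at which branches join in the dendrogram. In practice a library routine would normally be used; this sketch only makes the order of operations explicit.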
Types of Linkage in Hierarchical Clustering
- Single Linkage:
  - Uses the smallest dissimilarity between members of two groups.
  - Also known as the nearest neighbor method.
  - Tends to produce chained clusters and elongated dendrograms.
- Complete Linkage:
  - Uses the largest dissimilarity between members of two groups.
  - Sensitive to outliers.
- Average Linkage:
  - Uses the mean of the dissimilarities between members of two groups.
  - The most common version is UPGMA (Unweighted Pair Group Method with Arithmetic Averages).
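
To make the difference between the three linkage rules concrete, the sketch below computes each one for the same two groups. The matrix values and the helper name `linkage_distance` are hypothetical and chosen only for illustration.

```python
from itertools import product


def linkage_distance(D, group_a, group_b, method):
    """Dissimilarity between two groups of sample indices under a linkage rule."""
    pairs = [D[i][j] for i, j in product(group_a, group_b)]
    if method == "single":      # nearest neighbor: smallest pairwise value
        return min(pairs)
    if method == "complete":    # furthest neighbor: largest pairwise value
        return max(pairs)
    if method == "average":     # mean of all pairwise values
        return sum(pairs) / len(pairs)
    raise ValueError(f"unknown linkage method: {method}")


# Hypothetical dissimilarities between four samples (0-3).
D = [
    [0.0, 0.2, 0.6, 0.9],
    [0.2, 0.0, 0.5, 0.7],
    [0.6, 0.5, 0.0, 0.3],
    [0.9, 0.7, 0.3, 0.0],
]

for method in ("single", "complete", "average"):
    print(method, round(linkage_distance(D, [0, 1], [2, 3], method), 3))
```

For groups {0, 1} and {2, 3} the pairwise values are 0.6, 0.9, 0.5 and 0.7, so single linkage reports the smallest, complete linkage the largest, and average linkage their mean.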
UPGMA (Unweighted Pair Group Method with Arithmetic Averages)
- Joins groups based on the average of all pairwise dissimilarities between their members.
- Best understood by working through the process step by step.
- Weighted pair group methods (WPGMA) can also be used; these weight some of the differences.
- The unweighted pair group centroid method (UPGMC) uses group centroids instead.
- Ward's minimum variance: Forms clusters by minimizing the within-cluster sum of squares; tends to produce similar-sized clusters.
  - Uses Euclidean distance.
  - The other methods can use any type of distance or dissimilarity.
Example of UPGMA
- Start from a matrix of samples by species.
- Compute Bray-Curtis similarities between all pairs of samples.
- Identify the highest similarity to form the first cluster.
- Form the next cluster by averaging the similarities between the new cluster and each remaining sample.
- Use proportional averaging (weighting by the number of samples in each cluster) to reach the final clusters.
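
The steps above can be traced with a small sketch. It assumes the Bray-Curtis similarities have already been computed; the similarity values (in percent), the sample labels A-D, and the function name `upgma` are hypothetical. The size-weighted update is the proportional averaging step.

```python
def upgma(S, labels):
    """UPGMA on a similarity matrix; returns merges as (a, b, similarity)."""
    sizes = {lab: 1 for lab in labels}
    sim = {}
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):
            sim[frozenset((labels[i], labels[j]))] = S[i][j]

    merges = []
    while len(sizes) > 1:
        # Join the pair of clusters with the highest similarity.
        a, b = sorted(max(sim, key=sim.get))
        s_ab = sim.pop(frozenset((a, b)))
        na, nb = sizes.pop(a), sizes.pop(b)
        new = f"({a},{b})"
        # Proportional averaging: weight each old similarity by cluster size.
        for c in list(sizes):
            s_ac = sim.pop(frozenset((a, c)))
            s_bc = sim.pop(frozenset((b, c)))
            sim[frozenset((new, c))] = (na * s_ac + nb * s_bc) / (na + nb)
        sizes[new] = na + nb
        merges.append((a, b, s_ab))
    return merges


# Hypothetical Bray-Curtis similarities (%) between samples A-D.
labels = ["A", "B", "C", "D"]
S = [
    [100, 80, 60, 20],
    [80, 100, 70, 30],
    [60, 70, 100, 40],
    [20, 30, 40, 100],
]

merges = upgma(S, labels)
for a, b, s in merges:
    print(a, "+", b, "at similarity", s)
```

With these values A and B join first at 80. C then joins (A,B) at (60 + 70) / 2 = 65, and D joins last at (2 × 25 + 1 × 40) / 3 = 30, which equals the plain average of the three original similarities between D and A, B, C.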
Non-Hierarchical Clustering
- Uses iterative processes and randomization.
- Samples are reassigned between clusters until the clustering stops improving.
- K-means clustering is a common method, where K is the number of clusters.
- Membership is evaluated against a defined criterion, such as distance to the cluster centroid.
- K-means assumes metric data and Euclidean distance.
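
The iterative reassignment can be sketched as follows. This is a minimal version of the standard assign-then-update loop for K-means, not a production implementation; the data points, the `seed` parameter, and the helper names are assumptions made for illustration.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def kmeans(points, k, n_iter=100, seed=0):
    """Return (labels, centroids) after iterating until assignments stop changing."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random starting centroids
    labels = None
    for _ in range(n_iter):
        # Assignment step: each sample joins its nearest centroid.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:  # no sample switched groups: converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    return labels, centroids


# Hypothetical 2-D data with two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]

labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

Because the starting centroids are random, the cluster numbering (which group is called 0 or 1) depends on the seed, and on less clearly separated data different starts can give different final groupings.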
Determining the Number of Clusters (K)
- Elbow Plot (Scree Plot):
  - Plots the within-cluster sum of squares against the number of clusters.
  - Look for the elbow, where adding more clusters stops giving a large reduction, to choose the number of clusters.
- Calinski-Harabasz Criterion:
  - $CH = \frac{\text{between-cluster variance}}{\text{within-cluster variance}}$
  - Uses the ratio of between-cluster variance to within-cluster variance.
  - Higher values mean distinct and well-separated clusters.
  - Optimal K is the peak on the Calinski-Harabasz plot.
- Silhouette Width:
  - Compares how similar each object is to its own cluster with how similar it is to the nearest other cluster.
- Gap Statistic:
  - Compares the total within-cluster variation for each number of clusters with the value expected under a null reference distribution.
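
Three of these criteria can be computed directly from a set of points and their cluster labels. The sketch below is illustrative only: the data and the two candidate groupings (K = 2 and K = 3) are hypothetical and written out by hand, the Calinski-Harabasz index is computed with the usual degrees of freedom (between-cluster sum of squares over K - 1, within-cluster sum of squares over n - K), and a sample alone in its cluster is given a silhouette width of 0 by convention. The gap statistic is not included because it requires simulating reference data.

```python
from math import dist


def centroid(points):
    return tuple(sum(col) / len(points) for col in zip(*points))


def groups(points, labels):
    out = {}
    for p, lab in zip(points, labels):
        out.setdefault(lab, []).append(p)
    return out


def wss(points, labels):
    """Within-cluster sum of squares (the quantity on an elbow plot)."""
    return sum(dist(p, centroid(m)) ** 2
               for m in groups(points, labels).values() for p in m)


def calinski_harabasz(points, labels):
    n, k = len(points), len(set(labels))
    grand = centroid(points)
    bss = sum(len(m) * dist(centroid(m), grand) ** 2
              for m in groups(points, labels).values())
    return (bss / (k - 1)) / (wss(points, labels) / (n - k))


def mean_silhouette(points, labels):
    g = groups(points, labels)
    widths = []
    for p, lab in zip(points, labels):
        own = g[lab]
        if len(own) == 1:
            widths.append(0.0)  # convention for single-member clusters
            continue
        a = sum(dist(p, q) for q in own) / (len(own) - 1)   # own cluster
        b = min(sum(dist(p, q) for q in m) / len(m)          # nearest other
                for key, m in g.items() if key != lab)
        widths.append((b - a) / max(a, b))
    return sum(widths) / len(widths)


# Hypothetical data: two tight groups, scored under two candidate groupings.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
candidates = {2: [0, 0, 0, 1, 1, 1], 3: [0, 0, 1, 2, 2, 2]}

for k, labels in candidates.items():
    print(k, round(wss(points, labels), 3),
          round(calinski_harabasz(points, labels), 1),
          round(mean_silhouette(points, labels), 3))
```

On this data the within-cluster sum of squares drops only slightly from K = 2 to K = 3, while both the Calinski-Harabasz index and the mean silhouette width are higher at K = 2, so all three point to two clusters.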
Applications and Limitations of Cluster Analysis
- Useful for classifying things such as soil samples, species, and vegetation communities.
- Often less useful for environmental scientists and ecologists because it imposes a rigid classification.
- Ordination methods are better for clinal data and gradients.