Cluster Analysis Notes

Multivariate Data and Analysis

  • Assumption-laden and highly accessible - “Pandora’s box or panacea”.
  • Based on the extent to which samples share species at comparable abundances.
  • All use similarity coefficients calculated between each pair of samples.
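
As a minimal sketch of what "a coefficient calculated between each pair of samples" means, the R snippet below computes Bray-Curtis dissimilarities by hand. The abundance table and the helper name bray_curtis are hypothetical and purely illustrative; they do not come from the examples in these notes.

```r
# Hypothetical abundance table: rows are samples, columns are species.
abund <- matrix(c(10,  0, 5,
                   8,  2, 5,
                   0, 12, 1),
                nrow = 3, byrow = TRUE,
                dimnames = list(c("s1", "s2", "s3"), c("spA", "spB", "spC")))

# Bray-Curtis dissimilarity between two abundance vectors
# (illustrative helper, not a library function).
bray_curtis <- function(x, y) sum(abs(x - y)) / sum(x + y)

# Fill a square matrix with the coefficient for every pair of samples.
n <- nrow(abund)
bc <- matrix(0, n, n, dimnames = list(rownames(abund), rownames(abund)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    bc[i, j] <- bray_curtis(abund[i, ], abund[j, ])
  }
}

# Convert to a "dist" object, the form that hclust() expects.
d <- as.dist(bc)
print(round(d, 3))
```

Samples s1 and s2 share two species at similar abundances and so get a small dissimilarity; s3 is dominated by a different species and sits far from both.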

Major Multivariate Approaches:

  • Clustering into groups.
  • Ordination (Mapped).

Clustering Analysis

The Question

  • Do samples associate with each other into groups?
  • Consider:
    • Finding groupings among “species” generated by using morphometric or genetic data.
    • Looking for groupings among sites based on species, soil characteristics, etc.

Examples

  • Genetic Structure within Rare Plant
    • Rare annual plant from California (Clarkia springvillensis).
    • Three main populations, each with subpopulations: Gauging Station (GS), Springville Clarkia Ecological Reserve (SCER) and Bear Creek (BR).
    • Used Cavalli-Sforza chord genetic distances.
    • Analysis techniques: Cluster analysis, Hypothesis testing, nMDS, ANOSIM, SIMPER.
  • Patterns of bird assemblages in southeastern Australia
    • Birds counted in different vegetation types in Victoria.
    • Is there a pattern in bird assemblages for different vegetation types?
    • Example locations: Gippsland Manna Gum, River Red Gum, Box – Ironbark.
    • Analysis techniques: Cluster analysis, Hypothesis testing, nMDS, ANOSIM, SIMPER.

Clustering - the Output

  • Examples of cluster outputs include:
    • Lists of cities grouped together.
    • Molecular structures clustered into groups.
    • Soil types clustered based on characteristics.

Clustering (Classifying)

  • Aims to find “natural groupings”.
  • Samples within a group are more similar to each other than samples in different groups.
  • Usually results in a dendrogram.
  • Questions to consider:
    • Imposing a structure on the data or revealing the structure which actually exists?
  • Exploratory data analysis/hypothesis generation.
  • Common in taxonomy, less so in ecology.
  • The method must be numerical, and the final number of classes is not known in advance – compare with discriminant function analysis, where the groups are defined beforehand.

Clustering - The Process/Meaning

  • Generate distance matrix.
  • Choose clustering approach:
    • Agglomerative or divisive.
    • Hierarchical or non-hierarchical.
    • Weighting mean differences or not weighting.
  • Look for groupings.

Ways of Doing Cluster Analysis

  • Hundreds of methods exist.
  • Most methods are sequential.
  • Each applies a sequence of operations that either joins objects into ever larger groups (agglomerative) or splits groups into ever smaller ones (divisive).
  • Hierarchical and non-hierarchical.
  • In ecology and systematics, cluster analysis is usually “agglomerative hierarchical”; agriculture makes more use of non-hierarchical methods.

Hierarchical Agglomerative Cluster Analysis

  • Start with a pairwise similarity matrix among objects (individuals, sites, populations, taxa).
  • The two most similar objects are joined into a group.
  • Similarities of new groups to all others are calculated.
  • Two closest groups are combined repeatedly until one group remains.
  • The result is a 2-dimensional representation in a dendrogram.
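
The joining sequence can be inspected directly in R. The sketch below uses five made-up one-dimensional measurements (the object names and values are assumptions for illustration only) and group-average linkage, then prints the order of the joins and the distance at which each one happened.

```r
# Hypothetical 1-D measurements for five objects (illustrative values only).
x <- c(a = 1, b = 2, c = 6, d = 7.5, e = 15)
d <- dist(x)

# Group-average (UPGMA) agglomeration.
hc <- hclust(d, method = "average")

# Each row of hc$merge is one join: negative numbers are single objects,
# positive numbers refer to groups formed at an earlier step.
print(hc$merge)

# Distance at which each join happened (the heights in the dendrogram).
print(hc$height)
```

With n objects there are always n - 1 joins, and the last one combines everything into a single group.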

Types of Hierarchical Agglomerative Cluster Analysis

  • Single Linkage (nearest neighbour).
  • Complete Linkage (furthest neighbour).
  • Average Linkage (group average or mean):
    • Unweighted pair-groups method using arithmetic averages (UPGMA) – common!
    • Weighted pair-groups method using arithmetic averages (WPGMA).
    • Unweighted pair-groups method using arithmetic centroids (UPGMC).
  • Ward’s Minimum Variance Clustering.
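
A rough way to compare these linkages in R is sketched below. It assumes the built-in iris measurements as stand-in data and uses the method names that the hclust() help page gives for each linkage ("average" for UPGMA, "mcquitty" for WPGMA, "centroid" for UPGMC, "ward.D2" for Ward). The cophenetic correlation is only one possible yardstick, chosen here for illustration.

```r
# Illustrative data: the four numeric columns of the built-in iris set.
d <- dist(scale(iris[, 1:4]))

# hclust() method names for the linkages listed above.
methods <- c(single = "single", complete = "complete",
             UPGMA = "average", WPGMA = "mcquitty",
             UPGMC = "centroid", Ward = "ward.D2")

# Cophenetic correlation: how closely the distances implied by each
# dendrogram track the original pairwise distances.
coph <- sapply(methods, function(m) {
  hc <- hclust(d, method = m)
  cor(d, cophenetic(hc))
})
print(round(coph, 3))
```

The hclust() help page notes that centroid linkage is intended for squared Euclidean distances, so its value here should be read with caution.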

Interpreting a Dendrogram

  • What’s a meaningful grouping?
  • An example distance calculation:

    ((2&3) + (3&4))/2 = (67.9 + 42)/2 = 54.95 ≈ 55

UPGMA

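  • “Unweighted pair-groups method using arithmetic averages”.

How to do hierarchical clustering in R

# Find the distance matrix. dist() computes Euclidean distances by default;
# for Bray-Curtis, vegan::vegdist(data, method = "bray") is one option.
d <- dist(as.matrix(data))
# Apply hierarchical clustering; method = "average" requests UPGMA
# (the hclust() default is complete linkage).
hc <- hclust(d, method = "average")
# Plot the dendrogram
plot(hc)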
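
Once a dendrogram exists, one way to turn it into explicit groups is to cut it, either into a chosen number of groups or at a chosen height. The sketch below assumes the built-in iris data as a stand-in; the values k = 3 and h = 2 are arbitrary illustrative choices, not recommendations.

```r
# Illustrative data; any numeric table could take its place.
d <- dist(scale(iris[, 1:4]))
hc <- hclust(d, method = "average")

# Cut the tree into a chosen number of groups ...
groups_k <- cutree(hc, k = 3)
print(table(groups_k))

# ... or cut at a chosen height on the dendrogram's distance axis.
groups_h <- cutree(hc, h = 2)
print(table(groups_h))

# Draw the dendrogram and outline the k = 3 groups.
plot(hc, labels = FALSE)
rect.hclust(hc, k = 3)
```

Different cut points give different groupings from the same tree, which is one reason the question of what counts as a meaningful grouping has no automatic answer.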

Non-Hierarchical Clustering

  • Start with a single object and cluster other objects that are similar to that one.
  • Objects can be reassigned to clusters during the clustering process.

K-Means Clustering

  • A type of non-hierarchical clustering.
  • Splits objects into a pre-defined number (K) of clusters.
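  • Cluster membership is iteratively re-evaluated by some criterion, typically distance to the cluster means.

How to do K-means clustering in R

library(ggplot2)
# Cluster on petal length and width with K = 3; nstart = 20 tries 20 random starts
irisCluster <- kmeans(iris[, 3:4], centers = 3, nstart = 20)
# Cross-tabulate cluster membership against the known species
table(irisCluster$cluster, iris$Species)
# Plot the observations coloured by cluster
irisCluster$cluster <- as.factor(irisCluster$cluster)
ggplot(iris, aes(Petal.Length, Petal.Width, color = irisCluster$cluster)) + geom_point()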
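
To make the "iteratively re-evaluated" step concrete, here is a hand-rolled sketch of the basic assign-then-update loop. It is a simplified illustration under stated assumptions (random distinct starting centres, a fixed iteration cap, an arbitrary seed), not the algorithm that kmeans() uses by default.

```r
set.seed(1)  # arbitrary seed so the random start is repeatable
x <- as.matrix(iris[, 3:4])
k <- 3

# Start from k distinct observations chosen at random as initial centres.
ux <- unique(x)
centres <- ux[sample(nrow(ux), k), , drop = FALSE]

for (iter in 1:100) {
  # Squared distance from every object to every centre (n x k matrix).
  d2 <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centres[j, ])^2))
  # Assign each object to its nearest centre.
  cluster <- max.col(-d2, ties.method = "first")
  # Recompute each centre as the mean of its current members;
  # a cluster left without members keeps its previous centre.
  new_centres <- centres
  for (j in seq_len(k)) {
    members <- x[cluster == j, , drop = FALSE]
    if (nrow(members) > 0) new_centres[j, ] <- colMeans(members)
  }
  # Stop once the centres no longer move.
  if (all(abs(new_centres - centres) < 1e-9)) break
  centres <- new_centres
}

# Within-group sum of squares for the final assignment.
wss <- sum((x - centres[cluster, ])^2)
print(table(cluster))
print(wss)
```

Objects can change cluster from one pass to the next, which is the reassignment that distinguishes this approach from hierarchical joining.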

How Many Clusters?

  • By looking at a plot of the number of clusters against the within-groups sum of squares (WSS; dispersion in the groups around their respective means) – like a scree plot.
  • Calinski-Harabasz criterion.
  • Many others!

How Many Clusters in R?

mydata <- iris[, 3:4]
# WSS for a single cluster is the total sum of squares
wss <- (nrow(mydata) - 1) * sum(apply(mydata, 2, var))
# WSS for 2 to 15 clusters
for (i in 2:15) wss[i] <- sum(kmeans(mydata, centers = i, nstart = 20)$withinss)
# Interpret like a scree plot – see lab
plot(1:15, wss, type = "b", xlab = "Number of Clusters", ylab = "Within groups sum of squares")
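
The Calinski-Harabasz criterion can be computed from the same kmeans() output. The sketch below assumes the petal measurements used above and an arbitrary range of 2 to 10 clusters; the seed is fixed only so the random starts are repeatable.

```r
mydata <- iris[, 3:4]
n <- nrow(mydata)
set.seed(42)  # kmeans() uses random starts; fix the seed for repeatability

ch <- sapply(2:10, function(k) {
  km <- kmeans(mydata, centers = k, nstart = 20)
  # CH = [between-group SS / (k - 1)] / [within-group SS / (n - k)]
  (km$betweenss / (k - 1)) / (km$tot.withinss / (n - k))
})
names(ch) <- 2:10
print(round(ch, 1))

# Under this criterion the number of clusters with the largest value is preferred.
best_k <- as.integer(names(ch)[which.max(ch)])
print(best_k)
```

Like the scree-style plot, this is a guide rather than a test, and different criteria can point to different numbers of clusters.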

Cluster Analysis – A Summary

  • Cluster analysis may be useful in a number of circumstances (e.g., taxonomy) BUT there are generally better options for environmental scientists.
  • Clustering is less useful (and potentially misleading) when there is steady gradation across sites/variables of interest.
  • Even for strongly grouped samples, there are other ways of representing groups.
  • Ordination and hypothesis tests are better options for well-designed work.