Cluster Analysis Notes

Multivariate Data and Analysis

Assumption laden and highly accessible - “Pandora’s box or panacea”.
Based on extent to which samples share a species at comparable abundance.
All use similarity coefficients calculated between each pair of samples.
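
As a rough illustration of a pairwise similarity coefficient, the sketch below computes Bray-Curtis values between every pair of samples in base R. The abundance table `abund` and the helper `bray_curtis()` are hypothetical, made up for this example, and Bray-Curtis is only one of many possible coefficients.

```r
# Hypothetical abundance table: rows are samples, columns are species.
abund <- matrix(c(10, 0, 5,
                   8, 2, 5,
                   0, 9, 1),
                nrow = 3, byrow = TRUE,
                dimnames = list(c("s1", "s2", "s3"), c("spA", "spB", "spC")))

# Bray-Curtis dissimilarity between two samples: sum|x - y| / sum(x + y).
bray_curtis <- function(x, y) sum(abs(x - y)) / sum(x + y)

# Fill the pairwise matrix one pair of samples at a time.
n <- nrow(abund)
bc <- matrix(0, n, n, dimnames = list(rownames(abund), rownames(abund)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    bc[i, j] <- bray_curtis(abund[i, ], abund[j, ])
  }
}

# Express as percentage similarity (100 = identical composition).
similarity <- 100 * (1 - bc)
round(similarity, 1)
```

Samples s1 and s2 share two species at similar abundance, so they score as far more similar than either does to s3.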

Major Multivariate Approaches:

Clustering into groups.
Ordination (Mapped).

Clustering Analysis

The Question

Do samples associate with each other into groups?
Consider:
- Finding groupings among “species” generated by using morphometric or genetic data.
- Looking for groupings among sites based on species, soil characteristics, etc.

Examples

Genetic Structure within Rare Plant
- Rare annual plant from California (Clarkia springvillensis).
- Three main populations with subpopulations: Gauging Station (GS), Springville Clarkia Ecological Reserve (SCER), and Bear Creek (BR).
- Used Cavalli-Sforza chord genetic distances.
- Analysis techniques: Cluster analysis, Hypothesis testing, nMDS, ANOSIM, SIMPER.

Patterns of Bird Assemblages in Southeastern Australia
- Birds counted in different vegetation types in Victoria.
- Is there a pattern in bird assemblages for different vegetation types?
- Example locations: Gippsland Manna Gum, River Red Gum, Box – Ironbark.
- Analysis techniques: Cluster analysis, Hypothesis testing, nMDS, ANOSIM, SIMPER.

Clustering - the Output

Examples of cluster outputs include:
- Lists of cities grouped together.
- Molecular structures clustered into groups.
- Soil types clustered based on characteristics.

Clustering (Classifying)

Aims to find “natural groupings”.
Samples within a group are more similar to each other than samples in different groups.
Usually results in a dendrogram.
Questions to consider:
- Imposing a structure on the data or revealing the structure which actually exists?
Exploratory data analysis/hypothesis generation.
Common in taxonomy, less so in ecology.
Method must be numerical and the final number of classes is not known – compare with discriminant function analysis.

Clustering - The Process/Meaning

Generate distance matrix.
Choose clustering approach:
- Agglomerative or divisive.
- Hierarchical or non-hierarchical.
- Weighting mean differences or not weighting.
Look for groupings.

Ways of Doing Cluster Analysis

Hundreds of methods exist.
Most methods are sequential.
Each applies a sequence of operations that is either agglomerative (joining objects into groups) or divisive (splitting groups apart).
Hierarchical and non-hierarchical.
Cluster analysis is usually “agglomerative hierarchical” in ecology and systematics; agriculture makes more use of non-hierarchical methods.

Hierarchical Agglomerative Cluster Analysis

Start with a pairwise similarity matrix among objects (individuals, sites, populations, taxa).
Most similar joined into a group.
Similarities of new groups to all others are calculated.
Two closest groups are combined repeatedly until one group remains.
Two-dimensional representation as a dendrogram.
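
The merging steps above can be traced in R by inspecting what `hclust()` records. This is a minimal sketch on a hypothetical four-sample dissimilarity matrix (the values are invented for illustration), with group-average linkage assumed.

```r
# Hypothetical dissimilarities among four samples (not from the notes).
m <- matrix(c( 0, 2, 6, 10,
               2, 0, 5,  9,
               6, 5, 0,  4,
              10, 9, 4,  0),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("s", 1:4), paste0("s", 1:4)))
d <- as.dist(m)

# Group-average (UPGMA) agglomeration.
hc <- hclust(d, method = "average")

# Each row of merge is one join: negative numbers are single samples,
# positive numbers refer to groups formed in an earlier row.
hc$merge

# Dissimilarity at which each join happened.
hc$height
```

Here s1 and s2 join first (height 2), then s3 and s4 (height 4), and the two pairs join last at the mean of the four between-group values, (6 + 10 + 5 + 9)/4 = 7.5.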

Types of Hierarchical Agglomerative Cluster Analysis

Single Linkage (nearest neighbour).
Complete Linkage (furthest neighbour).
Average Linkage (group average or mean):
- Unweighted pair-groups method using arithmetic averages (UPGMA) – common!
- Weighted pair-groups method using arithmetic averages (WPGMA).
- Unweighted pair-groups method using arithmetic centroids (UPGMC).
Ward’s Minimum Variance Clustering.
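
One way to see how much the linkage rule matters is to run several of them on the same distance matrix. The sketch below assumes the built-in `iris` data and Euclidean distances purely for convenience; the `method` strings are those accepted by `hclust()`, matched to the names used in these notes.

```r
# Euclidean distances on a built-in data set (an assumption; any dist object works).
d <- dist(iris[, 1:4])

# Names on the left follow the notes; values are hclust() method strings.
methods <- c(Single = "single", Complete = "complete", UPGMA = "average",
             WPGMA = "mcquitty", UPGMC = "centroid", Ward = "ward.D2")

# One tree per linkage rule. Note: R's documentation recommends squared
# distances for "centroid"; plain distances are used here only for brevity.
trees <- lapply(methods, function(m) hclust(d, method = m))

# Cut every tree into three groups and compare the group sizes.
sizes <- sapply(trees, function(tr) tabulate(cutree(tr, k = 3), nbins = 3))
sizes
```

Each column gives the three group sizes for one linkage rule, so differences between columns show how the choice of method changes the grouping.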

Interpreting a Dendrogram

What’s a meaningful grouping?
An example distance calculation:

$((2\&3) + (3\&4))/2 = (67.9 + 42)/2 = 54.95 \approx 55$

UPGMA

“Unweighted pair-groups method using arithmetic averages”.

How to do hierarchical clustering in R

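# Find the distance matrix. dist() is Euclidean by default; for Bray-Curtis,
# use vegan::vegdist(data, method = "bray") instead.
d <- dist(as.matrix(data))
# Apply hierarchical clustering. hclust() defaults to complete linkage;
# pass method = "average" for UPGMA.
hc <- hclust(d)
# Plot the dendrogram
plot(hc)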
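
Once a dendrogram exists, groups can be read off it by cutting the tree. The sketch below uses `cutree()` on the built-in `iris` petal measurements; the choice of three groups and the cut height of 1.5 are arbitrary values picked for illustration, not recommendations.

```r
# Build a UPGMA tree on petal length and width (Euclidean distance assumed).
hc <- hclust(dist(iris[, 3:4]), method = "average")

# Cut the tree so that exactly three groups result.
groups_k <- cutree(hc, k = 3)

# Alternatively, cut at a chosen height on the dendrogram axis.
groups_h <- cutree(hc, h = 1.5)

# Compare the three-group solution with the known species labels.
table(groups_k, iris$Species)
```

Whether a cut is "meaningful" remains a judgment call; the cross-tabulation only shows how well one particular cut lines up with an external classification.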

Non-Hierarchical Clustering

Start with a single object and cluster other objects that are similar to that one.
Objects can be reassigned to clusters during the clustering process.

K-Means Clustering

A type of non-hierarchical clustering.
Splits objects into a pre-defined number ($K$) of clusters.
Cluster membership is iteratively re-evaluated by some criterion.
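
The iterative re-evaluation can be sketched by hand. The loop below is a simple Lloyd-style version, assuming squared Euclidean distance to the cluster mean as the criterion; note that R's own `kmeans()` defaults to the Hartigan-Wong algorithm, so this is an illustration of the idea rather than a copy of that function. The starting rows (1, 75, 150) are an arbitrary choice.

```r
x <- as.matrix(iris[, 3:4])
k <- 3

# Arbitrary starting centres: three rows of the data (an assumption).
centers <- x[c(1, 75, 150), , drop = FALSE]

for (iter in 1:50) {
  # Squared distance from every object to every centre (one column per centre).
  d2 <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))

  # Re-assign each object to its nearest centre.
  cluster <- max.col(-d2, ties.method = "first")

  # Recompute each centre as the mean of its members (keep old centre if empty).
  new_centers <- centers
  for (j in seq_len(k)) {
    if (any(cluster == j)) {
      new_centers[j, ] <- colMeans(x[cluster == j, , drop = FALSE])
    }
  }

  # Stop once the centres no longer move.
  if (all(abs(new_centers - centers) < 1e-8)) break
  centers <- new_centers
}

table(cluster)
```

Objects switch clusters from one pass to the next until assignments settle, which is the reassignment behaviour that separates this approach from hierarchical methods.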

How to do K-means clustering in R?

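library(ggplot2)  # needed for ggplot()
set.seed(20)  # k-means starts from random centres; fix the seed for repeatable results
# Cluster on petal length and width, K = 3, best of 20 random starts
irisCluster <- kmeans(iris[, 3:4], 3, nstart = 20)
# Cross-tabulate cluster membership against the known species
table(irisCluster$cluster, iris$Species)
irisCluster$cluster <- as.factor(irisCluster$cluster)
ggplot(iris, aes(Petal.Length, Petal.Width, color = irisCluster$cluster)) + geom_point()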

How Many Clusters?

By plotting the number of clusters against the within-groups sum of squares (dispersion in the groups around their respective means) and looking for an elbow, as in a scree plot.
Calinski-Harabasz criterion.
Many others!
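
The Calinski-Harabasz criterion can be computed directly from the output of `kmeans()`. A minimal sketch, assuming the usual definition (between-group sum of squares over K - 1, divided by within-group sum of squares over n - K) and trying K = 2 to 8 on the built-in `iris` petal data; the range of K is an arbitrary choice.

```r
set.seed(42)  # k-means uses random starts; fix the seed for repeatability
mydata <- iris[, 3:4]
n <- nrow(mydata)

# Calinski-Harabasz value for each candidate number of clusters.
ch <- sapply(2:8, function(k) {
  km <- kmeans(mydata, centers = k, nstart = 20)
  (km$betweenss / (k - 1)) / (km$tot.withinss / (n - k))
})
names(ch) <- 2:8
ch

# Larger values suggest better-separated, more compact clusters.
names(ch)[which.max(ch)]
```

Like the scree-style plot, this is a guide rather than a test; different criteria can point to different numbers of clusters.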

How Many Clusters in R?

mydata <- iris[, 3:4]
# With one cluster, the within-groups sum of squares is the total sum of squares
wss <- (nrow(mydata) - 1) * sum(apply(mydata, 2, var))
for (i in 2:15) wss[i] <- sum(kmeans(mydata, centers = i, nstart = 20)$withinss)
# Interpret like a scree plot (see lab)
plot(1:15, wss, type = "b", xlab = "Number of Clusters", ylab = "Within groups sum of squares")

Cluster Analysis – A Summary

Cluster analysis may be useful in a number of circumstances (e.g., taxonomy) BUT there are generally better options for environmental scientists.
Clustering is less useful (and potentially misleading) when there is steady gradation across sites/variables of interest.
Even for strongly grouped samples, there are other ways of representing groups.
Ordination and hypothesis tests are better options for well-designed work.