Multivariate Data and Analysis
- Assumption-laden and highly accessible - “Pandora’s box or panacea”.
- Based on extent to which samples share a species at comparable abundance.
- All use similarity coefficients calculated between each pair of samples.
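As a minimal sketch of what a pairwise similarity coefficient looks like, the R snippet below computes Bray-Curtis similarity between every pair of samples. The abundance matrix `abund` and the helper `bray_sim` are made up for illustration, and Bray-Curtis is only one of many possible coefficients.

```r
# Hypothetical abundance data: rows are samples, columns are species
abund <- matrix(c(10,  0, 5,
                   8,  2, 5,
                   0, 12, 1),
                nrow = 3, byrow = TRUE,
                dimnames = list(c("s1", "s2", "s3"), c("spA", "spB", "spC")))

# Bray-Curtis similarity (as a percentage) between two samples
bray_sim <- function(x, y) 100 * (1 - sum(abs(x - y)) / sum(x + y))

# Fill a symmetric sample-by-sample similarity matrix
n <- nrow(abund)
sim <- matrix(NA_real_, n, n,
              dimnames = list(rownames(abund), rownames(abund)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    sim[i, j] <- bray_sim(abund[i, ], abund[j, ])
  }
}

print(round(sim, 1))
```

Samples s1 and s2 share two species at similar abundance and so score high, while s3 is dominated by a different species and scores low against both.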
Major Multivariate Approaches:
- Clustering into groups.
- Ordination (mapping samples into a low-dimensional space).
Clustering Analysis
The Question
- Do samples associate with each other into groups?
- Consider:
- Finding groupings among “species” generated by using morphometric or genetic data.
- Looking for groupings among sites based on species, soil characteristics, etc.
Examples
- Genetic Structure within Rare Plant
- Rare annual plant from California (Clarkia springvillensis).
- Three main populations, each with subpopulations: Gauging Station (GS), Springville Clarkia Ecological Reserve (SCER) and Bear Creek (BR).
- Used Cavalli-Sforza chord genetic distances.
- Analysis techniques: Cluster analysis, Hypothesis testing, nMDS, ANOSIM, SIMPER.
- Patterns of bird assemblages in southeastern Australia
- Birds counted in different vegetation types in Victoria.
- Is there a pattern in bird assemblages for different vegetation types?
- Example locations: Gippsland Manna Gum, River Red Gum, Box – Ironbark.
- Analysis techniques: Cluster analysis, Hypothesis testing, nMDS, ANOSIM, SIMPER.
Clustering - the Output
- Examples of cluster outputs include:
- Lists of cities grouped together.
- Molecular structures clustered into groups.
- Soil types clustered based on characteristics.
Clustering (Classifying)
- Aims to find “natural groupings”.
- Samples within a group are more similar to each other than samples in different groups.
- Usually results in a dendrogram.
- Questions to consider:
- Imposing a structure on the data or revealing the structure which actually exists?
- Exploratory data analysis/hypothesis generation.
- Common in taxonomy, less so in ecology.
- Method must be numerical and the final number of classes is not known – compare with discriminant function analysis.
Clustering - The Process/Meaning
- Generate distance matrix.
- Choose clustering approach:
- Agglomerative or divisive.
- Hierarchical or non-hierarchical.
- Weighting mean differences or not weighting.
- Look for groupings.
Ways of Doing Cluster Analysis
- Hundreds of methods exist.
- Most methods are sequential.
- Each applies a sequence of operations that either merges objects into groups (agglomerative) or splits groups apart (divisive).
- Hierarchical and non-hierarchical.
- Cluster analysis is usually “agglomerative hierarchical” in ecology and systematics; agriculture makes more use of non-hierarchical methods.
Hierarchical Agglomerative Cluster Analysis
- Start with a pairwise similarity matrix among objects (individuals, sites, populations, taxa).
- Most similar joined into a group.
- Similarities of new groups to all others are calculated.
- Two closest groups are combined repeatedly until one group remains.
- The result is represented in two dimensions as a dendrogram.
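The joining sequence can be inspected directly in R. This is a sketch only: it assumes the built-in `USArrests` data (first six rows, standardised), Euclidean distance, and group-average linkage, none of which come from the examples above.

```r
# Built-in example data: six objects, four standardised variables
x <- scale(USArrests[1:6, ])

# Pairwise Euclidean distances among the six objects
d <- dist(x)

# Agglomerative clustering with group-average linkage
hc <- hclust(d, method = "average")

# Each row of hc$merge is one joining step: negative numbers are single
# objects, positive numbers refer to groups formed at an earlier step
print(hc$merge)

# Distance at which each join happened (the dendrogram heights)
print(hc$height)

plot(hc, main = "Agglomeration of six objects")
```

With six objects there are five joining steps, and the first join is always the closest pair in the distance matrix.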
Types of Hierarchical Agglomerative Cluster Analysis
- Single Linkage (nearest neighbour).
- Complete Linkage (furthest neighbour).
- Average Linkage (group average or mean):
- Unweighted pair-groups method using arithmetic averages (UPGMA) – common!
- Weighted pair-groups method using arithmetic averages (WPGMA).
- Unweighted pair-groups method using centroids (UPGMC).
- Ward’s Minimum Variance Clustering.
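A quick way to see how much the choice of linkage matters is to fit several methods to the same distance matrix. The sketch below assumes the built-in `USArrests` data and Euclidean distance. The mapping of list entries to `hclust()` method names ("mcquitty" for WPGMA, "centroid" for UPGMC, "ward.D2" for Ward) follows the `hclust()` help page, and the centroid method is given squared distances as in that page's own example.

```r
x <- scale(USArrests[1:10, ])
d <- dist(x)

# hclust() method names and the linkage types they correspond to
methods <- c(Single = "single", Complete = "complete", UPGMA = "average",
             WPGMA = "mcquitty", UPGMC = "centroid", Ward = "ward.D2")

# Fit every method to the same distances
# (centroid linkage is meant for squared Euclidean distances)
fits <- lapply(methods, function(m) {
  hclust(if (m == "centroid") d^2 else d, method = m)
})

# Draw the six dendrograms side by side for comparison
op <- par(mfrow = c(2, 3))
for (nm in names(fits)) plot(fits[[nm]], main = nm, xlab = "", sub = "")
par(op)
```

Single linkage tends to produce long chains, while complete linkage and Ward's method tend to produce compact groups, so the same data can suggest different groupings.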
Interpreting a Dendrogram
- What’s a meaningful grouping?
- An example distance calculation:
((2&3)+(3&4))/2 = (67.9 + 42)/2 = 54.95 ≈ 55
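One practical way to turn a dendrogram into groups is to cut it, either into a chosen number of groups or at a chosen height. The sketch below assumes the built-in `USArrests` data and works with distances rather than the similarities used in the calculation above. The cut height is an arbitrary illustrative choice, not a recommended threshold.

```r
x <- scale(USArrests[1:10, ])
hc <- hclust(dist(x), method = "average")

# Option 1: ask for a fixed number of groups
groups_k <- cutree(hc, k = 3)

# Option 2: cut at a chosen height on the dendrogram's distance axis
# (the median join height is an arbitrary choice for illustration)
cut_height <- median(hc$height)
groups_h <- cutree(hc, h = cut_height)

print(groups_k)
print(table(groups_h))

plot(hc)
abline(h = cut_height, lty = 2)  # show where the tree was cut
rect.hclust(hc, k = 3)           # outline the three groups from option 1
```

Whether either cut yields a meaningful grouping still depends on the question being asked; the code only makes the choice explicit.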
UPGMA
- “Unweighted pair-groups method using arithmetic averages”.
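To make the "arithmetic averages" part concrete, the sketch below checks that the height of the final UPGMA join equals the plain mean of all pairwise distances between the two groups being joined. The data (`USArrests`, first six rows) and Euclidean distance are assumptions for illustration.

```r
x <- scale(USArrests[1:6, ])
d <- dist(x)
hc <- hclust(d, method = "average")  # "average" is UPGMA in hclust()

# Group membership just before the final join: two groups
g <- cutree(hc, k = 2)

# All pairwise distances between members of the two groups
dm <- as.matrix(d)
between <- dm[g == 1, g == 2, drop = FALSE]

# Under UPGMA these two numbers should agree
upgma_by_hand <- mean(between)
final_height <- tail(hc$height, 1)
print(c(by_hand = upgma_by_hand, hclust = final_height))
```

Every pair of objects counts equally in the average, which is what "unweighted" refers to.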
How to do hierarchical clustering in R
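# Compute a distance matrix. Note: dist() defaults to Euclidean distance;
# for Bray-Curtis use vegan::vegdist(data, method = "bray") instead.
d <- dist(as.matrix(data))
# Apply hierarchical clustering (hclust() defaults to complete linkage;
# pass method = "average" for UPGMA)
hc <- hclust(d)
# Plot the dendrogram
plot(hc)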
Non-Hierarchical Clustering
- Start with a single object and cluster other objects that are similar to that one.
- Objects can be reassigned to clusters during the clustering process.
K-Means Clustering
- A type of non-hierarchical clustering.
- Splits objects into a pre-defined number (K) of clusters.
- Cluster membership is iteratively re-evaluated by some criterion.
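The iterative re-evaluation can be sketched by hand. The loop below is a bare-bones version of the standard assign-then-update scheme (often called Lloyd's algorithm), assuming the built-in `iris` petal measurements and squared Euclidean distance as the criterion. It is an illustration only; the built-in `kmeans()` function is the practical choice.

```r
set.seed(1)
x <- as.matrix(iris[, 3:4])
k <- 3

# Start from k randomly chosen observations as initial centres
centres <- x[sample(nrow(x), k), , drop = FALSE]

for (iter in 1:50) {
  # Assignment step: squared distance from every object to every centre
  d2 <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centres[j, ])^2))
  cluster <- max.col(-d2, ties.method = "first")  # index of nearest centre

  # Update step: move each centre to the mean of its current members
  new_centres <- centres
  for (j in seq_len(k)) {
    if (any(cluster == j)) {
      new_centres[j, ] <- colMeans(x[cluster == j, , drop = FALSE])
    }
  }

  # Stop once the centres no longer move
  if (all(abs(new_centres - centres) < 1e-9)) break
  centres <- new_centres
}

print(table(cluster, iris$Species))
```

Objects switch clusters whenever a different centre becomes their nearest one, which is the reassignment that hierarchical methods do not allow.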
How to do K-means clustering in R?
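library(ggplot2)  # needed for ggplot()
set.seed(42)      # k-means uses random starts; fix the seed for reproducibility
irisCluster <- kmeans(iris[, 3:4], centers = 3, nstart = 20)
# Cross-tabulate cluster membership against the known species
table(irisCluster$cluster, iris$Species)
irisCluster$cluster <- as.factor(irisCluster$cluster)
ggplot(iris, aes(Petal.Length, Petal.Width, color = irisCluster$cluster)) + geom_point()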
How Many Clusters?
- Plot the within-groups sum of squares (WSS; dispersion of the groups around their respective means) against the number of clusters and look for an elbow, as in a scree plot.
- Calinski-Harabasz criterion.
- Many others!
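The Calinski-Harabasz criterion can be computed directly from `kmeans()` output as the ratio of between-group to within-group dispersion, each divided by its degrees of freedom. The sketch below assumes the `iris` petal measurements and an arbitrary range of 2 to 8 clusters; larger values indicate better-separated groups.

```r
set.seed(42)
mydata <- iris[, 3:4]
n <- nrow(mydata)
k_values <- 2:8  # arbitrary range, chosen for illustration

# CH(k) = (between SS / (k - 1)) / (within SS / (n - k))
ch <- sapply(k_values, function(k) {
  fit <- kmeans(mydata, centers = k, nstart = 20)
  (fit$betweenss / (k - 1)) / (fit$tot.withinss / (n - k))
})
names(ch) <- k_values

print(round(ch, 1))

# The k with the largest criterion value is the suggested number of clusters
best_k <- as.integer(names(ch)[which.max(ch)])
print(best_k)
plot(k_values, ch, type = "b",
     xlab = "Number of Clusters", ylab = "Calinski-Harabasz criterion")
```

Different criteria can disagree, so the suggested value is best treated as one piece of evidence alongside the WSS plot.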
How Many Clusters in R?
mydata <- iris[, 3:4]
# With one cluster, the within-groups sum of squares equals the total sum of squares
wss <- (nrow(mydata) - 1) * sum(apply(mydata, 2, var))
for (i in 2:15) wss[i] <- sum(kmeans(mydata, centers = i, nstart = 20)$withinss)
# Interpret like a scree plot (see lab)
plot(1:15, wss, type = "b", xlab = "Number of Clusters", ylab = "Within groups sum of squares")
Cluster Analysis – A Summary
- Cluster analysis may be useful in a number of circumstances (e.g., taxonomy) BUT there are generally better options for environmental scientists.
- Clustering is less useful (and potentially misleading) when there is steady gradation across sites/variables of interest.
- Even for strongly grouped samples, there are other ways of representing groups.
- Ordination and hypothesis tests are better options for well-designed work.