Multivariate Analysis: Distance and Dissimilarity Measures
Introduction to Distance and Dissimilarity Measures
Many multivariate analyses are built on distance and dissimilarity measures.
Principal components analysis (PCA) indirectly uses Euclidean distance measures.
These measures are crucial for analyzing data and forming groupings, which leads to cluster analysis.
Distance matrices can also be called dissimilarity matrices; sometimes similarity matrices are used.
Overview of the Next Few Lectures
Concentrate on distance/dissimilarity matrices and similarity matrices.
Discuss transformations and standardizations of data.
Cover a range of cluster analysis methods, focusing on hierarchical and non-hierarchical clustering.
Emphasis on identifying and describing patterns, similar to PCA.
Key Concepts and Hypothesis Testing
Focus on describing patterns rather than hypothesis testing in PCA and clustering.
Hypothesis testing will be covered in more detail next week.
Resources and Practice
Recommended readings: Quinn & Keough (both old and new editions).
Practice examples and interpret data using available online resources.
Utilize datasets from both old and new editions for lectures and labs.
Examples in the lab build upon the background provided in the lecture.
Why Use Distance/Dissimilarity Matrices?
Analyzing multivariate data, like invertebrate species at different sites, necessitates considering interrelationships among variables.
Analyzing each variable separately inflates the risk of Type I errors and ignores species interactions; a multivariate approach avoids both problems.
Simplify data into a more manageable format.
Applications of Distance and Dissimilarity Matrices
Clustering: Form groups from distance/dissimilarity matrices to classify data.
Ordination: Rearrange data in multi-dimensional space to produce a map (non-metric multi-dimensional scaling).
Statistical Testing: Use ANOSIM or PERMANOVA to test for differences between groups.
Applicable to both natural experiments and manipulative experiments.
Defining Distance and Dissimilarity
Evaluate species composition or morphology of organisms.
Determine how different or alike samples are using distance and dissimilarity measures.
Dissimilarity is often used for non-metric data, while distance is used for metric data.
Matrices are often visualized in two or three-dimensional space.
Similarity vs. Dissimilarity Matrices
Dissimilarity matrices on a percentage scale (e.g., Bray-Curtis) are bounded by 0 and 100: 0 means identical, 100 means nothing in common.
Similarity matrices: 100 means identical, 0 means nothing in common.
Confusing similarity and dissimilarity can lead to incorrect results.
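On the 0 to 100 scale described above, a similarity and the matching dissimilarity are complements of each other. A minimal Python sketch, assuming a percentage-scaled measure (the function names are illustrative and do not come from any package):

```python
def to_similarity(dissimilarity):
    """Convert a 0-100 dissimilarity to the matching 0-100 similarity."""
    if not 0 <= dissimilarity <= 100:
        raise ValueError("expected a value on the 0-100 scale")
    return 100 - dissimilarity


def to_dissimilarity(similarity):
    """Convert a 0-100 similarity to the matching 0-100 dissimilarity."""
    if not 0 <= similarity <= 100:
        raise ValueError("expected a value on the 0-100 scale")
    return 100 - similarity


# Identical samples: dissimilarity 0 corresponds to similarity 100.
print(to_similarity(0))      # 100
# Nothing in common: similarity 0 corresponds to dissimilarity 100.
print(to_dissimilarity(0))   # 100
```

Reading a value of 10 as a similarity when it is a dissimilarity would turn a nearly identical pair into a nearly unrelated one, which is why the two must not be confused.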
Mutual Absences
Mutual absences: when a species is absent from both samples being compared.
Mutual absences can cause statistical and biological problems.
Linking sites based on mutual absences may not be biologically meaningful in ecological studies.
Example: Antarctica and the tropics both lacking emus doesn't make them similar.
In habitat data, mutual zeros may be important (e.g., 0% litter cover).
Simple Example and Dissimilarity Matrix
Link sites together based on species composition.
A dissimilarity matrix indicates how close sites are in multivariate space.
Lower numbers indicate closer proximity in multivariate space.
Example numbers: Sites 2 and 4 are most similar; sites 1 and 3 are most dissimilar.
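Reading a dissimilarity matrix can be sketched in Python as follows. The values below are hypothetical, invented only so that they reproduce the pattern described in the notes (sites 2 and 4 closest, sites 1 and 3 farthest apart); they are not the lecture's numbers.

```python
# Hypothetical dissimilarities (0-100) between four sites. These values are
# invented for illustration; only the pairwise pattern matches the notes.
dissimilarity = {
    (1, 2): 60, (1, 3): 90, (1, 4): 55,
    (2, 3): 70, (2, 4): 10,
    (3, 4): 65,
}


def extreme_pairs(matrix):
    """Return the (most similar, most dissimilar) pairs of sites."""
    closest = min(matrix, key=matrix.get)    # lowest dissimilarity
    farthest = max(matrix, key=matrix.get)   # highest dissimilarity
    return closest, farthest


closest, farthest = extreme_pairs(dissimilarity)
print("most similar:", closest)       # most similar: (2, 4)
print("most dissimilar:", farthest)   # most dissimilar: (1, 3)
```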
Calculating Distances and Dissimilarities
Use metric measurements for continuous or ratio data.
Use non-metric measurements for ordinal, nominal, or count data.
Euclidean Distance
Metric distance used in principal components analysis.
Based on Pythagoras' theorem (Euclidean geometry).
$$d = \sqrt{\sum_{i} (x_i - y_i)^2}$$
Where $x_i$ and $y_i$ are the values of the ith variable for the two points being compared.
Samples with exactly the same values have a distance of zero; the measure is bounded below by zero and has no upper limit.
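As a minimal sketch, Euclidean distance is the square root of the summed squared differences across variables. The Python function below is illustrative (the name `euclidean_distance` is an assumption, not part of any package used in the course):

```python
import math


def euclidean_distance(x, y):
    """Euclidean distance between two samples measured on the same variables."""
    if len(x) != len(y):
        raise ValueError("samples must have the same number of variables")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


# Two variables: the familiar 3-4-5 right triangle from Pythagoras' theorem.
print(euclidean_distance([0, 0], [3, 4]))          # 5.0
# Samples with exactly the same values are at distance zero.
print(euclidean_distance([2, 7, 1], [2, 7, 1]))    # 0.0
```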
Metric vs. Non-Metric and Bray-Curtis
In two-dimensional space, Euclidean distance can be derived geometrically.
Non-metric dissimilarity measures cannot be derived geometrically in this way.
Metric methods use the absolute distances themselves, whereas non-metric methods use only their rank order.
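The difference between absolute distances and ranks can be sketched in Python. The pair labels and values are hypothetical; the point is only that a rank-based (non-metric) method sees the same information before and after any order-preserving rescaling.

```python
# Hypothetical dissimilarities between pairs of sites (illustrative values).
dissimilarity = {("A", "B"): 12.0, ("A", "C"): 80.0, ("B", "C"): 45.5}


def rank_pairs(matrix):
    """Rank pairs from most similar (rank 1) to least similar.

    Ties are not handled specially in this sketch.
    """
    ordered = sorted(matrix, key=matrix.get)
    return {pair: rank for rank, pair in enumerate(ordered, start=1)}


ranks = rank_pairs(dissimilarity)

# Squaring changes every absolute value but preserves the ordering,
# so the ranks used by a non-metric method stay the same.
squared = {pair: value ** 2 for pair, value in dissimilarity.items()}
print(ranks == rank_pairs(squared))   # True
```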
Bray-Curtis
A commonly used non-metric measurement in ecology.
Derived by botanists to avoid linking samples based on joint absences.
Also known as percentage dissimilarity.
Ignores joint absences; the result is driven mainly by the variables with high values (the most abundant species).
Suited for species abundance data.
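$$BC_{ij} = \frac{\sum_k |x_{ik} - x_{jk}|}{\sum_k (x_{ik} + x_{jk})}$$
Where:
$BC_{ij}$ is the Bray-Curtis dissimilarity between samples i and j
$x_{ik}$ is the abundance of species k in sample i
$x_{jk}$ is the abundance of species k in sample j
Multiplying by 100 expresses the result as a percentage dissimilarity.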
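Bray-Curtis Example
Calculate the absolute differences in abundance for species A-E between site 1 and site 2, and sum them.
Sum the abundances of each species across the two sites; a species absent from both sites adds zero to both sums, so it does not affect the result.
Divide the summed absolute differences by the summed abundances.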
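The arithmetic for two sites can be sketched in Python. The abundances for species A-E below are hypothetical (not the lecture's numbers), and `bray_curtis` is an illustrative helper; species E is absent from both sites to show that a joint absence leaves the result unchanged.

```python
def bray_curtis(x, y):
    """Bray-Curtis dissimilarity (0-1) between two samples of abundances."""
    numerator = sum(abs(a - b) for a, b in zip(x, y))   # summed |differences|
    denominator = sum(a + b for a, b in zip(x, y))      # summed abundances
    if denominator == 0:
        raise ValueError("both samples are empty")
    return numerator / denominator


# Hypothetical abundances of species A, B, C, D, E.
site1 = [10, 0, 5, 3, 0]
site2 = [6, 2, 5, 0, 0]

# Absolute differences: 4 + 2 + 0 + 3 + 0 = 9
# Summed abundances:   16 + 2 + 10 + 3 + 0 = 31
bc = bray_curtis(site1, site2)
print(round(100 * bc, 2))   # 29.03

# Dropping species E (absent from both sites) gives the same value,
# because the joint absence adds zero to numerator and denominator.
print(bray_curtis(site1[:4], site2[:4]) == bc)   # True
```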
Implementation in R
Use the vegan package to calculate a Bray-Curtis dissimilarity (or similarity) matrix.
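The course does this step in R. The sketch below is not vegan code; it is a plain Python illustration of what a pairwise Bray-Curtis matrix contains, assuming a small hypothetical sites-by-species table and illustrative function names.

```python
def bray_curtis(x, y):
    """Bray-Curtis dissimilarity (0-1) between two samples of abundances."""
    numerator = sum(abs(a - b) for a, b in zip(x, y))
    denominator = sum(a + b for a, b in zip(x, y))
    if denominator == 0:
        raise ValueError("both samples are empty")
    return numerator / denominator


def dissimilarity_matrix(samples):
    """Percentage Bray-Curtis dissimilarity for every pair of samples."""
    names = list(samples)
    return {
        (i, j): 100 * bray_curtis(samples[i], samples[j])
        for i in names
        for j in names
    }


# Hypothetical abundances of four species at three sites.
samples = {
    "site1": [10, 0, 5, 3],
    "site2": [6, 2, 5, 0],
    "site3": [0, 8, 0, 0],
}

matrix = dissimilarity_matrix(samples)
print(matrix[("site1", "site1")])            # 0.0 (identical)
print(matrix[("site1", "site3")])            # 100.0 (no species in common)
print(round(matrix[("site1", "site2")], 2))  # 29.03
```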
Choosing Metric vs. Non-Metric
PCA uses Euclidean distance, so principal component scores can be used in linear regressions or ANOVAs because they have the same metric properties.
Zeros in species data usually mean you do not want samples linked by joint absences.
In measurement data, joint absences may be important.
Standardization and Transformation
In univariate statistics, data are transformed to improve normality or to satisfy homogeneity of variance.
In multivariate data analysis, transformations are used for different reasons.
Non-metric analyses do not assume normality.
Transformations
Transformations can down-weight very common species by changing the scale of measurement.
Examples include square root, log (adding one to avoid log of zero), fourth-root, and presence/absence transformations.
$$\log(x + 1)$$
Where x is the original value; one is added because log(0) is undefined (it tends to negative infinity).
Examples of Transformations
Square root transformation makes the differences between numbers smaller.
Fourth-root transformation further reduces the emphasis on common species.
Presence/absence transformation converts all values to 1 (present) or 0 (absent).
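The effect of each transformation on a single sample can be sketched in Python. The abundances are hypothetical, the function names are illustrative, and the natural log is an assumption (the notes do not state a base).

```python
import math


def sqrt_transform(values):
    """Square root of each abundance."""
    return [math.sqrt(v) for v in values]


def fourth_root_transform(values):
    """Fourth root, computed as the square root of the square root."""
    return [math.sqrt(math.sqrt(v)) for v in values]


def log_transform(values):
    """log(x + 1) using the natural log (the base is an assumption)."""
    return [math.log1p(v) for v in values]


def presence_absence(values):
    """1 where the species is present, 0 where it is absent."""
    return [1 if v > 0 else 0 for v in values]


# Hypothetical abundances: absent, rare, moderate, very common.
abundances = [0, 1, 16, 10000]

print(sqrt_transform(abundances))          # [0.0, 1.0, 4.0, 100.0]
print(fourth_root_transform(abundances))   # [0.0, 1.0, 2.0, 10.0]
print(presence_absence(abundances))        # [0, 1, 1, 1]
```

In this sample the most common species is 625 times the moderate one on the raw scale, 25 times after a square root, 5 times after a fourth root, and equal to it after presence/absence, which is the progressive down-weighting described above.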
Transformation Considerations
Consider the context of the data and analysis.
Do not blindly apply transformations without justification.
Determine if you want dominant variables to dominate the analysis.
In exam questions, consider if the transformation is to increase linearity or to address the dominance of certain species.
Standardization
Standardize to make each species equally important or to make each sample equally important.
Useful for comparing samples of different sizes or with different sampling efforts.
Express values as proportions or relative to the maximum value.
Calculating Proportions
Divide each species' abundance by the sample total.
Useful when comparing the relative importance of species or samples.
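Both standardizations mentioned in these notes (proportions of the total, and values relative to the maximum) can be sketched in Python. The counts are hypothetical and the function names are illustrative.

```python
def to_proportions(sample):
    """Express each abundance as a proportion of the sample total."""
    total = sum(sample)
    if total == 0:
        raise ValueError("sample total is zero")
    return [v / total for v in sample]


def relative_to_max(values):
    """Express each value relative to the largest value."""
    largest = max(values)
    if largest == 0:
        raise ValueError("all values are zero")
    return [v / largest for v in values]


# Hypothetical samples with the same composition but a tenfold
# difference in sampling effort.
big_sample = [10, 30, 60]
small_sample = [1, 3, 6]

# After standardizing by the total, the two samples look identical.
print(to_proportions(big_sample) == to_proportions(small_sample))   # True

# Relative to the maximum, the most abundant species scores 1.0.
print(relative_to_max(big_sample)[-1])   # 1.0
```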
Important Considerations
Standardizations can change the interpretation of results.
Compare raw data analysis to standardized data analysis.
Transformation is usually better than standardization.
Summary
Understand dissimilarity and distance measures.
Know the difference between metric (Euclidean) and non-metric (Bray-Curtis) measures.
Understand the role of transformations in both metric and non-metric analyses.
Apply distance and dissimilarity matrices in clustering and ordination.
Clustering Analysis
Introduction to Cluster Analysis
Cluster analysis groups samples based on the extent to which they are similar.
Methods use distance or dissimilarity coefficients between samples (e.g., Euclidean or Bray-Curtis).
Can form groups or map samples in two- or three-dimensional space.
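How groups can be built from a dissimilarity matrix is sketched below in Python. This is one possible approach, single-linkage agglomerative clustering, chosen here as an assumption; the notes do not say which hierarchical method the course uses. The site numbers and dissimilarities are hypothetical.

```python
def single_linkage(dissimilarity, sites):
    """Agglomerative clustering that records each merge.

    dissimilarity maps frozenset({a, b}) to a value for every pair of sites.
    Returns a list of (cluster_a, cluster_b, linkage_level) merges.
    """
    clusters = [frozenset([s]) for s in sites]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: use the closest pair of members.
                level = min(
                    dissimilarity[frozenset([a, b])]
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if best is None or level < best[0]:
                    best = (level, i, j)
        level, i, j = best
        merges.append((set(clusters[i]), set(clusters[j]), level))
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Hypothetical dissimilarities (0-100) between four sites.
pairs = {(1, 2): 60, (1, 3): 90, (1, 4): 55, (2, 3): 70, (2, 4): 10, (3, 4): 65}
dissimilarity = {frozenset(pair): value for pair, value in pairs.items()}

merges = single_linkage(dissimilarity, [1, 2, 3, 4])
for left, right, level in merges:
    print(left, "+", right, "at", level)
# {2} + {4} at 10
# {1} + {2, 4} at 55
# {3} + {1, 2, 4} at 65
```

The sequence of merges and their levels is the information a dendrogram displays; the output is a description of structure in the data, not a test of it.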
Main Question in Clustering
Clustering is descriptive; it is not a statistical test.
Do samples form natural groupings?
May be used in taxonomy, genetics, ecology, soil science, etc.
Genetic Clusters
Analyses of genetic data often rest on different assumptions, such as those about mutation rates.