Multivariate Analysis and Cluster Analysis Notes
Multivariate Analysis: Distance and Dissimilarity Measures
Introduction to Distance and Dissimilarity Measures
- Many multivariate analyses are built on distance and dissimilarity measures.
- Principal components analysis (PCA) indirectly uses Euclidean distance measures.
- These measures are crucial for analyzing data and forming groupings, which leads to cluster analysis.
- Distance matrices can also be called dissimilarity matrices; sometimes similarity matrices are used.
Overview of the Next Few Lectures
- Concentrate on distance/dissimilarity matrices and similarity matrices.
- Discuss transformations and standardizations of data.
- Cover a range of cluster analysis methods, focusing on hierarchical and non-hierarchical clustering.
- Emphasis on identifying and describing patterns, similar to PCA.
Key Concepts and Hypothesis Testing
- Focus on describing patterns rather than hypothesis testing in PCA and clustering.
- Hypothesis testing will be covered in more detail next week.
Resources and Practice
- Recommended readings: Quinn & Keough (both old and new editions).
- Practice examples and interpret data using available online resources.
- Utilize datasets from both old and new editions for lectures and labs.
- Examples in the lab build upon the background provided in the lecture.
Why Use Distance/Dissimilarity Matrices?
- Analyzing multivariate data, like invertebrate species at different sites, necessitates considering interrelationships among variables.
- Testing each variable separately inflates the Type I error rate and ignores species interactions; a multivariate approach avoids both problems.
- Simplify data into a more manageable format.
Applications of Distance and Dissimilarity Matrices
- Clustering: Form groups from distance/dissimilarity matrices to classify data.
- Ordination: Rearrange data in multi-dimensional space to produce a map (e.g., non-metric multidimensional scaling).
- Statistical Testing: Use ANOSIM or PERMANOVA to test for differences between groups.
- Applicable to both natural experiments and manipulative experiments.
Defining Distance and Dissimilarity
- Evaluate species composition or morphology of organisms.
- Determine how different or alike samples are using distance and dissimilarity measures.
- Dissimilarity is often used for non-metric data, while distance is used for metric data.
- Matrices are often visualized in two or three-dimensional space.
Similarity vs. Dissimilarity Matrices
- Percentage dissimilarity matrices (e.g., Bray-Curtis) are bounded by 0 and 100: 0 means identical, 100 means nothing in common.
- Similarity matrices: 100 means identical, 0 means nothing in common.
- Confusing similarity and dissimilarity can lead to incorrect results.
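The relationship between the two scales can be sketched as follows, assuming both are expressed as percentages from 0 to 100. The function names are illustrative and do not come from any package.

```python
def to_dissimilarity(similarity):
    """Convert a percentage similarity (0-100) to a percentage dissimilarity."""
    if not 0 <= similarity <= 100:
        raise ValueError("similarity must lie between 0 and 100")
    return 100 - similarity


def to_similarity(dissimilarity):
    """Convert a percentage dissimilarity (0-100) to a percentage similarity."""
    if not 0 <= dissimilarity <= 100:
        raise ValueError("dissimilarity must lie between 0 and 100")
    return 100 - dissimilarity


# Identical samples: similarity 100, dissimilarity 0.
print(to_dissimilarity(100))  # 0
# Samples with nothing in common: similarity 0, dissimilarity 100.
print(to_dissimilarity(0))    # 100
```

Checking which scale a matrix is on before clustering or ordination guards against reading the most different samples as the most alike.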
Mutual Absences
- Mutual absences: when a species is absent from both of the samples being compared.
- Mutual absences can cause statistical and biological problems.
- Linking sites based on mutual absences may not be biologically meaningful in ecological studies.
- Example: Antarctica and the tropics both lacking emus doesn't make them similar.
- In habitat data, mutual zeros may be important (e.g., 0% litter cover).
Simple Example and Dissimilarity Matrix
- Link sites together based on species composition.
- A dissimilarity matrix indicates how close sites are in multivariate space.
- Lower numbers indicate closer proximity in multivariate space.
- Example numbers: Sites 2 and 4 are most similar; sites 1 and 3 are most dissimilar.
Calculating Distances and Dissimilarities
- Use metric measures for continuous (interval or ratio) data.
- Use non-metric measures for ordinal, nominal, or count data.
Euclidean Distance
- Metric distance used in principal components analysis.
- Based on Pythagoras' theorem from Euclidean geometry.
- d = \sqrt{\sum_{i} (x_i - y_i)^2}
- Where x_i and y_i are the values of the ith variable for the two points being compared.
- Samples with exactly the same values have a distance of zero; the distance is bounded below by zero and has no upper limit.
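A minimal sketch of the Euclidean distance formula above, using only the Python standard library. The site values are hypothetical and the function name is illustrative.

```python
import math


def euclidean_distance(x, y):
    """Euclidean distance between two samples measured on the same variables."""
    if len(x) != len(y):
        raise ValueError("both samples need the same number of variables")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


# Hypothetical measurements of three variables at two sites.
site_a = [1.0, 2.0, 2.0]
site_b = [4.0, 6.0, 2.0]

print(euclidean_distance(site_a, site_b))  # 5.0 (sqrt of 3^2 + 4^2 + 0^2)
print(euclidean_distance(site_a, site_a))  # 0.0 for identical samples
```

Because the squared differences are summed without scaling, variables measured in large units dominate the result.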
Metric vs. Non-Metric and Bray-Curtis
- In two-dimensional space, Euclidean distance can be derived geometrically.
- Non-metric measures cannot be derived geometrically in this way.
- Metric methods use the absolute distances, whereas non-metric methods use the rank order of the dissimilarities.
Bray-Curtis
- A commonly used non-metric measure in ecology.
- Derived by botanists to avoid linking samples based on joint absences.
- Also known as percentage dissimilarity.
- Ignores joint absences; the result is driven mainly by the variables with high values (the most abundant species).
- Suited for species abundance data.
- BC_{ij} = \frac{\sum_{k} |x_{ik} - x_{jk}|}{\sum_{k} (x_{ik} + x_{jk})}
- Where:
- BC_{ij} is the Bray-Curtis dissimilarity between samples i and j
- x_{ik} is the abundance of species k in sample i
- x_{jk} is the abundance of species k in sample j
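The Bray-Curtis formula above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the vegan implementation: the abundances are hypothetical, the function name is made up for this example, and the result is returned on a 0 to 1 scale (multiply by 100 for percentage dissimilarity).

```python
def bray_curtis(sample_i, sample_j):
    """Bray-Curtis dissimilarity (0 to 1) between two abundance vectors."""
    if len(sample_i) != len(sample_j):
        raise ValueError("both samples need the same species list")
    # Numerator: summed absolute differences, species by species.
    numerator = sum(abs(a - b) for a, b in zip(sample_i, sample_j))
    # Denominator: summed abundances across both samples.
    denominator = sum(a + b for a, b in zip(sample_i, sample_j))
    if denominator == 0:
        raise ValueError("undefined when both samples are empty")
    return numerator / denominator


# Hypothetical abundances of four species at two sites.
site_1 = [10, 0, 5, 5]
site_2 = [5, 5, 0, 10]

print(bray_curtis(site_1, site_2))        # 0.5
print(100 * bray_curtis(site_1, site_2))  # 50.0 as a percentage dissimilarity
```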
Bray-Curtis Example
- Numerator: sum the absolute differences in abundance of species A-E between site 1 and site 2.
- Denominator: sum the abundances of species A-E across both sites. A species absent from both sites adds zero to both sums, so it does not affect the result.
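A worked version of this calculation, using hypothetical abundances for species A-E (these are not the numbers from the lecture example). Species E is absent from both sites, and removing it leaves both sums and the dissimilarity unchanged.

```python
# Hypothetical abundances of species A-E at two sites (not from the lecture).
site_1 = {"A": 6, "B": 0, "C": 3, "D": 1, "E": 0}
site_2 = {"A": 2, "B": 4, "C": 3, "D": 1, "E": 0}


def bray_curtis_steps(s1, s2):
    """Return (sum of absolute differences, sum of abundances, dissimilarity)."""
    species = sorted(set(s1) | set(s2))
    abs_diff = sum(abs(s1.get(sp, 0) - s2.get(sp, 0)) for sp in species)
    total = sum(s1.get(sp, 0) + s2.get(sp, 0) for sp in species)
    return abs_diff, total, abs_diff / total


with_e = bray_curtis_steps(site_1, site_2)

# Drop species E, which is absent from both sites.
site_1_no_e = {sp: n for sp, n in site_1.items() if sp != "E"}
site_2_no_e = {sp: n for sp, n in site_2.items() if sp != "E"}
without_e = bray_curtis_steps(site_1_no_e, site_2_no_e)

print(with_e)     # (8, 20, 0.4)
print(without_e)  # (8, 20, 0.4)
```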
Implementation in R
- Use the vegan package to calculate a Bray-Curtis dissimilarity (or similarity) matrix.
Choosing Metric vs. Non-Metric
- PCA uses Euclidean distance, so the principal component scores can be used in linear regressions or ANOVAs because they retain the same metric properties.
- Species data often contain many zeros, and you usually do not want samples linked by joint absences.
- In measurement data, joint absences may be important.
Transformations
- In univariate statistics, data are transformed to improve normality or to satisfy homogeneity of variance.
- In multivariate data analysis, transformations are used for different reasons.
- Non-metric analyses do not assume normality.
- Transformations can down-weight very common species by changing the scale of measurement.
- Examples include square root, log (adding one to avoid log of zero), fourth-root, and presence/absence transformations.
- \log(x + 1)
- Where x is the original value; one is added because the log of zero is undefined (it tends to negative infinity).
- Square root transformation compresses large values, making the differences between abundant and rare species smaller.
- Fourth-root transformation further reduces the emphasis on common species.
- Presence/absence transformation converts all values to 1 (present) or 0 (absent).
- Consider the context of the data and analysis.
- Do not blindly apply transformations without justification.
- Determine if you want dominant variables to dominate the analysis.
- In exam questions, consider if the transformation is to increase linearity or to address the dominance of certain species.
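The effect of the transformations listed above can be sketched on a hypothetical abundance vector with one dominant species. The natural log is an assumption here; the notes do not say which base is used.

```python
import math

# Hypothetical abundances: one dominant species and several rarer ones.
abundances = [0, 1, 16, 81, 10000]

square_root = [math.sqrt(x) for x in abundances]
fourth_root = [x ** 0.25 for x in abundances]
log_plus_one = [math.log(x + 1) for x in abundances]  # natural log of (x + 1)
presence_absence = [1 if x > 0 else 0 for x in abundances]

print(square_root)       # [0.0, 1.0, 4.0, 9.0, 100.0]
print(presence_absence)  # [0, 1, 1, 1, 1]

# The dominant species shrinks relative to the others as the
# transformation becomes more severe.
for name, values in [
    ("raw", abundances),
    ("square root", square_root),
    ("fourth root", fourth_root),
    ("log(x + 1)", log_plus_one),
]:
    print(name, round(max(values), 2))
```

Presence/absence is the most severe option: every species that occurs carries the same weight regardless of abundance.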
Standardization
- Standardize to make each species equally important or to make each sample equally important.
- Useful for comparing samples of different sizes or with different sampling efforts.
- Express values as proportions or relative to the maximum value.
Calculating Proportions
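- Divide each species' abundance by the relevant total (the sample total to make samples equally important, or the species total to make species equally important).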
- Useful when comparing the relative importance of species or samples.
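A minimal sketch of two standardizations, assuming they are applied within each sample: proportions of the sample total, and values relative to the sample maximum. The counts and function names are hypothetical.

```python
def to_proportions(counts):
    """Express each count as a proportion of the sample total."""
    total = sum(counts)
    if total == 0:
        raise ValueError("cannot standardize an empty sample")
    return [c / total for c in counts]


def relative_to_max(counts):
    """Express each count relative to the largest count in the sample."""
    largest = max(counts)
    if largest == 0:
        raise ValueError("cannot standardize an empty sample")
    return [c / largest for c in counts]


# Hypothetical samples with the same composition but different sampling effort.
small_sample = [2, 6, 12]
large_sample = [20, 60, 120]

print(to_proportions(small_sample))  # [0.1, 0.3, 0.6]
print(to_proportions(large_sample))  # [0.1, 0.3, 0.6]
print(relative_to_max(small_sample))
```

After standardization the two samples are identical, which shows how differences in sampling effort are removed along with any real differences in total abundance.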
Important Considerations
- Standardizations can change the interpretation of results.
- Compare raw data analysis to standardized data analysis.
- Transformation is usually better than standardization.
Summary
- Understand dissimilarity and distance measures.
- Know the difference between metric (Euclidean) and non-metric (Bray-Curtis) measures.
- Understand the role of transformations in both metric and non-metric analyses.
- Apply distance and dissimilarity matrices in clustering and ordination.
Clustering Analysis
Introduction to Cluster Analysis
- Cluster analysis groups samples based on how similar they are to one another.
- Methods use similarity or dissimilarity coefficients between samples (e.g., Euclidean distance or Bray-Curtis dissimilarity).
- Samples can be clustered into groups or mapped in two- or three-dimensional space.
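As an illustration of how groups can be formed from a dissimilarity matrix, here is a sketch of agglomerative clustering with single linkage. Single linkage is chosen only because it is short to write; the notes do not specify a linkage method, and the matrix values are hypothetical.

```python
def single_linkage(dissimilarity, labels):
    """Agglomerative clustering; returns the merges as (group_a, group_b, level)."""
    clusters = [(label,) for label in labels]
    index = {label: i for i, label in enumerate(labels)}
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: dissimilarity of the closest pair of members.
                level = min(
                    dissimilarity[index[p]][index[q]]
                    for p in clusters[a]
                    for q in clusters[b]
                )
                if best is None or level < best[0]:
                    best = (level, a, b)
        level, a, b = best
        merges.append((clusters[a], clusters[b], level))
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append(merged)
    return merges


# Hypothetical percentage dissimilarities between four sites.
labels = ["S1", "S2", "S3", "S4"]
matrix = [
    [0, 60, 90, 70],
    [60, 0, 80, 10],
    [90, 80, 0, 85],
    [70, 10, 85, 0],
]

for group_a, group_b, level in single_linkage(matrix, labels):
    print(group_a, group_b, level)
# ('S2',) ('S4',) 10
# ('S1',) ('S2', 'S4') 60
# ('S3',) ('S1', 'S2', 'S4') 80
```

The merge levels are the heights at which branches would join in a tree diagram (dendrogram).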
Main Question in Clustering
- Cluster analysis is descriptive, not a statistical test.
- Do samples form natural groupings?
- May be used in taxonomy, genetics, ecology, soil science, etc.
Genetic Clusters
- Genetic data often carry different assumptions, for example about mutation rates.
- Cavalli-Sforza distance is one such genetic distance measure.
- Genetic distances can be used with the same clustering and ordination techniques.
- Tree diagram based on genetic distance.