Multivariate Analysis and Cluster Analysis Notes

Multivariate Analysis: Distance and Dissimilarity Measures

Introduction to Distance and Dissimilarity Measures

  • Multivariate analysis focuses on distance and dissimilarity measures.
  • Principal components analysis (PCA) indirectly uses Euclidean distance measures.
  • These measures are crucial for analyzing data and forming groupings, which leads to cluster analysis.
  • Distance matrices can also be called dissimilarity matrices; sometimes similarity matrices are used.

Overview of the Next Few Lectures

  • Concentrate on distance/dissimilarity matrices and similarity matrices.
  • Discuss transformations and standardizations of data.
  • Cover a range of cluster analysis methods, focusing on hierarchical and non-hierarchical clustering.
  • Emphasis on identifying and describing patterns, similar to PCA.

Key Concepts and Hypothesis Testing

  • Focus on describing patterns rather than hypothesis testing in PCA and clustering.
  • Hypothesis testing will be covered in more detail next week.

Resources and Practice

  • Recommended readings: Quinn & Keough (both old and new editions).
  • Practice examples and interpret data using available online resources.
  • Utilize datasets from both old and new editions for lectures and labs.
  • Examples in the lab build upon the background provided in the lecture.

Why Use Distance/Dissimilarity Matrices?

  • Analyzing multivariate data, like invertebrate species at different sites, necessitates considering interrelationships among variables.
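  • Analyzing all variables together avoids the inflated Type I error rate that comes from running a separate test for each species, and it accounts for species interactions.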
  • Simplify data into a more manageable format.

Applications of Distance and Dissimilarity Matrices

  • Clustering: Form groups from distance/dissimilarity matrices to classify data.
  • Ordination: Rearrange data in multi-dimensional space to produce a map (non-metric multi-dimensional scaling).
  • Statistical Testing: Use ANOSIM or PERMANOVA to test statistically for differences among groups.
  • Applicable to both natural experiments and manipulative experiments.

Defining Distance and Dissimilarity

  • Evaluate species composition or morphology of organisms.
  • Determine how different or alike samples are using distance and dissimilarity measures.
  • Dissimilarity is often used for non-metric data, while distance is used for metric data.
  • Matrices are often visualized in two or three-dimensional space.

Similarity vs. Dissimilarity Matrices

  • Percentage dissimilarity matrices are bounded by 0 and 100: 0 means identical, 100 means nothing in common.
  • Similarity matrices: 100 means identical, 0 means nothing in common.
  • Confusing similarity and dissimilarity can lead to incorrect results.
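
On the 0 to 100 scale described above, a similarity can be turned into a dissimilarity by subtracting it from 100. The Python sketch below assumes that convention; the function name and the example values are illustrative and not from the lecture.

```python
def to_dissimilarity(similarity_matrix):
    """Convert percentage similarities (0-100) to dissimilarities (0-100)."""
    result = []
    for row in similarity_matrix:
        for value in row:
            if not 0 <= value <= 100:
                raise ValueError(f"similarity {value} is outside 0-100")
        # Assumed convention: dissimilarity = 100 - similarity.
        result.append([100 - value for value in row])
    return result


# Hypothetical similarities among three samples (diagonal = identical).
similarity = [
    [100, 80, 25],
    [80, 100, 40],
    [25, 40, 100],
]
dissimilarity = to_dissimilarity(similarity)
print(dissimilarity[0])  # [0, 20, 75]
```

Checking that the diagonal is 0 (dissimilarity) rather than 100 (similarity) is a quick way to confirm which kind of matrix is in hand.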

Mutual Absences

  • Mutual absences: when a species is absent from both of the samples being compared.
  • Mutual absences can cause statistical and biological problems.
  • Linking sites based on mutual absences may not be biologically meaningful in ecological studies.
  • Example: Antarctica and the tropics both lacking emus doesn't make them similar.
  • In habitat data, mutual zeros may be important (e.g., 0% litter cover).

Simple Example and Dissimilarity Matrix

  • Link sites together based on species composition.
  • A dissimilarity matrix indicates how close sites are in multivariate space.
  • Lower numbers indicate closer proximity in multivariate space.
  • Example numbers: Sites 2 and 4 are most similar; sites 1 and 3 are most dissimilar.
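
Reading a dissimilarity matrix comes down to finding the smallest and largest off-diagonal values. The sketch below uses invented values for four sites, chosen only to reproduce the pattern described above; they are not the lecture's numbers.

```python
from itertools import combinations

# Hypothetical dissimilarities (0-100) among four sites; values are
# invented for illustration and are not the lecture's numbers.
sites = [1, 2, 3, 4]
dissimilarity = {
    (1, 2): 45, (1, 3): 90, (1, 4): 50,
    (2, 3): 60, (2, 4): 10, (3, 4): 70,
}

pairs = list(combinations(sites, 2))
# Lowest value = closest in multivariate space; highest = furthest apart.
most_similar = min(pairs, key=lambda pair: dissimilarity[pair])
most_dissimilar = max(pairs, key=lambda pair: dissimilarity[pair])

print(most_similar)     # (2, 4)
print(most_dissimilar)  # (1, 3)
```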

Calculating Distances and Dissimilarities

  • Use metric measurements for continuous or ratio data.
  • Use non-metric measurements for ordinal, nominal, or count data.

Euclidean Distance

  • Metric distance used in principal components analysis.
  • Based on Pythagoras' theorem, that is, on Euclidean geometry.
  • d = \sqrt{\sum_{i} (x_i - y_i)^2}
  • Where x_i and y_i are the values of the ith variable for the two points being compared.
  • Samples with exactly the same values have a distance of zero; the measure is bounded below by zero and has no upper limit.
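
The Euclidean formula above can be sketched in a few lines of plain Python. The function name and the two example points are illustrative assumptions.

```python
import math


def euclidean_distance(x, y):
    """Euclidean distance between two samples measured on the same variables."""
    if len(x) != len(y):
        raise ValueError("samples must have the same number of variables")
    # Square the difference for each variable, sum, then take the square root.
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


# Hypothetical samples with two variables each (a 3-4-5 right triangle).
a = [3.0, 0.0]
b = [0.0, 4.0]
print(euclidean_distance(a, b))  # 5.0
print(euclidean_distance(a, a))  # 0.0
```

Identical samples give exactly zero, and nothing caps the value from above, which matches the bounds noted in the list.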

Metric vs. Non-Metric and Bray-Curtis

  • In two-dimensional space, Euclidean distance can be derived geometrically.
    • Non-metric measures cannot be derived geometrically in this way.
  • Metric methods use the absolute distances themselves; non-metric methods use only their rank order.
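
To illustrate what "using ranks" means, the sketch below replaces a set of hypothetical dissimilarities with their ranks. The helper name is an assumption, and ties share an average rank, which is one common convention rather than something the notes specify.

```python
def rank_values(values):
    """Return 1-based ranks (smallest = 1); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Extend 'end' over a run of tied values.
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = average_rank
        start = end + 1
    return ranks


# Hypothetical pairwise dissimilarities.
distances = [2.0, 10.0, 3.5, 50.0]
print(rank_values(distances))                    # [1.0, 3.0, 2.0, 4.0]
# Squaring changes the absolute values but not their order, so ranks are unchanged.
print(rank_values([d ** 2 for d in distances]))  # [1.0, 3.0, 2.0, 4.0]
```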

Bray-Curtis

  • A commonly used non-metric measure in ecology.
  • Derived by botanists to avoid linking samples based on joint absences.
  • Also known as percentage dissimilarity.
  • Ignores joint absences; the result is driven mainly by the variables with high values.
  • Suited for species abundance data.
  • BC_{ij} = \frac{\sum_{k} |x_{ik} - x_{jk}|}{\sum_{k} (x_{ik} + x_{jk})}
  • Where:
    • BC_{ij} is the Bray-Curtis dissimilarity between samples i and j
    • x_{ik} is the abundance of species k in sample i
    • x_{jk} is the abundance of species k in sample j

Bray-Curtis Example

  • Calculate the absolute differences in abundance of species A-E between site 1 and site 2, and sum them.
  • Sum the abundances of species A-E across both sites; a species absent from both sites adds zero to each sum, so joint absences do not affect the result.
  • Divide the summed differences by the summed abundances.
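
The calculation can be sketched in plain Python as follows. This is a hand-rolled illustration rather than the R route used in the labs, and the abundances for species A-E are invented for the example.

```python
def bray_curtis(sample_i, sample_j):
    """Bray-Curtis dissimilarity (0-1) between two samples of species abundances."""
    if len(sample_i) != len(sample_j):
        raise ValueError("samples must list the same species")
    # Numerator: summed absolute differences, species by species.
    numerator = sum(abs(a - b) for a, b in zip(sample_i, sample_j))
    # Denominator: summed abundances across both samples.
    denominator = sum(a + b for a, b in zip(sample_i, sample_j))
    if denominator == 0:
        raise ValueError("both samples are empty; dissimilarity is undefined")
    return numerator / denominator


# Hypothetical abundances of species A-E at two sites.
site_1 = [10, 0, 5, 0, 1]
site_2 = [6, 2, 5, 0, 3]
print(bray_curtis(site_1, site_2))  # 0.25  (8 / 32)

# Adding a species absent from both sites adds zero to both sums.
print(bray_curtis(site_1 + [0], site_2 + [0]))  # 0.25
```

Multiplying the result by 100 gives the value on the 0 to 100 percentage scale.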

Implementation in R

  • Use the vegan package to calculate a Bray-Curtis similarity or dissimilarity matrix.

Choosing Metric vs. Non-Metric

  • PCA is based on Euclidean distance, so principal component scores keep metric properties and can be used as variables in linear regressions or ANOVAs.
  • In species data, shared zeros usually should not link samples, which favors a measure that ignores joint absences, such as Bray-Curtis.
  • In measurement data, joint absences may be important.

Standardization and Transformation

  • In univariate statistics, data are transformed to improve normality or to satisfy homogeneity of variance.
  • In multivariate data analysis, transformations are used for different reasons.
  • Non-metric analyses do not assume normality.

Transformations

  • Transformations can down-weight very common species by changing the scale of measurement.
  • Examples include square root, log (adding one to avoid log of zero), fourth-root, and presence/absence transformations.
  • x' = \log(x + 1)
  • Where x is the original value; one is added because the log of zero is undefined (it tends to negative infinity).

Examples of Transformations

  • Square root transformation makes the differences between numbers smaller.
  • Fourth-root transformation further reduces the emphasis on common species.
  • Presence/absence transformation converts all values to 1 (present) or 0 (absent).
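
The effect of each transformation is easiest to see on a small set of numbers. In the sketch below the abundances are invented, the function and method names are illustrative, and the natural log is an assumption since the notes do not state a base.

```python
import math


def transform(values, method):
    """Apply one of the transformations discussed above to a list of abundances."""
    if method == "sqrt":
        return [math.sqrt(v) for v in values]
    if method == "fourth_root":
        # Fourth root computed as the square root of the square root.
        return [math.sqrt(math.sqrt(v)) for v in values]
    if method == "log":
        # log(x + 1), natural log assumed; the +1 keeps zeros defined.
        return [math.log1p(v) for v in values]
    if method == "presence_absence":
        return [1 if v > 0 else 0 for v in values]
    raise ValueError(f"unknown method: {method}")


# Hypothetical abundances: one absent, one rare, one moderate, one very common species.
abundances = [0, 1, 16, 10000]
print(transform(abundances, "sqrt"))              # [0.0, 1.0, 4.0, 100.0]
print(transform(abundances, "fourth_root"))       # [0.0, 1.0, 2.0, 10.0]
print(transform(abundances, "presence_absence"))  # [0, 1, 1, 1]
```

The gap between the rare and the very common species shrinks from 9999 on the raw scale to 99 after a square root, 9 after a fourth root, and 0 under presence/absence.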

Transformation Considerations

  • Consider the context of the data and analysis.
  • Do not blindly apply transformations without justification.
  • Determine if you want dominant variables to dominate the analysis.
  • In exam questions, consider if the transformation is to increase linearity or to address the dominance of certain species.

Standardization

  • Standardize to make each species equally important or to make each sample equally important.
  • Useful for comparing samples of different sizes or with different sampling efforts.
  • Express values as proportions or relative to the maximum value.

Calculating Proportions

  • Divide each species' abundance by the relevant total (for example, the sample total when making samples equally important).
  • Useful when comparing the relative importance of species or samples.
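
A minimal sketch of standardizing by sample total, assuming two hypothetical samples of the same community collected with a tenfold difference in sampling effort; the function name is illustrative.

```python
def to_proportions(counts):
    """Express each species count as a proportion of the sample total."""
    total = sum(counts)
    if total == 0:
        raise ValueError("cannot standardize an empty sample")
    return [count / total for count in counts]


# Hypothetical counts of three species; sample_b had one tenth of the effort.
sample_a = [50, 30, 20]
sample_b = [5, 3, 2]
print(to_proportions(sample_a))  # [0.5, 0.3, 0.2]
print(to_proportions(sample_b))  # [0.5, 0.3, 0.2]
```

After standardization the two samples are identical in composition, even though their raw counts differ by a factor of ten.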

Important Considerations

  • Standardizations can change the interpretation of results.
  • Compare raw data analysis to standardized data analysis.
  • Transformation is usually better than standardization.

Summary

  • Understand dissimilarity and distance measures.
  • Know the difference between metric (Euclidean) and non-metric (Bray-Curtis) measures.
  • Understand the role of transformations in both metric and non-metric analyses.
  • Apply distance and dissimilarity matrices in clustering and ordination.

Clustering Analysis

Introduction to Cluster Analysis

  • Cluster analysis groups samples based on the extent to which they are similar.
  • Methods use similarity coefficients between samples (Euclidean or Bray-Curtis).
  • Samples can be clustered into groups or mapped in two- or three-dimensional space.
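
As a rough illustration of how groups can be formed from a dissimilarity matrix, the sketch below repeatedly merges the two closest clusters. It assumes single linkage (the distance between clusters is the distance between their closest members), which is only one of several linkage rules and is chosen here for brevity; the notes do not specify a method at this point. The site labels and dissimilarities are invented.

```python
def single_linkage(labels, distance, n_groups):
    """Merge the two closest clusters until n_groups remain.

    distance maps frozenset({a, b}) to the dissimilarity between a and b.
    """
    if not 1 <= n_groups <= len(labels):
        raise ValueError("n_groups must be between 1 and the number of samples")
    clusters = [[label] for label in labels]

    def gap(c1, c2):
        # Single linkage: distance between the closest pair of members.
        return min(distance[frozenset((a, b))] for a in c1 for b in c2)

    while len(clusters) > n_groups:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = gap(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical dissimilarities among four sites.
pairs = {
    ("A", "B"): 10, ("C", "D"): 15,
    ("A", "C"): 60, ("A", "D"): 70,
    ("B", "C"): 65, ("B", "D"): 80,
}
distance = {frozenset(pair): value for pair, value in pairs.items()}
print(single_linkage(["A", "B", "C", "D"], distance, 2))  # [['A', 'B'], ['C', 'D']]
```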

Main Question in Clustering

  • Clustering is a descriptive technique, not a statistical test.
  • Do samples form natural groupings?
  • May be used in taxonomy, genetics, ecology, soil science, etc.

Genetic Clusters

  • Genetic data often carry different assumptions, for example about mutation rates.
  • Cavalli-Sforza distance is one measure of what is called genetic distance.
  • Clustering and ordination techniques can be applied to genetic distances.
  • Tree diagrams can be built from genetic distance.