Correlation
One variable changes when the other variable changes
Regression
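One variable (the response) is modeled as a function of the other (the predictor); this assumes a direction of dependence but does not by itself prove causation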
Assumptions of Correlation
Both variables are continuous
Both variables are normally distributed
Drawbacks of Correlation
Does not imply causation
May miss non-linear relationships
Coefficient of Determination
R²
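As a rough illustration of how correlation, regression, and R² fit together, the sketch below computes Pearson's r, a least-squares line, and R² in plain Python. The data values and function names are made up for illustration; for a simple linear regression with one predictor, R² should equal r squared.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def fit_line(x, y):
    """Ordinary least-squares slope and intercept for y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return slope, my - slope * mx

def r_squared(x, y):
    """R^2 = 1 - SS_residual / SS_total for the fitted line."""
    slope, intercept = fit_line(x, y)
    my = sum(y) / len(y)
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return 1 - ss_res / ss_tot

# Hypothetical example data
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
r = pearson_r(x, y)
r2 = r_squared(x, y)
```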
Cook’s distance
A measure of how dramatically a single observation affects a regression
Large for observations with unusual X and Y values
Multicollinearity
An independent variable highly correlated with another independent variable
What are the assumptions of Linear Regression?
Linear Relationship between X and Y
Normal distribution of Y at each value of X
Variance of Y is the same at each value of X
No correlation of errors
Covariate
Any continuous value that is not of direct interest
Model I Regression
Assumes X values are fixed by design
Model II Regression
Does not assume X values are fixed by design
When to use ANOVA?
2 independent categorical variables
When to use ANCOVA?
1 independent continuous variable, 1 independent categorical variable
When to use Multiple Regression?
2 independent continuous variables
Random effect
Any categorical variable with more than 5 levels that we are not directly interested in
Blocking variable
Any categorical variable with 5 or fewer levels that we are not directly interested in
Conditional R²
Explained variance in a whole mixed model
Marginal R²
Explained variance by fixed effects in a mixed model
General Linear Models
Linear Regression
ANOVA
ANCOVA
Generalized Linear Models
Logistic regression
Poisson regression
ANOVA
Components to a GLM
Random component
Systematic component
Link function
Random component
Probability distribution of a response variable
Systematic component
A linear combination of the explanatory variables (the linear predictor)
Link function
How the linear predictor built from the explanatory variables is related to the mean of the response variable
Fixed effects
Variables which are of direct interest
Logistic Regression
When you have a categorical (typically binary) response, often with a continuous predictor
Logit function
The link function in a Logistic Regression
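A minimal sketch of the logit link and its inverse, assuming a single continuous predictor. The coefficient values and the helper name `predicted_probability` are hypothetical and only serve to show how the linear predictor is mapped to a probability.

```python
import math

def logit(p):
    """Link function: map a probability in (0, 1) to log-odds."""
    return math.log(p / (1 - p))

def inverse_logit(eta):
    """Inverse link: map a linear predictor (log-odds) back to a probability."""
    return 1 / (1 + math.exp(-eta))

# Hypothetical coefficients for a single continuous predictor x
intercept, slope = -3.0, 0.5

def predicted_probability(x):
    eta = intercept + slope * x   # systematic component (linear predictor)
    return inverse_logit(eta)     # mean of the binary response
```

With these made-up coefficients the predicted probability crosses 0.5 where the linear predictor is zero, at x = 6.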
Null-Hypothesis Testing
Decision based on rejecting or failing to reject a null hypothesis
Information Theoretic Approach
Develops a likelihood of a model being correct
Bayesian Inference
Update beliefs about a parameter’s distribution based on a prior probability and a likelihood function.
Assumptions of Logistic Regression
Independent Error terms
Little to no multicollinearity
Non-assumptions of Logistic Regression
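No linear relationship between the predictors and the response itself is necessary (linearity is assumed on the log-odds scale)
Independent variables do not need to be normally distributed
Homoscedasticity is not required
Independent variables do not need to be continuous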
Stepwise Regression
Building the best model by examining the impact of each variable to a model
Forward Selection
Build a model from scratch, adding variables if they significantly increase the model fit
Backward Elimination
Deconstruct a global model, removing variables until the model fits the data the best it can
Akaike's Information Criterion (AIC)
Selects the best model from a combination of model fit and parsimony
What information is needed to calculate AIC?
SSE or Log likelihood
Sample size
Number of parameters in the model
ΔAIC
AIC for current model - AIC for the best model (the one with the smallest AIC)
w_i
AIC model probability (Akaike weight): the relative probability that model i is the best model in the candidate set
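The AIC quantities above can be sketched as follows. This uses the standard formulas AIC = 2k - 2 ln(L) and, for least squares, AIC = n ln(SSE/n) + 2k; the candidate models, their log-likelihoods, and their parameter counts are invented for illustration.

```python
import math

def aic_from_loglik(log_likelihood, k):
    """AIC = 2k - 2 ln(L), with k = number of estimated parameters."""
    return 2 * k - 2 * log_likelihood

def aic_from_sse(sse, n, k):
    """Least-squares form: AIC = n * ln(SSE / n) + 2k (constant terms dropped)."""
    return n * math.log(sse / n) + 2 * k

def akaike_weights(aics):
    """Return (delta_aic, weights) for a list of AIC values."""
    best = min(aics)
    deltas = [a - best for a in aics]
    relative = [math.exp(-0.5 * d) for d in deltas]
    total = sum(relative)
    return deltas, [r / total for r in relative]

# Hypothetical (log-likelihood, parameter count) pairs for three candidate models
candidates = {"A": (-100.0, 2), "B": (-98.5, 3), "C": (-98.4, 5)}
aics = [aic_from_loglik(ll, k) for ll, k in candidates.values()]
deltas, weights = akaike_weights(aics)
```

In this made-up set, model B has the lowest AIC, so its ΔAIC is 0 and it receives the largest weight; model C fits slightly better but is penalized for its extra parameters.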
Effect Size
The magnitude of an effect
Types of effect statistics
d-stats
r-stats
odds ratios
Statistical Power
Probability of correctly finding a real pattern
What is the equation for statistical power?
1-β
Power analysis
Examining a statistical test to ensure it has enough power to support a reasonable conclusion
What 3 factors affect statistical power?
Sample Size
alpha
Effect size
A priori Power Analysis
Power analysis done before an experiment to test if the sample size is large enough to detect a significant effect
Post hoc power analysis
Power analysis done after an experiment to test if the sample size was large enough to detect a significant effect
Steps to perform power analysis
Choose type
Select expected study design
Select tool which supports design
Provide 3 of the 4 parameters (sample size, alpha, effect size, power) to solve for the fourth
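To make the "3 of 4 parameters" idea concrete, here is a sketch for one specific design: a two-sided, two-sample comparison of means. It assumes a normal (z-test) approximation so that only the standard library is needed; a real power tool would typically use the t distribution, and the function names are hypothetical.

```python
from statistics import NormalDist

def power_two_sample_z(n_per_group, alpha, effect_size):
    """Approximate power (1 - beta) of a two-sided, two-sample z-test.

    effect_size is Cohen's d. The normal approximation is an assumption
    made to keep the sketch dependency-free.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = effect_size * (n_per_group / 2) ** 0.5
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)

def required_n(power, alpha, effect_size):
    """A priori analysis: smallest per-group sample size reaching the target power."""
    n = 2
    while power_two_sample_z(n, alpha, effect_size) < power:
        n += 1
    return n

# Given sample size, alpha, and effect size, solve for power
power = power_two_sample_z(n_per_group=64, alpha=0.05, effect_size=0.5)
# Given power, alpha, and effect size, solve for sample size
n_needed = required_n(power=0.80, alpha=0.05, effect_size=0.5)
```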
Overfitting
Creation of a model which is too focused on a certain set of data
Multivariate Data
Data with many dependent/response variables
Variables have interactions
Covariates
Non-parametric data
Independence may be violated
Variances are unequal
Not normally distributed
What is a decision tree?
A non-parametric algorithm to classify and make predictions based on inputs
What is a random forest?
A series of multiple, randomly created decision trees
How many decision trees usually compose a random forest?
1000
How is a random forest made?
Training Dataset
Bootstrapping
Create individual decision trees from bootstrapping
Collect the answers from all the decision trees and choose the majority decision; bootstrapping followed by this aggregation is called Bagging
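The steps above can be sketched in a heavily simplified form. This is not a full random forest: each "tree" is a one-split stump on a single feature, and the random feature selection used by real random forests is omitted. The data, function names, and the small tree count (25) are assumptions chosen to keep the example short.

```python
import random
from collections import Counter

def fit_stump(sample):
    """Fit a one-split tree on (x, label) pairs: predict 1 when x >= threshold."""
    best_t, best_err = None, None
    for t in sorted({x for x, _ in sample}):
        err = sum((x >= t) != (label == 1) for x, label in sample)
        if best_err is None or err < best_err:
            best_t, best_err = t, err
    return best_t

def fit_bagged_stumps(data, n_trees, seed=0):
    """Bootstrap the training data and fit one stump per bootstrap sample."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        sample = rng.choices(data, k=len(data))   # sampling with replacement
        forest.append(fit_stump(sample))
    return forest

def predict(forest, x):
    """Aggregate the trees' answers by majority vote."""
    votes = Counter(1 if x >= t else 0 for t in forest)
    return votes.most_common(1)[0][0]

# Hypothetical training set: label is 1 when x is 5 or more
data = [(float(x), 1 if x >= 5 else 0) for x in range(10)]
forest = fit_bagged_stumps(data, n_trees=25)
```

An odd number of trees avoids tied votes in a two-class problem.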
What is cluster analysis?
The grouping of data points into clusters based upon similar traits
Why should you use cluster analysis?
Reveal hidden patterns
Hard clustering
Each data point in a cluster analysis belongs only to one cluster
Soft clustering
Each data point in a cluster analysis is given a probability it would be found in one cluster or another
Hierarchical Clustering
Clustering based on relationship between data points
How is hierarchical clustering performed?
Building a dendrogram by merging (or splitting) data points step by step, then cutting it across the greatest vertical distance that no merge line crosses to set the number of clusters
K-Means Clustering
Choosing a predefined number of centroids (K) which the data will be clustered to
What is a Centroid?
The mean of the points in a cluster
How to choose K?
Elbow method
Silhouette method
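A minimal K-means sketch (Lloyd's algorithm), together with the within-cluster sum of squares that the elbow method plots against K. Seeding the centroids with the first K points is a simplifying assumption made for reproducibility; practical implementations usually use random or k-means++ initialization. The data and function names are hypothetical.

```python
import math

def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm; the first k points seed the centroids (an assumption)."""
    centroids = [points[i] for i in range(k)]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid (hard clustering)
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid becomes the mean of its assigned points
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                new_centroids.append(centroids[c])   # keep an empty cluster's centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels

def inertia(points, centroids, labels):
    """Within-cluster sum of squared distances, the quantity used in the elbow method."""
    return sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))

# Hypothetical data: two well-separated groups
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = kmeans(points, k=2)
elbow_curve = {k: inertia(points, *kmeans(points, k)) for k in (1, 2, 3)}
```

For this toy data the inertia drops sharply from K = 1 to K = 2 and gains little afterwards, which is the "elbow" suggesting K = 2.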
What are HBIs?
Highly branched isoprenoids: long-chain alkenes produced by marine diatoms
How to use HBIs in data analysis?
They are produced by different forms of algae, and are thus biomarkers of what algae are primarily being consumed in the food web
What is H-Print?
A singular index for multiple biomarkers
Lower values mean it’s more sympagic, higher means more pelagic
iPOC
Index indicating the proportion of organic carbon derived from sea ice algae
Sea Ice Algae
Sympagic Diatoms which produce HBI I
Phytoplanktonic Algae
Pelagic algae which produce HBI III
Kernel Density Estimation
A non-parametric estimate of a probability distribution, displayed as smooth density curves
Bandwidth
A scalar for the width of a kernel
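A small sketch of kernel density estimation with a Gaussian kernel, showing how the bandwidth changes the curve. The observations and the function name `gaussian_kde` are made up for illustration, and the Gaussian kernel is one common choice among several.

```python
import math

def gaussian_kde(data, bandwidth):
    """Return a density function built from Gaussian kernels of the given bandwidth."""
    n = len(data)
    norm = 1 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2)
                          for xi in data)
    return density

# Hypothetical observations
data = [1.0, 1.5, 2.0, 4.0, 4.2]
narrow = gaussian_kde(data, bandwidth=0.2)   # small bandwidth: bumpy curve
wide = gaussian_kde(data, bandwidth=1.0)     # large bandwidth: smooth curve

# Riemann-sum check that the wide estimate integrates to about 1
grid = [i * 0.01 for i in range(-1000, 1500)]
area = sum(wide(x) for x in grid) * 0.01
```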
Ecological Spatial Analysis
Relationship between the observed spatial distribution of a species and the mechanisms behind that distribution
Minimum Convex Polygon
Draws the smallest polygon around a series of points with all interior angles being less than 180 degrees
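One way to compute a minimum convex polygon is a convex hull algorithm. The sketch below uses Andrew's monotone chain and the shoelace formula for area; the location coordinates and function names are hypothetical.

```python
def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive means a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def minimum_convex_polygon(points):
    """Convex hull via Andrew's monotone chain, vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def polygon_area(vertices):
    """Shoelace formula for the area enclosed by the polygon."""
    n = len(vertices)
    s = sum(vertices[i][0] * vertices[(i + 1) % n][1]
            - vertices[(i + 1) % n][0] * vertices[i][1] for i in range(n))
    return abs(s) / 2

# Hypothetical animal locations
locations = [(0, 0), (4, 0), (4, 3), (0, 3), (2, 1), (1, 2), (3, 2)]
hull = minimum_convex_polygon(locations)
home_range_area = polygon_area(hull)
```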
Utilization Distribution
A method for determining an organism's home range based upon density points
Can use Kernel Density Estimation to get this
How do you collect shape data?
Take standardized photographs (include a scale reference)
Digitize landmarks for shape and ensure they’re consistent and repeatable
What is Generalized Procrustes Analysis?
An analysis which outputs centroid sizes and coordinates which represent the shape
It scales/translates/rotates the landmark configurations so the shapes share a common frame of reference, preserving the relative Euclidean distances among landmarks
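The translate, scale, and rotate steps can be sketched for the simplest case: superimposing one 2D landmark configuration onto another. This is an ordinary two-shape Procrustes fit, a building block of the generalized analysis rather than the full iterative procedure, and it ignores reflection. The landmark coordinates and function names are hypothetical.

```python
import math

def center_and_scale(shape):
    """Translate landmarks to the origin and scale to unit centroid size."""
    n = len(shape)
    cx = sum(x for x, _ in shape) / n
    cy = sum(y for _, y in shape) / n
    centered = [(x - cx, y - cy) for x, y in shape]
    centroid_size = math.sqrt(sum(x * x + y * y for x, y in centered))
    scaled = [(x / centroid_size, y / centroid_size) for x, y in centered]
    return scaled, centroid_size

def align(reference, target):
    """Superimpose target onto reference (2D landmarks, no reflection)."""
    ref, _ = center_and_scale(reference)
    tgt, size = center_and_scale(target)
    # Closed-form rotation angle that minimizes summed squared landmark distances
    num = sum(ry * tx - rx * ty for (rx, ry), (tx, ty) in zip(ref, tgt))
    den = sum(rx * tx + ry * ty for (rx, ry), (tx, ty) in zip(ref, tgt))
    theta = math.atan2(num, den)
    c, s = math.cos(theta), math.sin(theta)
    rotated = [(x * c - y * s, x * s + y * c) for x, y in tgt]
    distance = math.sqrt(sum((rx - x) ** 2 + (ry - y) ** 2
                             for (rx, ry), (x, y) in zip(ref, rotated)))
    return rotated, size, distance

# Hypothetical landmarks: shape_b is shape_a rotated 90 degrees,
# enlarged 3 times, and shifted by (5, 5)
shape_a = [(0, 0), (2, 0), (2, 1)]
shape_b = [(5, 5), (5, 11), (2, 11)]
aligned_b, size_b, procrustes_distance = align(shape_a, shape_b)
```

Because the two configurations differ only in position, size, and orientation, the remaining distance after superimposition is essentially zero, while the size difference is retained separately as centroid size.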
Procrustes ANOVA
Determines the variation in shape caused by one or more factors
Residual Randomization in Permutation Procedures
Sums of squares are calculated across many permutations to determine effect probabilities
Assumptions of PCA?
Correlation in data
Most data points being non-zeros
Steps to a PCA
Centering & scaling the data
Calculating covariance matrix
Calculating eigenvalues/vectors
Finding principal components
Covariance matrix
Matrix with each variable appearing in the rows and columns, where the diagonal holds the variance of each variable and the off-diagonal entries hold the covariance between each pair of different variables
Calculate Eigenvalues of a covariance matrix
Set the determinant of (covariance matrix - lambda × identity matrix) equal to zero and solve for lambda; the resulting eigenvalues are the variances along the new axes
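For two variables the eigenvalue step reduces to a quadratic, which makes the PCA steps easy to follow by hand. The sketch below standardizes two made-up variables, builds the 2 x 2 covariance matrix, and solves det(M - lambda I) = 0; the data and function names are hypothetical, and larger problems would normally use a linear algebra library.

```python
import math

def standardize(values):
    """Center to mean 0 and scale to sample standard deviation 1."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]

def covariance_matrix(x, y):
    """2 x 2 sample covariance matrix: variances on the diagonal, covariance off it."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x) / (n - 1)
    syy = sum((b - my) ** 2 for b in y) / (n - 1)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    return [[sxx, sxy], [sxy, syy]]

def eigenvalues_2x2(m):
    """Solve det(M - lambda * I) = 0, i.e. lambda^2 - trace * lambda + det = 0."""
    trace = m[0][0] + m[1][1]
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    root = math.sqrt(max(trace ** 2 - 4 * det, 0.0))
    return (trace + root) / 2, (trace - root) / 2

# Hypothetical measurements of two correlated variables
x = standardize([2.0, 4.0, 6.0, 8.0, 10.0])
y = standardize([1.0, 3.0, 2.0, 5.0, 4.0])
cov = covariance_matrix(x, y)
lam1, lam2 = eigenvalues_2x2(cov)
explained = lam1 / (lam1 + lam2)   # share of variance on the first principal component
```

After standardizing, the covariance matrix is the correlation matrix, so the two eigenvalues are 1 + r and 1 - r for a positive correlation r between the variables.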
Redundancy analysis
Allows you to find correlation between a predictor and a response and visually graph them in a tri-plot
Survival Analysis
Statistical method to analyze “time to event” data
Survival Function
Probability an event hasn’t occurred by a given time point
Survival Curves
Graphical representation of event occurrence over time
What are some characteristics of Time to Event Data?
Non-negative values
Non-normal distribution (right-skewed)
Right censoring
Event isn’t observed within a study period
Left censoring
Event occurs before a study period
Random censoring
Censoring occurs independently of the time to event
Interval censoring
Specific time of event is unknown, but does happen in the interval
Kaplan-Meier Survival Curve
Non-parametric method used to estimate survival function
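A minimal sketch of the Kaplan-Meier estimator with right-censored data. At each event time the survival estimate is multiplied by (1 - events / number at risk). The follow-up times and the function name are invented for illustration.

```python
def kaplan_meier(times, events):
    """Return [(time, survival)] at each time where at least one event occurred.

    events[i] is 1 if the event was observed at times[i], 0 if right-censored.
    """
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        at_risk = sum(1 for ti in times if ti >= t)
        observed = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        if observed:
            survival *= 1 - observed / at_risk
            curve.append((t, survival))
    return curve

# Hypothetical follow-up data: the subjects with events = 0 are right-censored
times = [1, 2, 2, 3, 4]
events = [1, 1, 0, 1, 0]
curve = kaplan_meier(times, events)
```

Censored subjects still count in the number at risk up to their censoring time, which is why they lower later steps of the curve without producing a step themselves.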
Log-rank test
Non-parametric test used to compare survival function curves between two groups
Cox Proportional Hazards Model
Semi-parametric method used to assess impact of covariates on Hazard rate
Hazard Rate
Rate at which subjects experience event
Prior distribution
Framework for parameters in Bayesian analysis based on what we already know
Posterior distribution
The prior distribution updated with the observed data (through the likelihood function)
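One of the simplest cases of moving from a prior to a posterior is a Beta prior on a proportion combined with binomial data, where the update has a closed form. The prior parameters, the data, and the function names below are hypothetical.

```python
def update_beta(prior_a, prior_b, successes, failures):
    """Conjugate update: Beta(a, b) prior + binomial data -> Beta posterior."""
    return prior_a + successes, prior_b + failures

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

# Hypothetical example: weak prior belief centered on 0.5,
# then 8 successes observed in 10 trials
prior = (2, 2)
posterior = update_beta(*prior, successes=8, failures=2)
prior_mean = beta_mean(*prior)
posterior_mean = beta_mean(*posterior)
```

The posterior mean lands between the prior mean and the observed proportion (0.8), closer to the data because the prior is weak.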
How to interpret Bayesian analysis?
Credible intervals (the Bayesian counterpart of confidence intervals) and data visualization
Why use Bayesian methods?
Flexibility
Robustness
Nuance