Identifier (a.k.a. key)
Number or text that is used to identify and track unique rows of data in a data table
Sampling units
Any elements selected from a statistical population that make up your sample
continuous data
Data that can take ANY value (not just whole numbers); measured rather than counted
discrete data
Numerical data values that can be COUNTED, whole numbers
interval data
Continuous data that can take negative values (e.g. temperature in °C); the zero point is arbitrary
circular data
Data where highest value loops back around to lowest (e.g. clocks)
ordinal data
a type of data that refers solely to a ranking of some kind
nominal data
Data of categories only; the categories cannot be arranged in an ordering scheme (e.g. gender, race, religion)
A model
A mathematical equation that describes patterns in the data, e.g. Outcome = Intercept + Slope × Predictor + Error
resolution
The smallest difference you can measure, as limited by your measuring instrument
30 to 300 rule
Find the maximum and the minimum value in a dataset. The number of unit steps between the two should be between 30 and 300
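A minimal Python sketch of this check, using made-up minimum, maximum, and resolution values (not from the course):

```python
# 30-to-300 rule: the number of measurement-unit steps between the minimum
# and maximum values in the dataset should fall between 30 and 300.
data_min, data_max = 2.1, 9.8   # hypothetical smallest and largest values (cm)
resolution = 0.1                # smallest difference the instrument can measure (cm)

unit_steps = (data_max - data_min) / resolution
print(unit_steps)                    # about 77 steps
print(30 <= unit_steps <= 300)       # True -> this resolution is appropriate
```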
The fundamental difference between ordinal and nominal data is
there is NO order to nominal data.
wide-form flat file
A table where each row represents a sampling unit; all the data collected for that sampling unit are contained in a single row
5 types of derived variables
1. Ratio: Expresses as a single value the relationship two variables have to one another
2. Proportion: A ratio with 1 understood as the denominator
3. Percentage: A ratio with 100 understood as the denominator
4. Rate: Amount of something that happens per unit time or space
5. Index: A general term for other kinds of derived variables, often combining several measurements into a single value
Long-form flat file
Data stored in a single table (i.e. one worksheet).
Uses far fewer columns and many more rows than the wide form.
In our example, the data are entered as two variables (in columns): one for species identity and one for species abundance.
Metadata
Metadata is a file and/or variable comment that explains in detail what each variable is, how the data were collected, and any key assumptions of the data collection process.
Two good ways to provide metadata within your spreadsheet.
1) Create a worksheet at the start of the workbook called Metadata that outlines what each field contains
2) Create a Metadata sheet and use Comments in the label for each column
SUMMARIZING PATTERNS IN DATA VIA DESCRIPTIVE STATISTICS
1. Measures of location - statistics that demonstrate where the majority of the data lie
2. Measures of spread - statistics that demonstrate how variable/different your data are
3. Measures of certainty - statistics that tell you how well your sample represents your population
4. Measures of shape - statistics that describe the distribution of your data
inferential statistics
They use sample data to make generalizations or predictions about a population.
What are measures of location?
Statistics that describe the center or typical value of a dataset (mean, median, mode).
What is the arithmetic mean?
The sum of all values divided by the number of observations.
What is the geometric mean used for?
Growth rates or percentages.
What is the harmonic mean used for?
Rates and ratios (like speed).
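A quick Python sketch (made-up numbers) contrasting the three means; scipy's gmean and hmean are assumed available:

```python
import numpy as np
from scipy.stats import gmean, hmean

growth = np.array([1.10, 1.50, 0.80])   # hypothetical yearly growth factors
print(np.mean(growth))                  # arithmetic mean ~1.13
print(gmean(growth))                    # geometric mean ~1.10: the right average for growth rates

speeds = np.array([40.0, 60.0])         # the same distance travelled at 40 and 60 km/h
print(hmean(speeds))                    # harmonic mean = 48 km/h: the right average for rates
```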
What does variance measure?
The average squared deviation from the mean.
What's the difference between sample and population variance?
Sample variance divides by (n - 1); population variance divides by N.
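In Python/NumPy the divisor is controlled by ddof, as in this small sketch with toy data:

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])   # toy data

print(np.var(x, ddof=0))   # population variance: divide by N (NumPy's default) -> 4.0
print(np.var(x, ddof=1))   # sample variance: divide by n - 1 (unbiased)        -> ~4.57
```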
What is MAD?
The mean absolute deviation: the average of the absolute differences from the mean.
What does the 68-95-99.7 rule tell us?
About 68% of data lie within 1 SD, 95% within 2 SDs, 99.7% within 3 SDs.
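The rule can be checked against the normal CDF; a short sketch using scipy:

```python
from scipy.stats import norm

# Probability mass within k standard deviations of the mean of a normal distribution
for k in (1, 2, 3):
    coverage = norm.cdf(k) - norm.cdf(-k)
    print(k, round(coverage, 4))   # 0.6827, 0.9545, 0.9973
```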
What are measures of certainty?
Statistics that describe how confident we are in our sample estimates.
What is margin of error?
Half the width of a confidence interval
What does a PDF represent?
The continuous probability curve — total area under curve = 1.
What is a CDF?
It shows cumulative probability up to each value.
What's the main difference between descriptive and inferential stats?
Descriptive = describe data; Inferential = predict or generalize from data.
What determines good data quality?
Representativeness, randomness, and lack of bias in data collection.
Define stratified random sample.
A sample where the population is divided into natural groups called strata, and random samples are taken from each.
What are strata?
Natural groupings within a population that do not overlap and together make up the whole population.
What is a stratified estimate of the mean also called?
A weighted mean.
List benefits of a stratified design.
Increases precision, ensures representation of all groups, and reduces sampling error.
Steps to get a stratified sample.
1. Divide population into non-overlapping strata. 2. Take random samples within each stratum. 3. Use weighted formulas to estimate overall mean and variance.
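A minimal sketch of step 3 with invented strata and samples: the stratified estimate weights each stratum mean by that stratum's share of the population (W_h = N_h / N):

```python
import numpy as np

# Hypothetical strata: population sizes (N_h) and random samples drawn within each
strata = {
    "forest":    {"N": 500, "sample": np.array([3.1, 2.8, 3.5, 3.0])},
    "grassland": {"N": 300, "sample": np.array([1.2, 1.5, 1.1])},
    "wetland":   {"N": 200, "sample": np.array([4.0, 4.4])},
}

N_total = sum(s["N"] for s in strata.values())

# Stratified (weighted) mean: sum over strata of W_h * ybar_h
weighted_mean = sum((s["N"] / N_total) * s["sample"].mean() for s in strata.values())
print(weighted_mean)
```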
Define proportional allocation.
Sampling where the number of elements from each stratum is proportional to the stratum's size in the population.
Rules of proportional allocation.
1. Must round to nearest whole number. 2. Need at least two samples per stratum.
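A short sketch of proportional allocation (hypothetical stratum sizes), applying both rules: round to whole numbers and keep at least two samples per stratum:

```python
# Proportional allocation: n_h = n * N_h / N, rounded, with a minimum of 2 per stratum
stratum_sizes = {"forest": 500, "grassland": 300, "wetland": 20}   # hypothetical N_h
n = 30                                                             # total sample size
N = sum(stratum_sizes.values())

allocation = {name: max(2, round(n * N_h / N)) for name, N_h in stratum_sizes.items()}
print(allocation)   # {'forest': 18, 'grassland': 11, 'wetland': 2}
```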
What is optimal allocation?
A design using prior population information and cost functions to increase sampling efficiency.
Three rules of thumb for optimal allocation.
1. Larger strata → more samples. 2. More variable strata → more samples. 3. Cheaper strata → more samples.
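Those rules of thumb follow from the usual optimal-allocation formula, n_h proportional to N_h * S_h / sqrt(c_h); a sketch with invented sizes, standard deviations, and costs:

```python
import numpy as np

N_h = np.array([500, 300, 200])   # stratum sizes (hypothetical)
S_h = np.array([2.0, 5.0, 1.0])   # stratum standard deviations (prior information)
c_h = np.array([1.0, 1.0, 4.0])   # cost per sample in each stratum
n = 30                            # total sample size

# Larger, more variable, and cheaper strata receive more samples
weights = N_h * S_h / np.sqrt(c_h)
n_alloc = np.round(n * weights / weights.sum()).astype(int)
print(n_alloc)
```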
Define systematic sampling.
Sampling every k-th unit from an ordered list after a random start.
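A minimal sketch of a systematic sample (hypothetical population of 100 unit IDs):

```python
import numpy as np

rng = np.random.default_rng(42)

population = list(range(1, 101))   # an ordered list of 100 unit IDs
k = 10                             # sample every 10th unit
start = rng.integers(0, k)         # random start within the first interval

sample = population[start::k]      # every k-th unit after the random start
print(sample)                      # 10 units, evenly spaced through the list
```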
What is a complex sample?
A combination of sampling plans, like door-to-door surveys.
What do inferential statistics do?
Draw conclusions about a population from sample data.
What does the frequentist paradigm assume?
There is a single truth we estimate using samples.
Define frequentist statistics.
A statistical approach using event frequencies to make conclusions.
Define type I error.
Rejecting the null hypothesis when it's true.
Define type II error.
Failing to reject the null hypothesis when it's false.
What is a model?
A simplification of reality using predictor variables to explain variation in a response variable.
Write the general model equation.
Outcome = Intercept + Slope × Predictor + Error.
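A small sketch (simulated data, made-up parameter values) showing that a least-squares fit recovers the intercept and slope of this equation:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate Outcome = Intercept + Slope * Predictor + Error
intercept, slope = 2.0, 0.5
predictor = rng.uniform(0, 10, size=100)
outcome = intercept + slope * predictor + rng.normal(0, 1, size=100)

# Least-squares fit recovers the intercept and slope from the data
est_slope, est_intercept = np.polyfit(predictor, outcome, deg=1)
print(est_intercept, est_slope)   # close to 2.0 and 0.5
```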
Define response variable.
The measured outcome you want to understand or explain.
Define predictor variable.
A variable that explains or predicts changes in the response variable.
Define slope.
Rate at which the response changes for a one-unit change in predictor.
Define intercept.
Mean response value when all predictors equal zero.
Define error (residual).
Difference between observed and predicted response values.
Define categorical variable.
A qualitative variable with a limited set of fixed values.
Define dummy variables.
Numeric representations of qualitative categories for modeling.
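A quick sketch of dummy coding with pandas (the habitat column is invented):

```python
import pandas as pd

df = pd.DataFrame({"habitat": ["forest", "grassland", "wetland", "forest"]})

# drop_first=True keeps one category as the reference level (absorbed by the intercept)
dummies = pd.get_dummies(df["habitat"], drop_first=True).astype(int)
print(dummies)   # 0/1 columns for grassland and wetland; forest is the reference
```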
What are estimates in a model?
Values of slopes (betas) and intercepts describing predictor-response relationships.
How are p-values useful?
They test whether variation is explained by the model or if predictors are significant.
What do residual plots show?
Whether model errors are random and assumptions are met.
List goals in building a useful model.
1. Include enough variables for unbiased estimates. 2. Avoid too many variables for precision. 3. Balance bias and precision.
What is AIC?
Akaike Information Criterion; used to compare candidate models and select the best fit with the fewest variables (lower AIC is better).
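A rough sketch of comparing two models by AIC, computing AIC = 2k - 2 ln(L) by hand for Gaussian errors (simulated data, not the course's example):

```python
import numpy as np

def gaussian_aic(y, y_hat, n_coefs):
    """AIC = 2k - 2*ln(L) for a model with normally distributed errors."""
    n = len(y)
    rss = np.sum((y - y_hat) ** 2)
    log_lik = -0.5 * n * (np.log(2 * np.pi) + np.log(rss / n) + 1)
    k = n_coefs + 1                      # +1 for the estimated error variance
    return 2 * k - 2 * log_lik

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 1.0 + 0.8 * x + rng.normal(0, 1, 50)

fit1 = np.polyval(np.polyfit(x, y, 1), x)   # straight line (2 coefficients)
fit2 = np.polyval(np.polyfit(x, y, 5), x)   # 5th-degree polynomial (6 coefficients)
print(gaussian_aic(y, fit1, 2), gaussian_aic(y, fit2, 6))   # the simpler model usually wins
```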
Define Gaussian (normal) variable.
A variable with a symmetric distribution centered around the mean.
Properties of a normal distribution.
1. Most observations near mean. 2. Symmetrical shape. 3. Infinite tails.
Define correlation.
Measures strength and direction of a linear relationship between two variables.
Define partial correlation.
Relationship between two variables while controlling for others.
What are predicted values?
The arithmetic mean response predicted by the model.
What does the residual value show?
How far observed data deviate from predicted values.
Define 95% confidence interval.
The range that would contain the true slope in 95 out of 100 repeated samples.
What does a 95% confidence band represent?
The range of uncertainty for predicted values over predictor values.
Define linear model.
A statistical model describing response-predictor relationships using a linear equation and least squares regression.
Steps to build a model.
1. Plot response histogram. 2. Plot predictors. 3. Check scatterplots. 4. Calculate slope/intercept. 5. Assess fit. 6. Check residuals.
How does a GLM estimate slopes and intercepts?
By maximizing the likelihood function through repeated guesses until best fit is found.
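A rough sketch of that idea using scipy's general-purpose optimiser on a Gaussian model with a log link (simulated data; real GLM software uses more specialised algorithms such as iteratively reweighted least squares):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)

# Simulated data whose mean grows multiplicatively: E[Y] = exp(b0 + b1 * x)
x = rng.uniform(0, 3, size=200)
y = np.exp(0.5 + 0.7 * x) + rng.normal(0, 0.5, size=200)

def neg_log_lik(params):
    """Negative Gaussian log-likelihood with a log link (ln E[Y] is linear in x)."""
    b0, b1, log_sigma = params
    mu = np.exp(b0 + b1 * x)
    sigma = np.exp(log_sigma)            # keep sigma positive
    return -np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (y - mu) ** 2 / (2 * sigma**2))

# The optimiser adjusts b0, b1 and sigma repeatedly until the likelihood is maximised
fit = minimize(neg_log_lik, x0=np.array([0.0, 0.0, 0.0]))
print(fit.x[:2])   # should land close to the true b0 = 0.5 and b1 = 0.7
```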
WHAT DO I DO IF THE ERROR FAMILY IS NOT NORMAL?
OPTION 1: Transform it in order to normalize it and use linear model (LM)
OPTION 2: Figure out what distribution fits the data the best and model accordingly (GLM)
OPTION 3: Convert it to Ordinal data and ignore distribution (Non-parametric)
ALLOMETRY EXAMPLE OF WHERE TRANSFORMATION IS COMMON
How the characteristics of living creatures change with size
Isometric scaling happens when proportional relationships are preserved as size changes (a 1-to-1 scaling)
Log-log transformation of the data plus a linear model is the standard way to measure allometry and to test whether a relationship deviates from isometry; the fitted slope describes how the geometric mean of one trait changes with the other
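A minimal sketch (simulated trait data with a made-up exponent) of measuring an allometric slope with a log-log linear fit:

```python
import numpy as np

rng = np.random.default_rng(3)

# Allometric relationship trait = a * size^b with multiplicative error;
# on log-log axes it becomes a straight line: ln(trait) = ln(a) + b * ln(size)
a, b = 2.0, 0.75
body_size = rng.uniform(1, 100, size=100)
trait = a * body_size**b * np.exp(rng.normal(0, 0.1, size=100))

slope, intercept = np.polyfit(np.log(body_size), np.log(trait), deg=1)
print(slope)   # ~0.75: a slope different from 1 indicates a deviation from isometry
```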
allometry definitions
Evolutionary allometry: Relationship observed between species
Ontogenetic allometry: Relationship of trait Y against trait X during development (how fast does heart grow relative to body?)
Static allometry: Relationship between individuals in a biological population at same stage of development
GENERALIZED LINEAR MODEL
Is a linear model with a different error distribution (probability density function) that better captures data generating process
If the range of values is constrained (e.g. > 0, or truncated at some maximum), the variance is not constant, or the residuals are not normally distributed, you might use a GLM
How can both LM and GLM be viewed conceptually?
Response = deterministic part + stochastic part
What is the deterministic part of the model?
It predicts the mean — also called the systematic part of the equation.
What is the stochastic part of the model?
The error or random part that models variability and uncertainty.
What is the key difference between a data transformation and a GLM link function?
Data transformation changes both mean and variance.
Link function only transforms the mean (systematic part).
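The distinction can be seen numerically: for skewed data, the mean of the log (what a transformation models) differs from the log of the mean (what a log link models). A small sketch with simulated log-normal data:

```python
import numpy as np

rng = np.random.default_rng(4)
y = rng.lognormal(mean=1.0, sigma=0.8, size=100_000)   # positively skewed data

# Data transformation: models the mean of the transformed data, E[ln Y]
print(np.mean(np.log(y)))    # ~1.00 (the geometric-mean scale)

# Log link: models the log of the mean of the untransformed data, ln E[Y]
print(np.log(np.mean(y)))    # ~1.32 (= 1.0 + 0.8**2 / 2 for a log-normal)
```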
When are LM and GLM the same?
When the error is normal and the identity link is used.
The log-normal is a probability distribution that:
1) Lets us model multiplicative relationships (the linear model with a normal distribution is additive)
2) Ensures negative/zero values can't be predicted (the normal distribution will make such predictions)
3) Is one way of dealing with positive skew and heterogeneous variance
4) Is typically used with continuous and interval data AND sometimes discrete data
5) Has a longer right tail with larger SD
6) Has a mode and median shifted to the left relative to the arithmetic mean
What are the two parts of a model?
Response = deterministic (systematic) part + stochastic (random) part
What does the deterministic part represent?
It predicts the mean (systematic component) of the response
What does the stochastic part represent?
The error or variability in the model
What makes a GLM different from a linear model?
GLM allows non-normal error distributions and uses a link function
What are the three components of a GLM?
Systematic component, random component, and link function
What is the role of the link function?
It connects the mean of the response to the linear predictor
What does a link function do?
It transforms the mean of the response and constrains predicted values to a valid range
What happens when a link is not the identity function?
The model becomes nonlinear on the original scale
What is the purpose of transformations in modelling?
To stabilize variance and linearize relationships
What is the log-normal model based on?
Log-transformation of the response variable (ln(Y))
What type of error does a log-normal model assume?
Multiplicative error (constant variance on log scale)
What is a log-link Gaussian GLM?
A model where ln(E[Y]) is linear in predictors
How is a log-link GLM interpreted?
A unit increase in X multiplies the expected Y by e^beta
What is the main difference between log-normal and log-link models?
Log-normal transforms data; log-link transforms the mean
What does additive vs multiplicative response mean?
Additive = equal differences; multiplicative = equal proportions
When is multiplicative modelling better?
When data vary proportionally or span several magnitudes