Biostatistics (def)
Analysis of data derived from biological sciences and medicine
Statistical tests to analyze biological data
Describing (large/small/variable a trait is)
Testing (whether two things are related/different)
Predicting (whether an intervention is effective)
Categorical data (def, types)
Non-ordered (nominal): yes/no
Ordered (ordinal): ranked, agree/disagree/neutral
Nominal data (def, ex)
Categorical
Non-ordered, non-numerical, can’t be ranked
Yes or no
Eye color
Ordinal data (def, ex)
Categorical
Natural rank or order, but distances between are not known or equal
Poor, average, good
Education levels
Military ranks
Numeric data (def, types)
Scale
Discrete: number of siblings
Continuous/interval: height, BMI
Discrete data (def, ex)
Numeric
Countable, distinct values that can’t be broken down into smaller parts
Number of students
Clicks on a website
Continuous/interval data
Numeric
Values that can be infinitely divided, measured
Height
Weight
Temperature
Measured, not counted
Rank (def)
Relative position of a set of measurements
Rate (def, ex, notes)
Ratio of two quantities
Percentages, proportions, ratios
Zero as lowest meaningful point
Descriptive statistics (use, ex)
To present and describe data in a useful way
Show patterns and associations
Summary statistics, graphs
Inferential statistics (use, ex)
To draw conclusions about a population from a sample; taking action based on the data
Estimation
Hypothesis tests, statistical modelling
Box and whisker plot (type, notes)
Median, range, IQR, outliers
Numeric data
25% in each whisker
50% in the box
Presenting categorical data
Table
Bar chart
Pie chart (<6 slices)
Presenting numeric data
Histograms
Box and whisker plots
Mode (def, features)
Value that occurs most frequently
Useful for discrete or categorical data
Median (def, features)
Middle value
Useful for asymmetric or skewed data or outlying values
Mean (def, feat)
Average of data set (add up and divide by number of points)
Affected by extreme values
For symmetrical distributions
Can be a "fictitious" value (one that no observation actually takes)
Only for interval data
Skew (left vs right)
Left peak: POSITIVE skew
Right peak: NEGATIVE skew
Lefty win :)
Range (def, feat)
Spread from the minimum to the maximum value (maximum minus minimum)
Sensitive to extreme values
Interquartile range (def, feat)
Difference between 25th and 75th centiles
Range of middle 50% of data
Not influenced by extreme values; ok for skewed
Variance (def, feat)
Mean of the squared distances to the mean
Are data close to the mean or spread out?
Standard deviation (def, feat)
Square root of the variance
Symmetric data
Coefficient of variation (use, calc)
Compare different scales or measurements
Standard deviation divided by the mean; usually multiplied by 100 to get a percentage
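A minimal sketch of the variance, SD and coefficient of variation cards above, using Python's standard `statistics` module. The height values are made up for illustration. `pvariance` and `pstdev` divide by n, which matches the "mean of the squared distances" definition; the sample versions (`statistics.variance`, `statistics.stdev`) divide by n - 1 instead.

```python
import statistics

# Hypothetical heights in cm (illustrative values only)
heights = [150, 160, 165, 170, 180]

mean = statistics.mean(heights)
variance = statistics.pvariance(heights)   # mean of the squared distances to the mean
sd = statistics.pstdev(heights)            # square root of the variance
cv_percent = sd / mean * 100               # SD as a percentage of the mean

print(mean, variance, sd, round(cv_percent, 2))
```

Because the CV is unit-free, it can be compared across variables measured on different scales.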
P (def)
Central or expected value for a given outcome, proportion or probability
N (def)
Sample size
Variance and standard error (calc)
p(1-p)/n
Standard error is same but square rooted
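As a rough illustration of the two formulas above, here is the calculation for a made-up sample in which 20 of 100 people have the trait of interest. The numbers are assumptions, not data from the deck.

```python
import math

# Hypothetical example: 20 of 100 sampled people have the trait
n = 100          # sample size
p = 20 / n       # sample proportion

variance = p * (1 - p) / n             # p(1-p)/n
standard_error = math.sqrt(variance)   # same quantity, square rooted

print(round(variance, 4), round(standard_error, 4))
```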
Probability (def, calc)
Frequency of event in many many trials
Categorical variables
Number of events/total number of trials
Probability of two events (calcs)
Mutually exclusive (either one occurs): Add probabilities
Independent (both occur): Multiply probabilities
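A small worked example of the add and multiply rules, using a fair die and a fair coin as assumed illustrations. The multiply rule is shown for independent events, where one outcome does not change the chance of the other.

```python
from fractions import Fraction

# Mutually exclusive: rolling a 1 OR a 2 on one fair die -> add
p_one = Fraction(1, 6)
p_two = Fraction(1, 6)
p_one_or_two = p_one + p_two

# Independent: heads on one toss AND heads on the next -> multiply
p_heads = Fraction(1, 2)
p_two_heads = p_heads * p_heads

print(p_one_or_two, p_two_heads)
```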
Binomial probabilities (def, notes)
Chance of getting a specific number of “successes” in a fixed number of independent trials
Only two outcomes (success/failure)
Constant success probability
Conditions for binomial probabilities
BINS
Binary outcomes
Independent trials
Number of trials (N)
Same probability of success
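When the BINS conditions hold, the probability of exactly k successes in n trials can be computed directly. This is a minimal sketch; the function name `binomial_probability` and the coin-toss example are illustrative choices, not part of the deck.

```python
import math

def binomial_probability(k, n, p):
    """Probability of exactly k successes in n independent trials,
    each with the same success probability p."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

# Hypothetical example: exactly 3 heads in 10 fair coin tosses
prob = binomial_probability(3, 10, 0.5)

# Probabilities over every possible number of successes add up to 1
total = sum(binomial_probability(k, 10, 0.5) for k in range(11))

print(prob, total)
```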
Normal distribution
Probability of a range of values is represented by the area under the curve; total area = 1
Mean/median/mode equal
Symmetrical around the mean
Standard deviation Normal distribution (percents)
68% within ±1 SD of mean
95% within ±2 SD
99.7% within ±3 SD
Standard Normal distribution mean and SD
Mean (μ) = 0
SD (σ) = 1
Z value (def, equ)
Any Normal variable can be standardized to get a z-value
z = (value - mean)/SD
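A short sketch of standardizing a value and checking the 68/95/99.7 percentages with `statistics.NormalDist` from the Python standard library. The height of 180 with mean 170 and SD 10 is an assumed example.

```python
from statistics import NormalDist

def z_value(value, mean, sd):
    """Number of SDs that a value lies above (+) or below (-) the mean."""
    return (value - mean) / sd

# Hypothetical example: height 180 cm, population mean 170 cm, SD 10 cm
z = z_value(180, 170, 10)

# Area under the standard Normal curve within 1, 2 and 3 SDs of the mean
standard_normal = NormalDist(mu=0, sigma=1)
within_1_sd = standard_normal.cdf(1) - standard_normal.cdf(-1)
within_2_sd = standard_normal.cdf(2) - standard_normal.cdf(-2)
within_3_sd = standard_normal.cdf(3) - standard_normal.cdf(-3)

print(z, round(within_1_sd, 4), round(within_2_sd, 4), round(within_3_sd, 4))
```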
Kurtosis (def, names)
How tailed/peaked the distribution is
High kurtosis = leptokurtic = fat tails and sharper peaks, more outliers
Low kurtosis = platykurtic = thinner tails, flatter peak, fewer outliers
Mesokurtic similar to Normal
Population and Sample (defs, issue)
Population: group of interest
Sample: group chosen to represent population
Issue: sample may not reflect population; repeated samples from same population may give different results
Population parameters (def)
What is of interest to determine
Sample estimates = sample statistics, point estimates of population parameters
Conventional notation
Population:
N = population size
μ = mean
σ = SD, with σ² = population variance
Sample:
n = sample size
x̄ = mean
sd = sample SD
sd² = sample variance
Sampling distribution of the mean (def)
Distribution of the means of repeated samples of the same size; it has its own mean and standard deviation
Standard error of the mean (def, equ)
Standard deviation of the sampling distribution of the means
Used to comment on the population mean, not to describe the dispersion of values around a mean
SEM = sd/√n
Confidence intervals (def, equ)
Range of values that would contain the true population value in 95% of repeated samples (for a 95% CI)
Indication of how good the sample mean is as an estimate of the population mean
Mean ± 1.96 × SEM (for 95% CI)
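The SEM and 95% CI formulas can be sketched as follows. The five measurements are invented to keep the arithmetic easy to follow; with a sample this small a t-based multiplier would normally replace 1.96, so treat this only as an illustration of the formula on the card.

```python
import math
import statistics

sample = [10, 12, 14, 16, 18]   # hypothetical measurements

n = len(sample)
mean = statistics.mean(sample)
sd = statistics.stdev(sample)     # sample SD (divides by n - 1)
sem = sd / math.sqrt(n)           # standard error of the mean

ci_lower = mean - 1.96 * sem
ci_upper = mean + 1.96 * sem

print(mean, round(sem, 3), round(ci_lower, 2), round(ci_upper, 2))
```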
Z score requisites
Random samples
Quantitative data
Variable Normally distributed
Sample size > 30
Z score interpretation
Positive Z = observation is >mean
Negative z = observation is <mean
Sampling distribution of a proportion (def, notes)
the distribution of all possible values of the proportion that could be obtained in repeated samples of the same size from the population of interest
has a shape, mean and SD
mean = π
Standard error of proportion
Standard deviation of the sampling distribution of the proportion
Gets smaller as sample size increases
Central limit theorem
Large enough sample size 30+ → distribution of sample means will approximate a normal distribution
Finite population correction factor
1 - f, where f = n/N is the fraction of the population sampled
Applies when most or all of the population has been sampled
Poisson distribution (use, ex)
Models rare events that occur across time, e.g. 100-year floods
Less than 20 events
Rare health events like infant mortality, cancer
Confidence intervals for an age-adjusted rate
SE = R/√N
R = age-adjusted rate
N = number of events (deaths)
CI: R ± 1.96 × SE
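A worked example of the age-adjusted rate interval above. The rate of 50 per 100,000 and the count of 100 deaths are assumed values chosen only to show the arithmetic.

```python
import math

rate = 50.0    # hypothetical age-adjusted rate, per 100,000
deaths = 100   # hypothetical number of events behind that rate

se = rate / math.sqrt(deaths)   # SE = R / sqrt(N)
ci_lower = rate - 1.96 * se
ci_upper = rate + 1.96 * se

print(se, round(ci_lower, 1), round(ci_upper, 1))
```

Fewer events give a larger SE and a wider interval for the same rate.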
Pnorm
Converts a Z score into a probability
Find probability that a randomly selected value from a Normal distribution would be less than or equal to a specified value
Output = value of the cumulative distribution function, the area under the curve to the left of a Z score
Qnorm
Converts probability into a Z score
Find the specific value in a Normal distribution below which a given proportion of the distribution falls
Output = quantile or value below which the given percentage of the distribution lies
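`pnorm` and `qnorm` are R functions. As a rough Python stand-in (an assumption for illustration, not the tool the cards describe), `statistics.NormalDist` offers `cdf` for the Z-to-probability direction and `inv_cdf` for the probability-to-Z direction.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# pnorm-style: Z score -> cumulative probability (area to the left)
prob_below = standard_normal.cdf(1.96)

# qnorm-style: cumulative probability -> Z score (quantile)
z_cutoff = standard_normal.inv_cdf(0.975)

print(round(prob_below, 3), round(z_cutoff, 2))
```

The two calls are inverses of each other, which is why 1.96 and 0.975 appear as a pair.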
Hypothesis (def)
A test of belief, or a set of rules for coming to a yes/no conclusion
Statistical hypothesis = statement of belief regarding the value of one or more population characteristics
Hypothesis (process)
State null hypothesis (Ho) → there is NO EFFECT
Apply a statistical test
Decide whether to reject or fail to reject the null hypothesis (p-value)
Interpret the test
T test (use, assumptions)
To test if two means are the same
One continuous, one categorical
Assumes:
distributions roughly symmetrical
Observations independent
Variances are similar
T statistic/value (def, equ)
How far the data are from the null hypothesis
A standardized difference in means
(mean 1 - mean 2)/SE of mean diff
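A minimal sketch of the t statistic for two independent groups, assuming similar variances so that a pooled variance can be used for the SE of the mean difference. Both groups are made-up data.

```python
import math
import statistics

group_1 = [5, 6, 7, 8, 9]   # hypothetical scores
group_2 = [3, 4, 5, 6, 7]   # hypothetical scores

n1, n2 = len(group_1), len(group_2)
mean_1, mean_2 = statistics.mean(group_1), statistics.mean(group_2)
var_1, var_2 = statistics.variance(group_1), statistics.variance(group_2)

# Pooled variance (assumes the two group variances are similar)
pooled_variance = ((n1 - 1) * var_1 + (n2 - 1) * var_2) / (n1 + n2 - 2)
se_difference = math.sqrt(pooled_variance * (1 / n1 + 1 / n2))

t_statistic = (mean_1 - mean_2) / se_difference
degrees_of_freedom = n1 + n2 - 2

print(round(t_statistic, 3), degrees_of_freedom)
```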
T score (meaning, equ)
How many SEMs are we from the zero?
A standardized score for comparison
P value (def, threshold)
Probability of a test statistic at least as extreme as the one observed, if the null is true
P>0.05, fail to reject the null
P<0.05, reject the null
Cohen’s d (def, interp)
How many SDs separate the means of two groups
Standardized measure of effect size
Small = .2
Medium = .5
Large = .8
Effect size for t tests
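Cohen's d can be sketched as the difference in means divided by a pooled SD. The pooled-SD form is one common convention and is assumed here; the two groups are invented data.

```python
import math
import statistics

group_1 = [5, 6, 7, 8, 9]   # hypothetical scores
group_2 = [3, 4, 5, 6, 7]   # hypothetical scores

n1, n2 = len(group_1), len(group_2)
var_1, var_2 = statistics.variance(group_1), statistics.variance(group_2)

# Pooled SD across both groups
pooled_sd = math.sqrt(((n1 - 1) * var_1 + (n2 - 1) * var_2) / (n1 + n2 - 2))

cohens_d = (statistics.mean(group_1) - statistics.mean(group_2)) / pooled_sd

print(round(cohens_d, 3))
```

Against the thresholds on the card, a value above 0.8 would count as a large effect.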
Type I error
False positive
Rejecting the null when it’s true
Controlled by significance level α, usually .05 or 5%
Type II error
False negative
Failing to reject null when it’s false
Controlled through large enough sample size
β
Power of a test (def, equ)
Probability of correctly rejecting a null
1-β
β usually 0.2, giving power of 80%
Paired t test (use)
Needs a paired sample, e.g. a pre- and post-treatment measurement on the same participants
Calculates the difference within pairs (change score) first (subtract pre from post), then takes the mean
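The paired approach above can be sketched in a few lines: compute each participant's change score, then test whether the mean change differs from zero. The pre and post values are made up.

```python
import math
import statistics

pre = [10, 12, 14, 16]    # hypothetical pre-treatment scores
post = [12, 15, 15, 18]   # hypothetical post-treatment scores, same participants

# Change score for each pair: post minus pre
differences = [after - before for before, after in zip(pre, post)]

mean_difference = statistics.mean(differences)
se_difference = statistics.stdev(differences) / math.sqrt(len(differences))

t_statistic = mean_difference / se_difference
degrees_of_freedom = len(differences) - 1

print(differences, mean_difference, round(t_statistic, 3), degrees_of_freedom)
```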
ANOVA
Analysis of variance: how different the means are
T test for 2+ means
Alt: at least one mean is different
Effect size: Eta squared or partial eta squared
Between-group differences (ANOVA)
Using group means and grand mean
Sums of squared deviations, then mean squared deviations (between)
Within-group differences
Using raw values and their group mean
Sum of squared deviations, then mean squared deviations (within)
Mean squares (ANOVA, calc)
= Sums of squares/degrees of freedom
Degrees of freedom (ANOVA)
Between groups (df1) = number of groups - 1
Within groups (df2) = N total - number of groups
Total df = N total - 1
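A by-hand sketch of the one-way ANOVA pieces described above (sums of squares, degrees of freedom, mean squares), using three small invented groups. The F ratio, mean square between divided by mean square within, is the usual ANOVA test statistic; it is not named on the cards and is included here as standard background.

```python
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]   # hypothetical data, 3 groups

all_values = [x for g in groups for x in g]
n_total = len(all_values)
grand_mean = sum(all_values) / n_total

def group_mean(g):
    return sum(g) / len(g)

# Between groups: group means against the grand mean
ss_between = sum(len(g) * (group_mean(g) - grand_mean) ** 2 for g in groups)
# Within groups: raw values against their own group mean
ss_within = sum((x - group_mean(g)) ** 2 for g in groups for x in g)

df_between = len(groups) - 1        # number of groups - 1
df_within = n_total - len(groups)   # N total - number of groups

ms_between = ss_between / df_between   # mean square = SS / df
ms_within = ss_within / df_within

f_ratio = ms_between / ms_within
eta_squared = ss_between / (ss_between + ss_within)

print(round(ss_between, 2), round(ss_within, 2), round(f_ratio, 2), round(eta_squared, 4))
```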
ANOVA interpretation
If significant (p<0.05), justification to go look at differences between groups to find the weirdo using Tukey’s HSD post-hoc
ANOVA post-hoc tests (why, ex)
To control for type I error at 5% level
Usually Tukey’s HSD
Partial eta squared (what, which test, interp)
ANOVA
Proportion of the variance in outcome variable explained by the grouping factor
Small effect: 0.01
Medium: 0.06
Large: 0.14
ANOVA assumptions
Groups normally distributed
Groups have same variance (up to 2x ok)
Observations and groups independent
Ok with moderate violations of assumptions (robust) but worse if groups very different sizes/small sample sizes
Correlation (def)
Measures association between two CONTINUOUS numeric variables
Association not causation
Assumes a straight line is the true relationship
Pearson’s r (def, interp, assum)
Correlation coefficient
Ranges from -1 to 1
<0.5 = weak
.5-.7 = moderate
>.7 = strong
Assumes Normally distributed variables, straight line relationship
Scatterplot (set up)
Dependent variable on the Y
Predictor on the X
Spearman’s ρ and Kendall’s τ
Alternate correlation coefficients for non-Normal or ordinal variables
Non-parametrics
R² (R squared) (def, use)
Square of Pearson’s r
Effect size for correlation; amount of overlap between the variables
If r² = 0.25, 25% of the variance is shared between variables
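A minimal sketch of Pearson's r and its square, computed from sums of squares and cross-products. The x and y values are invented.

```python
import math

x = [1, 2, 3, 4, 5]   # hypothetical predictor values
y = [2, 4, 5, 4, 5]   # hypothetical outcome values

mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)

ss_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
ss_x = sum((xi - mean_x) ** 2 for xi in x)
ss_y = sum((yi - mean_y) ** 2 for yi in y)

r = ss_xy / math.sqrt(ss_x * ss_y)
r_squared = r ** 2   # share of variance the two variables have in common

print(round(r, 4), round(r_squared, 2))
```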
Uses of correlation (3)
Association between two variables in observational studies
Validating a new test against a gold standard
Reliability of a test
Null hypothesis for correlation
There is no correlation, r = 0
Linear regression (def, equ)
Gives equation of the line of the association
y = mx+b or y=b0+b1x
m or b1 is slope
b or b0 is intercept
Interpreting the slope
Slope quantifies how different y is for a +1 unit difference in x
Least squares (def, equ)
Regression fits the “best line” where the distance squared from each data point to the line is kept as small as possible
slope = SSyx/SSx
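The least squares slope and intercept can be sketched directly from the formula above. The data are invented, and the intercept line uses the standard fact that the fitted line passes through the point (mean of x, mean of y).

```python
x = [1, 2, 3, 4, 5]   # hypothetical predictor values
y = [2, 4, 5, 4, 5]   # hypothetical outcome values

mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)

ss_yx = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
ss_x = sum((xi - mean_x) ** 2 for xi in x)

slope = ss_yx / ss_x                  # b1
intercept = mean_y - slope * mean_x   # b0

# Residuals: vertical distance from each point to the fitted line
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

print(round(slope, 2), round(intercept, 2), [round(e, 2) for e in residuals])
```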
Error/residuals in regression (def)
The distances from each point to the line (vertically)
Represents unexplained variability in Y
Software output of note for regression (4)
R² value
Standard error of the estimate
Coefficients: Intercept and Slope, including P-values
Assumptions of linear regression (3)
Linear relationship (check using scatterplots)
Constant variance of the residuals (no wedge shape)
Residuals have Normal distribution
Stability of a regression line (depends on)
Sample size
<5 observations per predictor is unstable
10-15 at least, >20 ideally
100 plus the number of predictor variables
Chi-square test (use, output, null)
Compare proportions between two or more groups
Categorical and categorical
Gives χ² (chi-squared) test statistic
Null = proportions in groups are the same; being active or not is independent of gender
Expected table (chi) (def, equ, assum)
What you would expect in the cells of a table if the null were true
= (row total × column total)/grand total
Should have N>=20 and no cell <=5 unless N>=40; if not, use Fisher’s exact test
Chi-squared degrees of freedom (how, equ)
Depends on size of table
df = (#rows-1)x(#columns-1)
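A small sketch that builds the expected table, the chi-squared statistic and the degrees of freedom for a 2x2 table. The observed counts are made up, and no continuity correction is applied.

```python
# Hypothetical observed counts: rows = groups, columns = outcome yes/no
observed = [[30, 20],
            [20, 30]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count if the null were true: row total x column total / grand total
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

chi_squared = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

df = (len(observed) - 1) * (len(observed[0]) - 1)

print(expected, chi_squared, df)
```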
Assumptions of chi-square test (3, additional/alternate tests)
Sample size large enough (Fisher’s exact test)
Independent data points, not pre-post (McNemar’s tests)
Distribution curve is continuous, while cell counts are discrete (Yates’ continuity correction)
Complete follow-up (def, notes)
All participants followed to death
Length of survival known for everyone
Rare and difficult; data is often “censored” or lost to follow-up
Right-censored data (def)
Blind to after
We don’t know what happened after a particular time or when a future event happens
Left-censored data (def)
Blind to the past
We don’t know what happened before a particular time or when a past event happened
Variable time follow-up (def)
Participants aren’t followed to death
Analyzing variable-time studies (best to worst, 4)
Life table approach (optimum)
Using person-years as a unit of observation (acceptable but unrealistic)
Comparing N-year risk (biased if any incomplete FUP)
Comparing mean survivals (doesn’t work)
Clinical/actuarial life table (def, equ)
Gives survival probabilities over time
Probability of surviving from time 0 to time b = (prob of surviving 0 to a)*(prob of surviving a to b)
Kaplan Meier approach (def)
Shows survival probabilities over time (steps)
Splits periods at events (outcome aka death, withdrawal, etc)
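A minimal sketch of the Kaplan Meier idea: survival only steps down at times when an outcome occurs, and censored participants simply leave the at-risk group. The function name `kaplan_meier`, the follow-up times and the event flags (1 = outcome, 0 = censored) are all illustrative assumptions.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each time with an outcome.

    times  : follow-up time for each participant
    events : 1 if the outcome occurred at that time, 0 if censored
    """
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        at_risk = sum(1 for ti in times if ti >= t)
        outcomes = sum(1 for ti, ei in zip(times, events) if ti == t and ei == 1)
        if outcomes > 0:
            # Multiply by the probability of surviving this step
            survival *= (at_risk - outcomes) / at_risk
            curve.append((t, survival))
    return curve

# Hypothetical follow-up data for 5 participants
times = [1, 2, 3, 4, 5]
events = [1, 0, 1, 1, 0]

curve = kaplan_meier(times, events)
for t, s in curve:
    print(t, round(s, 3))
```

Each step multiplies the running survival by the chance of getting through that interval, which is the same product rule used in the life table card.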
Log-rank test (use)
Compare two+ groups for survival time
Median survival time (def, note)
Time point where 50% survival is reached
Only estimable when curves drop below 50% survival
Exploratory study objectives (3)
State of low knowledge
Baselines, natural history
Discovery of patterns/potential associations
Confirmatory study objectives (3)
Find effect of a given magnitude
Control of type I and II errors
Possible population state/hypothesis test