Advanced Epi Methods

Description and Tags

Linear, Logistic, Poisson, Survival Analysis

Last updated 5:23 PM on 4/13/25

68 Terms

1
New cards

Regression Analysis

To determine whether 1 or more independent variables are associated with a dependent variable

2
New cards

Independent variable

Explanatory variable

Predictor variable

X

3
New cards

Dependent variable

Response variable

Outcome variable

Y

4
New cards

What is a statistical model?

The equation that describes the putative relationship among variables.

5
New cards

Multivariable analysis

Inferences based on the parameter for any independent variable are conditional on the other independent variables in the model.

Avoid omitting potential confounders while not including variables of minimal consequence

6
New cards

Linear Regression

Outcome is measured on a CONTINUOUS scale, e.g. body weight

Predictors can be measured on a continuous or categorical (dichotomous) scale

7
New cards

Linear Regression Example

Is chest girth (cm) significantly associated with body weight (kg) among heifers?

8
New cards

How do we determine the line that best fits our data?

The method of least squares is used to estimate the parameters (in this case, β0 and β1) in such a way as to minimize the sum of the squared residuals
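For one predictor, the least-squares estimates have a closed form. A minimal sketch in pure Python — the girth/weight numbers are made up for illustration, not real heifer data:

```python
def least_squares(x, y):
    """Estimate beta0 (intercept) and beta1 (slope) by minimizing
    the sum of squared residuals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # beta1 = sum((xi - x̄)(yi - ȳ)) / sum((xi - x̄)²)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    beta1 = sxy / sxx
    beta0 = mean_y - beta1 * mean_x
    return beta0, beta1

# Perfectly linear toy data: y = 2 + 3x
girth = [1.0, 2.0, 3.0, 4.0]
weight = [5.0, 8.0, 11.0, 14.0]
b0, b1 = least_squares(girth, weight)
print(b0, b1)  # → 2.0 3.0
```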

9
New cards

What is a residual?

Used to estimate error in the model

The difference between an observed value of Y & its predicted value for a given value of X

10
New cards

t-test (linear regression)

t = β / SE(β)

Used to evaluate whether the predictor is significantly associated with the outcome

Denotes that the predictor explains the variation in the outcome

11
New cards

R² (coefficient of determination)

Measures how well the predictor(s) explain the variation in the outcome

Always goes up if new predictors are added

12
New cards

Adjusted R²

Its value is adjusted for the number of predictor variables (k) in the model

Will go down if new predictors are added that have minimal additional impact on the outcome
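The contrast between R² and adjusted R² can be sketched in a few lines of pure Python (toy numbers, not from any real model):

```python
def r_squared(y, y_hat):
    """1 - SS_residual / SS_total: fraction of outcome variance explained."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n, k):
    """Penalizes the number of predictor variables k (n = sample size)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

y = [1.0, 2.0, 3.0, 4.0, 5.0]        # observed outcomes
y_hat = [1.1, 1.9, 3.0, 4.1, 4.9]    # fitted values from some model
r2 = r_squared(y, y_hat)             # 0.996
adj = adjusted_r_squared(r2, n=5, k=2)
# adjusted R² is always <= R², and the penalty grows with k
```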

13
New cards

Model Assumptions (linear regression)

Independence: the values of the outcome variable are independent from one another, i.e. no clustered data

Linearity: the relationship between the outcome and any continuous predictor variables is linear

Normal distribution: the residuals are normally distributed

Homoscedasticity: the variance of the residuals is the same across the range of predicted values of y

14
New cards

What if underlying assumptions are not met in a linear regression?

Independence and linearity assumptions are the most important

Apply a data transformation, e.g. logarithmic

Can proceed as planned if there are moderate departures from normality & homoscedasticity

15
New cards

Cook’s Distance (Di)

assesses the influence of each observation

Standardized measure of the change in regression parameters if the particular observation was omitted

16
New cards

Collinearity

presence of highly correlated predictor variables in the model

Leads to t-test statistics that are spuriously small and thus p-values that are misleading

Assessed using variance inflation factor (VIF)

17
New cards

Variance inflation factor (VIF)

Measures how much the variance of regression coefficients in the model is inflated by addition of a predictor variable that contains very similar information

Values of VIF > 10 indicate serious collinearity

The SE of a regression parameter will ↑ by a factor of about the square root of VIF when a collinear predictor variable is added to the model

18
New cards

VIF = 1/(1 − R²X)

where R²X is the coefficient of determination describing the amount of variance in the incoming X that is explained by the predictors already in the model
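With a single predictor already in the model, R²X reduces to the squared Pearson correlation between the two predictors, so the VIF formula can be sketched directly (toy data for illustration):

```python
def vif_two_predictors(x1, x2):
    """VIF for an incoming predictor x2 when x1 is already in the model.
    With one existing predictor, R²X is the squared Pearson correlation."""
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    v1 = sum((a - m1) ** 2 for a in x1)
    v2 = sum((b - m2) ** 2 for b in x2)
    r2x = cov ** 2 / (v1 * v2)       # coefficient of determination
    return 1 / (1 - r2x)

print(vif_two_predictors([1, 2, 3, 4], [1, -1, -1, 1]))  # uncorrelated → 1.0
print(vif_two_predictors([1, 2, 3, 4], [1, 2, 3, 5]))    # nearly collinear → ~29.2, i.e. > 10
```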

19
New cards

Logistic Regression

Outcome of interest is measured on a categorical scale

Usually dichotomous: yes/no, negative/positive, 0/1

Predictors can be measured on a continuous or categorical scale

20
New cards

Can we use linear regression for a dichotomous outcome?

No, as we would be unable to interpret predicted values of Y other than 0 or 1

21
New cards

Generalized linear models (GLM)

Random component: identifies the outcome variable Y & selects a probability distribution for it, e.g. normal, binomial, Poisson, negative binomial

Systematic component: specifies the linear combination of predictor variables, e.g. β0 + β1X1

Link function: specifies a function that relates the expected value of Y to the linear combination of predictor variables, i.e. it connects the random & systematic components

Gives us a linear relationship between our outcome variable & predictor(s)

22
New cards

Interpreting OR for continuous predictors

The factor by which the odds are ↑ (or ↓) for each unit change in the predictor
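In the logistic model ln(odds) = β0 + β1X, "each unit change" multiplies the odds by exp(β1). A sketch with made-up coefficients (not from any real study):

```python
import math

def odds(beta0, beta1, x):
    """ln(odds) = beta0 + beta1*x  →  odds = exp(beta0 + beta1*x)."""
    return math.exp(beta0 + beta1 * x)

b0, b1 = -2.0, 0.5                     # hypothetical fitted coefficients
odds_ratio = odds(b0, b1, x=3) / odds(b0, b1, x=2)
# the OR per unit change equals exp(b1) regardless of where x starts
print(round(odds_ratio, 3))            # exp(0.5) ≈ 1.649
```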

23
New cards

Maximum likelihood estimation

used to estimate the regression parameters

24
New cards

Wald chi-squared test

used to evaluate the significance of individual parameters

25
New cards

Model assumptions logistic regression

Independence: the observations are independent from one another

Linearity: the relationship between the outcome (i.e. ln{p/(1 – p)}) and any continuous predictor variables is linear

26
New cards

Goodness-of-fit statistics (logistic regression)

Address the differences between observed & predicted values, or their ratio:

Pearson χ2

Deviance χ2

Hosmer-Lemeshow test

27
New cards

Pearson & deviance χ²

Based on dividing the data into covariate patterns

Within each pattern, the predicted # of outcomes is computed & compared to the observed # of outcomes to yield the Pearson & deviance residuals

The Pearson & deviance chi-squared statistics represent the sums of the respective squared residuals

28
New cards

Hosmer-Lemeshow test

Based on dividing the data in a more arbitrary fashion, e.g. percentiles of estimated probability

Predicted & observed outcome probabilities within each group are compared as before

More reliable if the # of covariate patterns is high relative to the # of observations

29
New cards

Poisson Regression

Outcome of interest is measured on a discrete scale, e.g. # of cases of disease, # of deaths

Predictors can be measured on a continuous or categorical (including dichotomous) scale

30
New cards

Model assumptions (Poisson regression)

Independence: the observations are independent from one another

Linearity: the relationship between the outcome, i.e. ln (μ/N), & any continuous predictor variables is linear

Mean = variance

31
New cards

Overdispersion

Greater variability than expected for a GLM

Count data often vary more than would be expected if the response distribution was truly Poisson, i.e. the variance of the counts >> the mean

32
New cards

Data as counts

nosocomial infections

cases of cvd

workplace injuries

33
New cards

Problem of overdispersion

Model-based estimates of variance are too small

Thus, estimates of SE for model parameters might not be appropriate

Inference (based on Wald statistics and corresponding p-values) is questionable

34
New cards

Methods of managing overdispersion

Account for clustering if present

Compute scaled SEs of parameter estimates

Assume a more flexible distribution, e.g. negative binomial distribution

35
New cards

Indicators of overdispersion

Can be quantified by dividing the Pearson or deviance χ2 by its df

Pearson χ² usually performs better

Values > 1 indicate overdispersion
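The dispersion check can be sketched with the Pearson statistic in pure Python (toy counts; in practice the fitted means come from the Poisson model):

```python
def pearson_dispersion(observed, fitted, n_params):
    """Pearson chi-squared divided by its residual df.
    Values > 1 suggest overdispersion."""
    chi2 = sum((o - m) ** 2 / m for o, m in zip(observed, fitted))
    df = len(observed) - n_params
    return chi2 / df

# Toy counts varying much more than their fitted mean of 3
d = pearson_dispersion(observed=[0, 5, 1, 6], fitted=[3, 3, 3, 3], n_params=1)
print(round(d, 2))   # 2.89 → overdispersed
```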

36
New cards

Scaled SEs

One approach is to refit the model while allowing the variance to have a multiplicative scaling (dispersion) factor

The χ² statistic (Pearson is preferred) divided by its df is used

Scaled deviance and scaled Pearson χ² values (original values divided by the dispersion factor) are generated

37
New cards

The SEs of the parameter estimates are scaled through

multiplication by the square root of the Pearson dispersion factor

38
New cards

Negative binomial regression

allows for larger variance than Poisson, i.e. the variance is not constrained to = the mean

39
New cards

Survival Analysis

Class of statistical methods for studying the occurrence & timing of events

Length of time that elapses before an event happens

Factors that are associated with the timing of that event

Death is often the event of interest, but these methods are broadly applicable

Also called time-to-event analysis

40
New cards

Event

Any qualitative change (transition from 1 discrete state to another) that can be situated in time

Quantitative changes are fine too if the threshold has relevance to real life

41
New cards

Timing

It is best (but not always possible) to know the exact timing of the event

Time origin should mark the onset of continuous exposure to risk of the event

Often this origin is unavailable, so must use a proxy

Seconds to years

42
New cards

Risk factors

We generally want to know if the risk of an event depends on various exposures

43
New cards

Censored data

Censored observations have unknown event times, e.g. due to loss to follow-up or termination of the study before event occurrence

44
New cards

Right censoring

The event has not been observed to occur

45
New cards

Left censoring

The event has already occurred before a given time (e.g. the start of the study)

46
New cards

Interval censoring

A combination of right & left, typically occurring when observations are made at infrequent intervals

47
New cards

Type I

Censoring that occurs because observation was stopped at a fixed time

48
New cards

Type II

Censoring that occurs because observation was stopped after a fixed number of events

49
New cards

Random

Censoring that occurs for reasons not under the control of the investigator, e.g. death or loss to follow-up

Our assumption is that censoring times are non-informative

Perform sensitivity analysis to assess

50
New cards

Survivor Function

Probability that an event time is > t

S(t) = P{T > t}

If event of interest is death, it’s the probability of surviving beyond time t

Estimated by proportion of individuals still alive at t

51
New cards

Hazard function

Can be thought of as the instantaneous risk of event occurrence at t, conditional on the event not occurring up to t

Interpreted as expected # of events per interval of time (for repeatable events)

52
New cards

Non-parametric Methods (Kaplan-Meier)

No assumptions about the distribution of event times or the relationship between predictors & the time until event

Straightforward

Estimates survivor functions (also called product-limit estimates)
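A minimal product-limit sketch in pure Python — it assumes distinct event/censoring times for simplicity (real implementations handle ties):

```python
def kaplan_meier(times, events):
    """Product-limit estimate of S(t).
    times: follow-up times; events: 1 = event observed, 0 = censored.
    Assumes all times are distinct (no ties)."""
    at_risk = len(times)
    s = 1.0
    curve = []
    for t, d in sorted(zip(times, events)):
        if d == 1:                   # S(t) drops only at event times
            s *= (at_risk - 1) / at_risk
            curve.append((t, s))
        at_risk -= 1                 # censored subjects also leave the risk set
    return curve

# 4 subjects: events at t=1, 3, 4; one censored at t=2
print(kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1]))
# → [(1, 0.75), (3, 0.375), (4, 0.0)]
```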

53
New cards

Parametric Methods

The distribution of event times is assumed to be known

Characterized by unique hazard functions

Exponential

Gompertz

Weibull

54
New cards

Semi-parametric

Cox proportional hazards

55
New cards

Kaplan-Meier is ideal for

Preliminary examination of data

Computing derived quantities from regression models (e.g. median survival time, 3-year probability of survival, etc.)

Comparing survivor functions between groups

56
New cards

Exponential model

Hazard is assumed to be constant over time

h(t) = λ

ln h(t) = μ (i.e. μ = ln λ)

Unrealistic, but potentially useful as a first approximation even when assumption is known to be false. e.g. hazard of winning the lottery

57
New cards

Gompertz Model

The log of the hazard is a linear function of time

ln h(t) = μ + αt (where α is a constant)

e.g. hazard of retirement ↑ linearly with age; hazard of making work-related errors ↓ linearly with time spent in your career

The hazard may ↑ or ↓ with time but may not change direction

58
New cards

Weibull Model

The log of the hazard is a linear function of the log of time

ln h(t) = μ + αln(t) (where α is a constant)

Again, the hazard may ↑ or ↓ with time but only in one direction
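The three parametric hazards above differ only in how time enters the log-hazard. A sketch with arbitrary illustrative parameters:

```python
import math

def exponential_hazard(t, mu):
    """ln h(t) = mu: constant hazard."""
    return math.exp(mu)

def gompertz_hazard(t, mu, alpha):
    """ln h(t) = mu + alpha*t: log-hazard linear in time."""
    return math.exp(mu + alpha * t)

def weibull_hazard(t, mu, alpha):
    """ln h(t) = mu + alpha*ln(t): log-hazard linear in log-time."""
    return math.exp(mu + alpha * math.log(t))

# Exponential: the same hazard at any t
print(exponential_hazard(1, mu=0.0), exponential_hazard(10, mu=0.0))  # 1.0 1.0
# Gompertz/Weibull with alpha > 0: hazard rises with t, never changes direction
print(gompertz_hazard(2, mu=0.0, alpha=0.1) > gompertz_hazard(1, mu=0.0, alpha=0.1))  # True
```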

59
New cards

Cox proportional hazards model

The distribution of event times does not need to be specified, i.e. no need to specify how the hazard depends on time

Doesn't assume specific shape for baseline hazard

But assumes proportional hazards between groups

Flexible like non-parametric (Kaplan-Meier)

Powerful like parametric (Weibull/Gompertz)

60
New cards

Cox proportional hazards model (limitation)

The model can only be used to compare hazards between subjects, not to predict the value of the hazard for any one subject

61
New cards

Model fit: likelihood-ratio test (LRT)

Compares the likelihood of the full model (with all covariates included)

with that of the null model (only α(t))

62
New cards

Log-rank test

Used when overall survival differences matter

Long-term treatment effects

Chronic disease studies

63
New cards

Wilcoxon test

Used when early differences are important

Acute treatments

Early deaths matter more

When many people drop out late

64
New cards

Tests for comparing survivor functions

Log-rank

Wilcoxon
