Linear, Logistic, Poisson, Survival Analysis
Regression Analysis
To determine whether one or more independent variables are associated with a dependent variable
Independent variable
Explanatory variable
Predictor variable
X
Dependent variable
Response variable
Outcome variable
Y
What is a statistical model?
The equation that describes the putative relationship among variables.
Multivariable analysis
Inferences based on the parameter for any independent variable are conditional on the other independent variables in the model.
Avoid omitting potential confounders while not including variables of minimal consequence
Linear Regression
Outcome is measured on a CONTINUOUS scale, e.g. body weight
Predictors can be measured on a continuous or categorical (dichotomous) scale
Linear Regression Example
Is chest girth (cm) significantly associated with body weight (kg) among heifers?
How do we determine the line that best fits our data?
The method of least squares is used to estimate the parameters (in this case, β0 and β1) in such a way as to minimize the sum of the squared residuals
What is a residual?
Used to estimate error in the model
The difference between an observed value of Y & its predicted value for a given value of X
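As a rough illustration of the heifer example above, here is a minimal Python sketch (using statsmodels) that fits the least-squares line and extracts the residuals; the girth and weight values and column names are invented purely for illustration.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical heifer data: chest girth (cm) and body weight (kg)
df = pd.DataFrame({
    "girth":  [150, 155, 160, 165, 170, 175, 180],
    "weight": [310, 330, 345, 370, 390, 410, 430],
})

# Ordinary least squares: weight = beta0 + beta1 * girth
model = smf.ols("weight ~ girth", data=df).fit()

print(model.params)     # estimates of beta0 (intercept) and beta1 (slope)
print(model.resid)      # residuals: observed weight minus predicted weight
print(model.summary())  # t-tests, R-squared, etc.
```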
t-test (linear regression)
t = β / SE(β)
Used to evaluate whether the predictor is significantly associated with the outcome
A significant result denotes that the predictor explains some of the variation in the outcome
R²
How well the predictor(s) explain variation in the outcome
It never decreases when new predictors are added
Adjusted R²
Its value is adjusted for the number of predictor variables (k) in the model
Will go down when new predictors are added that have minimal additional impact on the outcome
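For concreteness, a small sketch of the adjusted R² arithmetic, with the standard penalty for the number of predictors k and sample size n; the example numbers are arbitrary.

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Example: R^2 = 0.80 with n = 50 observations and k = 3 predictors
print(adjusted_r2(0.80, n=50, k=3))  # ~0.787
```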
Model Assumptions (linear regression)
Independence: the values of the outcome variable are independent from one another, i.e. no clustered data
Linearity: the relationship between the outcome and any continuous predictor variables is linear
Normal distribution: the residuals are normally distributed
Homoscedasticity: the variance of the residuals is the same across the range of predicted values of y
What if underlying assumptions are not met in a linear regression?
Independence and linearity assumptions are the most important
Apply a data transformation, e.g. logarithmic, if they are violated
Can proceed as planned if there are moderate departures from normality & homoscedasticity
Cook’s Distance (Di)
assesses the influence of each observation
Standardized measure of the change in regression parameters if the particular observation was omitted
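A brief sketch of pulling Cook's distance out of a fitted statsmodels OLS result, reusing the same toy heifer data; the 4/n cutoff shown is only a common screening heuristic, not a hard rule.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data and fit (same toy heifer example as above)
df = pd.DataFrame({
    "girth":  [150, 155, 160, 165, 170, 175, 180],
    "weight": [310, 330, 345, 370, 390, 410, 430],
})
model = smf.ols("weight ~ girth", data=df).fit()

# Influence diagnostics: one Cook's distance D_i per observation
cooks_d, _ = model.get_influence().cooks_distance

# Common screening heuristic (not a hard rule): flag observations with D_i > 4 / n
n = len(cooks_d)
print([i for i, d in enumerate(cooks_d) if d > 4 / n])
```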
Collinearity
presence of highly correlated predictor variables in the model
Leads to t-test statistics that are spuriously small and thus p-values that are misleading
Assessed using variance inflation factor (VIF)
Variance inflation factor (VIF)
Measures how much the variance of regression coefficients in the model is inflated by addition of a predictor variable that contains very similar information
Values of VIF > 10 indicate serious collinearity
The SE of a regression parameter will ↑ by a factor of about the square root of VIF when a collinear predictor variable is added to the model
VIF = 1 / (1 – R²_X)
where R²_X is the coefficient of determination describing the amount of variance in the incoming X that is explained by the predictors already in the model
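A sketch of computing VIF per predictor with statsmodels; the two predictors below (girth and a made-up body length) are hypothetical and deliberately correlated so the calculation is meaningful.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical predictors; girth and length are likely to be highly correlated
X = pd.DataFrame({
    "girth":  [150, 155, 160, 165, 170, 175, 180],
    "length": [120, 126, 131, 135, 142, 146, 151],
})
X = sm.add_constant(X)   # include the intercept column

vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif)   # values > 10 indicate serious collinearity
```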
Logistic Regression
Outcome of interest is measured on a categorical scale
Usually dichotomous: yes/no, negative/positive, 0/1
Predictors can be measured on a continuous or categorical scale
Can we use a linear regression model when the outcome is dichotomous?
No, as we would be unable to interpret any predicted values of Y other than 0 or 1
Generalized linear models (GLM)
Random component: identifies the outcome variable Y & selects a probability distribution for it, e.g. normal, binomial, Poisson, negative binomial
Systematic component: specifies the linear combination of predictor variables, e.g. β0 + β1X1
Link function: specifies a function that relates the expected value of Y to the linear combination of predictor variables, i.e. it connects the random & systematic components
Gives us a linear relationship between our outcome variable & predictor(s)
Interpreting OR for continuous predictors
The factor by which the odds are ↑ (or ↓) for each unit change in the predictor
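A minimal sketch of fitting a logistic model and reading off odds ratios with statsmodels; the `disease`/`exposure` data below are invented purely for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: dichotomous outcome and a continuous exposure
dat = pd.DataFrame({
    "disease":  [0, 0, 1, 0, 1, 1, 0, 1, 0, 1],
    "exposure": [1.2, 0.8, 1.0, 2.1, 2.5, 1.1, 1.9, 2.8, 2.2, 1.5],
})

# Logit link: ln(p / (1 - p)) = beta0 + beta1 * exposure, fit by maximum likelihood
fit = smf.logit("disease ~ exposure", data=dat).fit()

print(np.exp(fit.params))  # exp(beta1) = odds ratio per one-unit increase in exposure
print(fit.summary())       # Wald statistics and p-values for each parameter
```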
Maximum likelihood estimation
used to estimate the regression parameters
Wald chi-squared test
used to evaluate the significance of individual parameters
Model assumptions logistic regression
Independence: the observations are independent from one another
Linearity: the relationship between the outcome (i.e. ln{p/(1 – p)}) and any continuous predictor variables is linear
Goodness-of-fit statistics address the differences between observed & predicted values or their ratio
Pearson χ2
Deviance χ2
Hosmer-Lemeshow test
Pearson & deviance χ²
Based on dividing the data into covariate patterns
Within each pattern, the predicted # of outcomes is computed & compared to the observed # of outcomes to yield the Pearson & deviance residuals
The Pearson & deviance chi-squared statistics represent the sums of the respective squared residuals
Hosmer-Lemeshow test
Based on dividing the data in a more arbitrary fashion, e.g. percentiles of estimated probability
Predicted & observed outcome probabilities within each group are compared as before
More reliable if the # of covariate patterns is high relative to the # of observations
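As far as I know there is no single built-in Hosmer-Lemeshow call in statsmodels, but the grouping logic is easy to sketch by hand; the helper below (its name, the ten groups, and the commented use of the logistic `fit` from the sketch above) is an assumption for illustration only.

```python
import pandas as pd
from scipy.stats import chi2

def hosmer_lemeshow(y, p, groups=10):
    """Rough Hosmer-Lemeshow statistic: split observations by percentiles of
    predicted probability, then compare observed vs expected events per group."""
    d = pd.DataFrame({"y": y, "p": p})
    d["g"] = pd.qcut(d["p"], q=groups, duplicates="drop")
    grouped = d.groupby("g", observed=True)
    obs = grouped["y"].sum()      # observed events per group
    exp = grouped["p"].sum()      # expected events per group
    n = grouped["y"].count()
    stat = (((obs - exp) ** 2) / (exp * (1 - exp / n))).sum()
    dof = len(obs) - 2            # conventional degrees of freedom
    return stat, chi2.sf(stat, dof)

# e.g. using the logistic fit above:
# stat, pval = hosmer_lemeshow(dat["disease"], fit.predict(dat))
```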
Poisson Regression
Outcome of interest is measured on a discrete scale. e.g. # of cases of disease, # of deaths
Predictors can be measured on a continuous or categorical (including dichotomous) scale
Model assumptions (Poisson regression)
Independence: the observations are independent from one another
Linearity: the relationship between the outcome, i.e. ln (μ/N), & any continuous predictor variables is linear
Mean = variance
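A sketch of a Poisson GLM in statsmodels with the denominator N entering as an offset of ln(N); the herd sizes, case counts, and the `parlour` predictor are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical herd-level data: case counts, herd size (N), and a predictor
herds = pd.DataFrame({
    "cases":     [2, 0, 5, 1, 3, 7, 0, 4],
    "herd_size": [120, 80, 200, 90, 150, 260, 70, 180],
    "parlour":   [0, 0, 1, 0, 1, 1, 0, 1],
})

# ln(mu / N) = beta0 + beta1 * parlour  <=>  ln(mu) = ln(N) + beta0 + beta1 * parlour
pois = smf.glm(
    "cases ~ parlour",
    data=herds,
    family=sm.families.Poisson(),
    offset=np.log(herds["herd_size"]),
).fit()

print(np.exp(pois.params))  # incidence rate ratios
```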
Overdispersion
Greater variability than expected for a GLM
Count data often vary more than would be expected if the response distribution was truly Poisson, i.e. the variance of the counts >> the mean
Data as counts
nosocomial infections
cases of CVD
workplace injuries
Problem of overdispersion
Model-based estimates of variance are too small
Thus, estimates of SE for model parameters might not be appropriate
Inference (based on Wald statistics and corresponding p-values) is questionable
Methods of managing overdispersion
Account for clustering if present
Compute scaled SEs of parameter estimates
Assume a more flexible distribution, e.g. negative binomial distribution
Indicators of overdispersion
Can be quantified by dividing the Pearson or deviance χ2 by its df
Pearson χ² usually performs better
Values > 1 indicate overdispersion
Scaled SEs
One approach is to refit the model while allowing the variance to have a multiplicative scaling (dispersion) factor
The χ² statistic (Pearson is preferred) divided by its df is used as the dispersion factor
Scaled deviance and scaled Pearson χ² values (original values divided by the dispersion factor) are generated
The SEs of the parameter estimates are scaled through multiplication by the square root of the Pearson dispersion factor
Negative binomial regression
allows for larger variance than Poisson, i.e. the variance is not constrained to = the mean
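Continuing the hypothetical Poisson sketch above (it reuses the `herds` data frame and `pois` fit), a sketch of quantifying overdispersion and, if it is substantial, refitting with a negative binomial family; note that statsmodels' NegativeBinomial family keeps its dispersion parameter (alpha) fixed unless it is estimated separately.

```python
# Quantify overdispersion: Pearson chi-square divided by its degrees of freedom
dispersion = pois.pearson_chi2 / pois.df_resid
print(dispersion)   # values substantially > 1 indicate overdispersion

# One remedy: assume a more flexible response distribution (negative binomial),
# which does not constrain the variance to equal the mean
negbin = smf.glm(
    "cases ~ parlour",
    data=herds,
    family=sm.families.NegativeBinomial(),  # alpha fixed at its default here
    offset=np.log(herds["herd_size"]),
).fit()
print(negbin.summary())
```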
Survival Analysis
Class of statistical methods for studying the occurrence & timing of events
Length of time that elapses before an event happens
Factors that are associated with the timing of that event
Death is often the event of interest, but these methods are broadly applicable
Also called time-to-event analysis
Event
Any qualitative change (transition from 1 discrete state to another) that can be situated in time
Quantitative changes are fine too if the threshold has relevance to real life
Timing
It is best (but not always possible) to know the exact timing of the event
Time origin should mark the onset of continuous exposure to risk of the event
Often this origin is unavailable, so must use a proxy
Seconds to years
Risk factors
We generally want to know if the risk of an event depends on various exposures
Censored data
Censored observations have unknown event times, e.g. due to loss to follow-up or termination of the study before event occurrence
Right censoring
The event has not been observed to occur
Left censoring
The event has already occurred before a given time (e.g. the start of the study)
Interval censoring
A combination of right & left, typically occurring when observations are made at infrequent intervals
Type I
Censoring that occurs because observation was stopped at a fixed time
Type II
Censoring that occurs because observation was stopped after a fixed number of events
Random
Censoring that occurs for reasons not under the control of the investigator, e.g. death or loss to follow-up
Our assumption is that censoring times are non-informative
Perform sensitivity analysis to assess
Survivor Function
Probability that an event time is > t
S(t) = P{T > t}
If event of interest is death, it’s the probability of surviving beyond time t
Estimated by proportion of individuals still alive at t
Hazard function
Can be thought of as the instantaneous risk of event occurrence at t, conditional on the event not occurring up to t
Interpreted as expected # of events per interval of time (for repeatable events)
Non-parametric Methods (Kaplan-Meier)
No assumptions about the distribution of event times or the relationship between predictors & the time until event
Straightforward
Estimates survivor functions (also called product-limit estimates)
Parametric Methods
The distribution of event times is assumed to be known
Each distribution is characterized by a unique hazard function
Exponential
Gompertz
Weibull
Semi-parametric
Cox proportional hazards
Kaplan-Meier is ideal for
Preliminary examination of data
Computing derived quantities from regression models (e.g. median survival time, 3-year probability of survival, etc.)
Comparing survivor functions between groups
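A minimal Kaplan-Meier sketch using the lifelines package (assumed to be installed; attribute names follow recent lifelines versions); the follow-up times, censoring indicators, and the 3-year time point are invented for illustration.

```python
from lifelines import KaplanMeierFitter

# Hypothetical follow-up times (years) and event indicators (1 = event, 0 = censored)
durations = [1.2, 2.5, 3.1, 0.8, 4.0, 2.2, 5.0, 3.7, 1.9, 4.6]
events    = [1,   0,   1,   1,   0,   1,   0,   1,   0,   1]

kmf = KaplanMeierFitter()
kmf.fit(durations, event_observed=events)

print(kmf.survival_function_)      # product-limit estimate of S(t)
print(kmf.median_survival_time_)   # derived quantity: median survival time
print(kmf.predict(3.0))            # estimated probability of surviving beyond 3 years
```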
Exponential model
Hazard is assumed to be constant over time
h(t) = λ
ln h(t) = μ
Unrealistic, but potentially useful as a first approximation even when the assumption is known to be false, e.g. the hazard of winning the lottery
Gompertz Model
The log of the hazard is a linear function of time
ln h(t) = μ + αt (where α is a constant)
e.g. hazard of retirement ↑ linearly with age; hazard of making work-related errors ↓ linearly with time spent in your career
The hazard may ↑ or ↓ with time but may not change direction
Weibull Model
The log of the hazard is a linear function of the log of time
ln h(t) = μ + αln(t) (where α is a constant)
Again, the hazard may ↑ or ↓ with time but only in one direction
Cox proportional hazards model (def)
The distribution of event times does not need to be specified, i.e. no need to specify how the hazard depends on time
Doesn't assume specific shape for baseline hazard
But assumes proportional hazards between groups
Flexible like non-parametric (Kaplan-Meier)
Powerful like parametric (Weibull/Gompertz)
Cox proportional hazards model (limitation)
The model can only be used to compare hazards between subjects, not predict the value of the hazard for any one subject
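A sketch of a Cox proportional hazards fit with lifelines; the data frame, column names, and single `treatment` covariate are hypothetical, and the commented assumption check is only available in reasonably recent lifelines versions.

```python
import pandas as pd
from lifelines import CoxPHFitter

# Hypothetical data: follow-up time, event indicator, and one covariate
df_surv = pd.DataFrame({
    "duration":  [1.2, 2.5, 3.1, 0.8, 4.0, 2.2, 5.0, 3.7, 1.9, 4.6],
    "event":     [1,   0,   1,   1,   0,   1,   0,   1,   0,   1],
    "treatment": [0,   1,   0,   0,   1,   0,   1,   1,   0,   1],
})

cph = CoxPHFitter()
cph.fit(df_surv, duration_col="duration", event_col="event")

cph.print_summary()                # hazard ratios = exp(coef), Wald tests, LRT vs the null model
# cph.check_assumptions(df_surv)   # diagnostic check of the proportional-hazards assumption
```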
Model fit: likelihood-ratio test (LRT)
Compares the likelihood of the full model (with all your covariates included)
with that of the null model (only the baseline hazard α(t))
log-rank
when survival differences matter
long term treatment effects
chronic disease studies
Wilcoxon
when early differences are important
Acute treatments
early deaths matter more
when many people drop out late
testing survival functions
log-rank
wilcoxon
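A sketch of comparing two groups' survivor functions with the log-rank test in lifelines; the two groups reuse made-up times and event indicators, and the Wilcoxon-style weighting shown in the comment is only an option in recent lifelines versions.

```python
from lifelines.statistics import logrank_test

# Hypothetical two-group comparison (times in years, 1 = event, 0 = censored)
t_a = [1.2, 2.5, 3.1, 0.8, 4.0]; e_a = [1, 0, 1, 1, 0]
t_b = [2.2, 5.0, 3.7, 1.9, 4.6]; e_b = [1, 0, 1, 0, 1]

res = logrank_test(t_a, t_b, event_observed_A=e_a, event_observed_B=e_b)
print(res.test_statistic, res.p_value)

# If available, a weighted variant that up-weights early differences:
# res_w = logrank_test(t_a, t_b, event_observed_A=e_a, event_observed_B=e_b,
#                      weightings="wilcoxon")
```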