Linear, Logistic, Poisson, Survival Analysis
Regression Analysis
Used to determine whether one or more independent variables are associated with a dependent variable
Independent variable
Explanatory variable
Predictor variable
X
Dependent variable
Response variable
Outcome variable
Y
What is a statistical model?
The equation that describes the putative relationship among variables.
Multivariable analysis
Inferences based on the parameter for any independent variable are conditional on the other independent variables in the model.
Avoid omitting potential confounders while not including variables of minimal consequence.
Linear Regression
Outcome is measured on a CONTINUOUS scale, e.g. body weight
Predictors can be measured on a continuous or categorical (including dichotomous) scale
Linear Regression Example
Is chest girth (cm) significantly associated with body weight (kg) among heifers?
How do we determine the line that best fits our data?
The method of least squares is used to estimate the parameters (in this case, β0 and β1) in such a way as to minimize the sum of the squared residuals
What is a residual?
Used to estimate error in the model
The difference between an observed value of Y & its predicted value for a given value of X
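The least-squares fit and the residuals described above can be sketched in a few lines of plain Python. This is a minimal illustration for one predictor, not a replacement for a statistics package; the function names `fit_line` and `residuals` and the heifer measurements are hypothetical, made up for the example.

```python
def fit_line(x, y):
    """Least-squares estimates (b0, b1) for the line y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx            # slope
    b0 = y_bar - b1 * x_bar   # intercept
    return b0, b1


def residuals(x, y, b0, b1):
    """Observed Y minus predicted Y at each value of X."""
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]


# Hypothetical heifer data, invented for illustration only.
girth_cm = [150, 155, 160, 165, 170]
weight_kg = [300, 322, 338, 361, 379]

b0, b1 = fit_line(girth_cm, weight_kg)
res = residuals(girth_cm, weight_kg, b0, b1)
print(f"b0 = {b0:.1f}, b1 = {b1:.2f}")
print("sum of squared residuals:", round(sum(e ** 2 for e in res), 2))
```

With an intercept in the model, the residuals sum to (approximately) zero, which is a quick sanity check on any fit.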
T-test LR
t = β/SE(β)
Used to evaluate whether the predictor is significantly associated with the outcome
H0: β=0
HA: β≠0
A significant t value denotes that the predictor explains some of the variation in the outcome
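As a rough sketch of how the t statistic for the slope is obtained in simple linear regression (one predictor), the function below divides the estimated slope by its standard error. The name `slope_t_statistic` is hypothetical, and the sketch assumes the usual formula SE(β1) = sqrt(MSE / Sxx) with n – 2 degrees of freedom; the p-value would then be read from a t distribution, which is not done here.

```python
import math


def slope_t_statistic(x, y):
    """Return (t, df) for H0: slope = 0 in a one-predictor linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    # Mean squared error: sum of squared residuals over n - 2 degrees of freedom.
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    mse = sse / (n - 2)
    se_b1 = math.sqrt(mse / sxx)
    return b1 / se_b1, n - 2


# Hypothetical data, invented for illustration only.
t, df = slope_t_statistic([150, 155, 160, 165, 170], [300, 322, 338, 361, 379])
print(f"t = {t:.2f} on {df} degrees of freedom")
```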
R²
Describes the proportion of variance in the outcome variable that is explained by the predictor(s)
It never ↓ as predictors are added to the model, even uninformative ones (thus it can’t be used for variable selection)
Adjusted R²
Its value is adjusted for the number of predictor variables (k) in the model
Will ↓ if added predictors have minimal additional impact on the outcome
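The two statistics can be compared with a short sketch. It assumes the standard definitions R² = 1 – SSE/SST and adjusted R² = 1 – (1 – R²)(n – 1)/(n – k – 1), where n is the number of observations and k the number of predictors; the function names are hypothetical.

```python
def r_squared(y, y_hat):
    """Proportion of the variance in y explained by the predictions y_hat."""
    y_bar = sum(y) / len(y)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # unexplained
    sst = sum((yi - y_bar) ** 2 for yi in y)               # total
    return 1 - sse / sst


def adjusted_r_squared(r2, n, k):
    """Penalize R² for the number of predictors k, given n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# Adding a weak predictor: R² creeps up, adjusted R² goes down.
print(adjusted_r_squared(0.500, n=11, k=2))
print(adjusted_r_squared(0.505, n=11, k=3))
```

In the usage lines, the second model has a slightly higher R² but a lower adjusted R², which is the behavior the card describes.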
Model Assumptions (LR)
Independence: the values of the outcome variable are independent from one another, i.e. no clustered data
Linearity: the relationship between the outcome and any continuous predictor variables is linear
Normal distribution: the residuals are normally distributed
Homoscedasticity: the variance of the residuals is the same across the range of predicted values of y
What if underlying assumptions are not met?
Independence & linearity assumptions are the most important
Can do data transformation, e.g. logarithmic
Can proceed as planned if there are moderate departures from normality & homoscedasticity
Cook’s Distance (Di)
assesses the influence of each observation
Standardized measure of the change in regression parameters if the particular observation was omitted
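For a one-predictor model, Cook's distance can be computed from each residual and its leverage. The sketch below assumes the common formula Di = e_i² · h_i / (p · MSE · (1 – h_i)²), with p = 2 estimated parameters and leverage h_i = 1/n + (x_i – x̄)²/Sxx; the function name and the data are hypothetical.

```python
def cooks_distance(x, y):
    """Cook's distance for each observation in a one-predictor linear regression."""
    n = len(x)
    p = 2  # parameters estimated: intercept and slope
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    mse = sum(e ** 2 for e in resid) / (n - p)
    distances = []
    for xi, e in zip(x, resid):
        h = 1 / n + (xi - x_bar) ** 2 / sxx  # leverage of this observation
        distances.append(e ** 2 * h / (p * mse * (1 - h) ** 2))
    return distances


# Hypothetical data: the last point sits far from the others in X and off their line.
d = cooks_distance([1, 2, 3, 4, 10], [1, 2, 3, 4, 20])
print([round(v, 2) for v in d])
```

The last observation has by far the largest distance here because it combines high leverage with a position that pulls the fitted line toward itself.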
Collinearity
presence of highly correlated predictor variables in the model
Results in large standard errors of regression parameters
Leads to t-test statistics that are spuriously small and thus p-values that are misleading
Assessed using variance inflation factor (VIF)
Variance inflation factor (VIF)
Measures how much the variance of regression coefficients in the model is inflated by addition of a predictor variable that contains very similar information
Values of VIF > 10 indicate serious collinearity
The SE of a regression parameter will ↑ by a factor of about the square root of VIF when a collinear predictor variable is added to the model
VIF = 1/(1 – R²_X)
where R²_X is the coefficient of determination describing the amount of variance in the incoming X that is explained by the predictors already in the model
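A small sketch of the VIF arithmetic follows. It takes the R² from regressing the incoming predictor on the predictors already in the model as given; the function names `vif` and `se_inflation` are hypothetical.

```python
import math


def vif(r2_x):
    """Variance inflation factor from the R² of the incoming X on the other predictors."""
    return 1 / (1 - r2_x)


def se_inflation(r2_x):
    """Approximate factor by which the SE of a regression parameter increases."""
    return math.sqrt(vif(r2_x))


for r2 in (0.0, 0.5, 0.9):
    print(f"R2_X = {r2:.1f}: VIF = {vif(r2):.1f}, SE factor = {se_inflation(r2):.2f}")
```

An R² of 0.9 between the incoming predictor and those already present corresponds to a VIF of about 10, the threshold quoted above.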
Logistic Regression
Outcome of interest is measured on a categorical scale
Usually dichotomous: yes/no, negative/positive, 0/1
Predictors can be measured on a continuous or categorical scale
Can we use a linear regression model for a dichotomous outcome?
No, because we would be unable to interpret any predicted values of Y other than 0 or 1
Generalized linear models (GLM)
Random component: identifies the outcome variable Y & selects a probability distribution for it, e.g. normal, binomial, Poisson, negative binomial
Systematic component: specifies the linear combination of predictor variables, e.g. β0 + β1X1
Link function: specifies a function that relates the expected value of Y to the linear combination of predictor variables, i.e. it connects the random & systematic components
Gives us a linear relationship between our outcome variable & predictor(s)
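To make the link function concrete, the sketch below shows the logit link used in logistic regression and the log link used in Poisson regression, each with its inverse. The coefficient values are hypothetical, chosen only to show the round trip between the linear predictor and the expected value of Y.

```python
import math


def logit(p):
    """Logit link: maps a probability in (0, 1) to the whole real line."""
    return math.log(p / (1 - p))


def inv_logit(eta):
    """Inverse logit: maps a linear predictor back to a probability."""
    return 1 / (1 + math.exp(-eta))


def log_link(mu):
    """Log link: maps a positive mean to the whole real line."""
    return math.log(mu)


def inv_log_link(eta):
    """Inverse log link: maps a linear predictor back to a positive mean."""
    return math.exp(eta)


# Hypothetical coefficients for a linear predictor b0 + b1 * x1.
b0, b1, x1 = -2.0, 0.5, 3.0
eta = b0 + b1 * x1
print("predicted probability:", round(inv_logit(eta), 3))
print("predicted count mean:", round(inv_log_link(eta), 3))
```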
Interpreting OR for continuous predictors
The factor by which the odds are ↑ (or ↓) for each unit change in the predictor
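Since the logistic model is linear on the log-odds scale, the odds ratio for a continuous predictor is obtained by exponentiating its coefficient. The sketch below assumes a coefficient β on the log-odds scale; the function name and the coefficient value are hypothetical.

```python
import math


def odds_ratio(beta, units=1.0):
    """Odds ratio for a change of `units` in a continuous predictor."""
    return math.exp(beta * units)


beta = 0.04  # hypothetical log-odds change per unit of the predictor
print("OR per 1 unit:", round(odds_ratio(beta), 3))
print("OR per 10 units:", round(odds_ratio(beta, 10), 3))
```

The odds ratio for a 10-unit change is the 1-unit odds ratio raised to the 10th power, not 10 times it.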
Maximum likelihood estimation
used to estimate the regression parameters
Wald chi-squared test
used to evaluate the significance of individual parameters
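For a single parameter, the Wald statistic is commonly computed as (β/SE)² and compared to a chi-squared distribution with 1 degree of freedom. The sketch below assumes that form; the function name and the input values are hypothetical. The p-value uses the identity P(χ²₁ > w) = erfc(sqrt(w/2)), which holds only for 1 degree of freedom.

```python
import math


def wald_test(beta, se):
    """Return (Wald chi-squared statistic, p-value) for H0: beta = 0, 1 df."""
    w = (beta / se) ** 2
    p_value = math.erfc(math.sqrt(w / 2))  # upper tail of chi-squared with 1 df
    return w, p_value


# Hypothetical estimate and standard error.
w, p = wald_test(beta=0.9, se=0.3)
print(f"Wald chi-squared = {w:.2f}, p = {p:.4f}")
```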
Model assumptions logistic regression
Independence: the observations are independent from one another
Linearity: the relationship between the outcome (i.e. ln{p/(1 – p)}) and any continuous predictor variables is linear
Goodness-of-fit statistics address the differences between observed & predicted values or their ratio
Pearson χ²
Deviance χ²
Hosmer-Lemeshow test
Pearson & deviance χ²
Based on dividing the data into covariate patterns
Within each pattern, the predicted # of outcomes is computed & compared to the observed # of outcomes to yield the Pearson & deviance residuals
The Pearson & deviance chi-squared statistics represent the sums of the respective squared residuals
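The calculation described above can be sketched for grouped binary data. Each covariate pattern is represented by a hypothetical tuple (n, y, p): the number of observations in the pattern, the observed number of outcomes, and the model's predicted probability. The sketch assumes the usual binomial forms of the Pearson and deviance residuals and sums their squares directly.

```python
import math


def _term(observed, expected):
    """observed * ln(observed / expected), taken as 0 when observed is 0."""
    return 0.0 if observed == 0 else observed * math.log(observed / expected)


def gof_chi_squared(patterns):
    """Return (Pearson chi-squared, deviance chi-squared) over covariate patterns."""
    pearson = 0.0
    deviance = 0.0
    for n, y, p in patterns:
        expected = n * p  # predicted number of outcomes in this pattern
        # Squared Pearson residual.
        pearson += (y - expected) ** 2 / (n * p * (1 - p))
        # Squared deviance residual.
        deviance += 2 * (_term(y, expected) + _term(n - y, n * (1 - p)))
    return pearson, deviance


# Hypothetical covariate patterns: (observations, observed outcomes, predicted probability).
patterns = [(10, 6, 0.5), (20, 3, 0.2), (15, 12, 0.7)]
pearson, deviance = gof_chi_squared(patterns)
print(f"Pearson = {pearson:.3f}, deviance = {deviance:.3f}")
```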
Hosmer-Lemeshow test
Based on dividing the data in a more arbitrary fashion, e.g. percentiles of estimated probability
Predicted & observed outcome probabilities within each group are compared as before
More reliable if the # of covariate patterns is high relative to the # of observations
Poisson Regression
Outcome of interest is measured on a discrete (count) scale, e.g. # of cases of disease, # of deaths
Predictors can be measured on a continuous or categorical (including dichotomous) scale
Model assumptions
Independence: the observations are independent from one another
Linearity: the relationship between the outcome, i.e. ln(μ/N), & any continuous predictor variables is linear
Mean = variance
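A crude first look at the mean = variance assumption is the ratio of the sample variance to the sample mean of the counts. The sketch below is only a rough screen with no predictors in the picture (the assumption in the model is conditional on the predictors); the function name and the counts are hypothetical.

```python
from statistics import mean, variance


def dispersion_ratio(counts):
    """Sample variance divided by sample mean; about 1 under a Poisson distribution."""
    return variance(counts) / mean(counts)


# Hypothetical counts, e.g. cases of disease per herd.
equidispersed = [2, 3, 1, 4, 0, 2]
overdispersed = [0, 0, 0, 10, 0, 14]

print("ratio (roughly Poisson):", dispersion_ratio(equidispersed))
print("ratio (overdispersed):", round(dispersion_ratio(overdispersed), 2))
```

A ratio well above 1 suggests overdispersion, in which case a distribution such as the negative binomial may describe the counts better than the Poisson.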