Goodness of Model Fit: Notes
Simple Regression
- Simple regression is a mathematical description of the linear relationship between two variables, Y and X.
- The equation is represented as Y = bX + a, where:
- Y is the dependent variable.
- X is the independent variable.
- b is the slope, representing the predicted unit change in Y for each unit change in X.
- a is the intercept.
- A test of b determines whether the linear relationship between Y and X is statistically significant, that is, whether b differs reliably from zero.
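A minimal sketch of how b and a in Y = bX + a can be estimated, assuming ordinary least squares (the notes do not name the estimation method). The function name `fit_simple_regression` and the data are hypothetical, chosen only for illustration.

```python
from statistics import mean


def fit_simple_regression(x, y):
    """Return (b, a) for the line Y = bX + a, assuming ordinary least squares."""
    x_bar, y_bar = mean(x), mean(y)
    # Sum of cross-products of deviations, and sum of squared deviations of X.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx          # slope: predicted change in Y per unit change in X
    a = y_bar - b * x_bar  # intercept: predicted Y when X = 0
    return b, a


# Hypothetical data, invented for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b, a = fit_simple_regression(x, y)
print(f"Y = {b:.1f}X + {a:.1f}")
```

For these five points the slope works out to 0.6 and the intercept to 2.2, so each one-unit increase in X predicts a 0.6-unit increase in Y.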
Topics Covered
- Introduction to the topic.
- Data cleaning techniques.
- Statistical inference.
- Descriptive and inferential correlation.
- Simple regression model.
- Goodness of model fit.
- Multiple regression.
Goodness of Model Fit
- Evaluates how well a linear equation represents a relationship.
- A scatterplot visually examines the fit of the regression model.
- This is analogous to using a histogram to assess whether the mean is an appropriate measure of central tendency.
- Factors affecting the regression model:
- Outliers.
- Restricted range.
- Heterogeneous subsamples ('clumps').
Errors of Fit (Residuals)
- Errors of fit are called residuals, denoted as e.
- Calculated as the difference between the observed value (Y) and the predicted value (\hat{Y}): e = Y - \hat{Y}.
- Errors of fit can be examined as deviations on Y (residuals) and deviations on X (leverages).
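The residual formula e = Y - \hat{Y} can be sketched in a few lines. The data are hypothetical, and the coefficients b = 0.6 and a = 2.2 are assumed to be the least-squares line for these five points.

```python
# Hypothetical data and coefficients, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b, a = 0.6, 2.2

y_hat = [b * xi + a for xi in x]                   # predicted values
residuals = [yi - yh for yi, yh in zip(y, y_hat)]  # e = Y - Y_hat

for xi, yi, yh, e in zip(x, y, y_hat, residuals):
    print(f"X={xi}  Y={yi}  Y_hat={yh:.1f}  e={e:+.1f}")
```

A positive residual means the observed Y lies above the regression line; a negative one means it lies below. For a least-squares line with an intercept, the residuals sum to zero.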
Standardized Score
- A score is "standardized" when it is expressed as a deviation from its mean, divided by the standard deviation (SD).
- Formula: z = \frac{x - \bar{X}}{S_x}, where:
- \bar{X} is the mean of x.
- S_x is the standard deviation of x.
- A standardized score of z = +1 means the score is 1 SD higher than the mean.
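A short sketch of the standardization formula, assuming S_x is the sample standard deviation (N - 1 denominator); the notes do not say which denominator is intended. The function name `standardize` and the scores are hypothetical.

```python
from statistics import mean, stdev


def standardize(scores):
    """Return z-scores; assumes the sample SD (N - 1 denominator)."""
    x_bar = mean(scores)
    s_x = stdev(scores)
    return [(x - x_bar) / s_x for x in scores]


scores = [2, 4, 4, 5, 7, 8]  # hypothetical raw scores with mean 5
z = standardize(scores)
for raw, zi in zip(scores, z):
    print(f"x={raw}  z={zi:+.2f}")
```

After standardization the scores have a mean of 0 and an SD of 1, and a raw score equal to the mean (5 here) has z = 0.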
Standardized Residuals
- Formula: z = \frac{Y - \hat{Y}}{S_{Y.X}}, where:
- \hat{Y} is the predicted score of Y.
- S_{Y.X} is the standard deviation of the errors (Y - \hat{Y}), also known as the standard error of the estimate.
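A sketch of standardized residuals for a one-predictor model. It assumes a least-squares fit and that S_{Y.X} is computed with N - 2 degrees of freedom; both are assumptions, since the notes give only the general formula. The data are hypothetical, with one deliberately inflated case.

```python
from math import sqrt
from statistics import mean


def standardized_residuals(x, y):
    """Fit Y = bX + a by least squares and return residuals divided by S_{Y.X}."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = y_bar - b * x_bar
    residuals = [yi - (b * xi + a) for xi, yi in zip(x, y)]
    # Assumption: S_{Y.X} uses N - 2 degrees of freedom (one predictor).
    s_yx = sqrt(sum(e ** 2 for e in residuals) / (n - 2))
    return [e / s_yx for e in residuals]


# Hypothetical data: Y equals X except for one inflated case at X = 5.
x = list(range(1, 11))
y = [1, 2, 3, 4, 15, 6, 7, 8, 9, 10]
z = standardized_residuals(x, y)

# Flag cases whose standardized residual lies beyond +/- 1.96.
flagged = [i for i, zi in enumerate(z) if abs(zi) > 1.96]
print(flagged)
```

Only the inflated case (index 4) is flagged; its standardized residual is about 2.68, while every other case stays well inside the bounds.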
Residual Plot Analysis
- Visually examine the residual plot for outlying residuals.
- Outlying residuals in the plot justify proceeding to a detailed residual analysis.
- The justification required is the occurrence of residuals beyond z = \pm 1.96 (roughly \pm 2) on either the Y or X axis.
Residual Statistics and Limits
- Many residual statistics exist.
- Four main types:
- Standardized residuals beyond \pm 1.96
- Leverages > 2(p+1)/N
- Covariance ratios beyond 1 \pm 3(p+1)/N
- Mahalanobis distance with p < .001
- Where: in the leverage and covariance ratio limits, p = number of predictors and N = sample size; in the Mahalanobis criterion, p = probability under H_0.
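The leverage and covariance ratio limits can be computed directly from p and N. The sketch below also computes leverages for a one-predictor model using the standard hat-value formula 1/N + (x - \bar{X})^2 / SS_x, which is an assumption here: some software reports centered leverage (without the 1/N term) instead. Function names and data are hypothetical.

```python
from statistics import mean


def residual_limits(p, n):
    """Cutoffs for p predictors and sample size n."""
    leverage_cutoff = 2 * (p + 1) / n
    cvr_low = 1 - 3 * (p + 1) / n
    cvr_high = 1 + 3 * (p + 1) / n
    return leverage_cutoff, (cvr_low, cvr_high)


def leverages(x):
    """Hat value of each case in simple regression: 1/N + (x - mean)^2 / SS_x."""
    n = len(x)
    x_bar = mean(x)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]


# Hypothetical X scores; the last case (30) lies far from the rest.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
cutoff, cvr_bounds = residual_limits(p=1, n=len(x))
h = leverages(x)
high_leverage = [i for i, hi in enumerate(h) if hi > cutoff]
print(f"leverage cutoff = {cutoff:.2f}, covariance ratio bounds = "
      f"({cvr_bounds[0]:.2f}, {cvr_bounds[1]:.2f}), flagged = {high_leverage}")
```

With p = 1 and N = 10 the leverage cutoff is 0.4 and the covariance ratio bounds are 0.4 and 1.6. Only the extreme X score (index 9) exceeds the leverage cutoff.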
Effect of Outlier Removal on R^2
- Removing an outlier may increase or decrease R^2 and F statistics, depending on how the outlier influenced the results prior to its removal.
- Removing a case high in leverage but low in standardized residual would decrease R^2, whereas the converse would increase R^2.
- Cases with very low or very high covariance ratios have to be assessed on a case-by-case basis.
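As an illustration of the second case above (low leverage, high standardized residual), the sketch below compares R^2 with and without one such case. The data are hypothetical, and R^2 is computed as the squared Pearson correlation, which holds for a one-predictor model.

```python
from statistics import mean


def r_squared(x, y):
    """R^2 for a simple regression of y on x (squared Pearson correlation)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy ** 2 / (sxx * syy)


# Hypothetical data: the case at index 4 (X = 5) sits near the mean of X
# (low leverage) but far from the line (large residual).
x = list(range(1, 11))
y = [1, 2, 3, 4, 15, 6, 7, 8, 9, 10]

r2_with = r_squared(x, y)

# Drop the outlying case and refit.
x_kept = x[:4] + x[5:]
y_kept = y[:4] + y[5:]
r2_without = r_squared(x_kept, y_kept)
print(f"R^2 with outlier: {r2_with:.3f}, without: {r2_without:.3f}")
```

Here R^2 rises from about 0.45 to 1.0 once the case is removed, because the remaining points fall exactly on a line. Real data will rarely change this dramatically.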
Effect of Outlier Removal on F
- The F value is dependent on R^2.
- Removing cases reduces the sample size and therefore, for a given R^2, the F value.
- However, this effect is relatively small compared to a change in R^2.
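The dependence of F on R^2 and N can be made concrete with the standard formula F = (R^2 / p) / ((1 - R^2) / (N - p - 1)), which is not stated in the notes and is supplied here as an assumption. The R^2 and N values below are hypothetical.

```python
def f_from_r_squared(r2, p, n):
    """F = (R^2 / p) / ((1 - R^2) / (N - p - 1)) for p predictors, n cases."""
    return (r2 / p) / ((1 - r2) / (n - p - 1))


# Hypothetical values for a one-predictor model.
f_before = f_from_r_squared(r2=0.50, p=1, n=30)       # all cases
f_fewer_cases = f_from_r_squared(r2=0.50, p=1, n=29)  # one case removed, same R^2
f_higher_r2 = f_from_r_squared(r2=0.60, p=1, n=29)    # one case removed, R^2 rises
print(f_before, f_fewer_cases, f_higher_r2)
```

Losing one case with R^2 unchanged lowers F only from 28 to 27, whereas a rise in R^2 from .50 to .60 lifts F to about 40.5, which is why the change in R^2 dominates.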
Sampling from a Restricted Range
- Sampling from a restricted range of X (or Y) typically weakens the observed correlation and R^2, so the regression may understate a relationship that holds across the full range.
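A small sketch of the restricted-range problem, using contrived, hypothetical data in which Y rises with X plus an alternating error of +/- 3. The function name `pearson_r` and the cut points (8 to 11) are assumptions chosen for illustration.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation between x and y."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy / sqrt(sxx * syy)


# Contrived data: Y = X plus an error of +3 (even X) or -3 (odd X).
x = list(range(20))
y = [xi + (3 if xi % 2 == 0 else -3) for xi in x]
r_full = pearson_r(x, y)

# Keep only cases with 8 <= X <= 11 to mimic sampling from a restricted range.
kept = [(xi, yi) for xi, yi in zip(x, y) if 8 <= xi <= 11]
x_r = [xi for xi, _ in kept]
y_r = [yi for _, yi in kept]
r_restricted = pearson_r(x_r, y_r)
print(f"full range r = {r_full:.2f}, restricted range r = {r_restricted:.2f}")
```

Across the full range the correlation is about .88, but within the narrow band of X the error swamps the trend and the correlation falls to roughly zero.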
Heterogeneous Subsamples
- Heterogeneous subsamples can also affect the regression.
- When there are multiple subgroups within the data, each with different relationships between the variables, the overall regression model may not accurately represent any of the subgroups.
- Subsample 1 may show a strong relationship, while subsample 2 may show a weak relationship.
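A sketch of how pooling heterogeneous subsamples can produce a line that describes neither group. The two subsamples are hypothetical: one with a perfect linear relationship and one with almost none. The slope is estimated by least squares, which is an assumption.

```python
from statistics import mean


def slope(x, y):
    """Least-squares slope of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx


# Hypothetical subsample 1: strong relationship (Y = 2X).
x1, y1 = [1, 2, 3, 4, 5], [2, 4, 6, 8, 10]
# Hypothetical subsample 2: weak relationship.
x2, y2 = [6, 7, 8, 9, 10], [7, 5, 8, 6, 7]

b1 = slope(x1, y1)
b2 = slope(x2, y2)
b_pooled = slope(x1 + x2, y1 + y2)
print(f"subsample 1: {b1:.2f}, subsample 2: {b2:.2f}, pooled: {b_pooled:.2f}")
```

The pooled slope (about 0.35) falls between the two subsample slopes (2.0 and 0.1) and represents neither group well, which is the pattern a scatterplot with visible 'clumps' would warn about.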
Assumptions
- Using residual plots to assess:
- Independence (design issue).
- Linearity (no ‘bends’).
- Bivariate Normality (even densities of residuals).
- Homoscedasticity (homogeneity of variance).
Assessing Assumptions Using Residual Plots
- Independence: The errors should be independent of each other. This is primarily a design issue.
- Linearity: The relationship between the independent and dependent variables should be linear. Look for no 'bends' in the residual plot.
- Bivariate Normality: The residuals should be normally distributed around zero for all values of the independent variable. This implies that the densities of residuals are even.
- Homoscedasticity: The variance of the errors should be constant across all levels of the independent variable. Look for even spreads in the residual plot.
Violations of Assumptions
- If assumptions are met, there should be a random distribution of residuals.
- If not random, look for trends to suggest what assumptions are being violated:
- Curvilinear bends (non-linearity).
- Denser clumps to one side, or above or below the zero line (non-normality).
- Fanning out to one side (heterogeneity of variance).
Stating Violated Assumptions
- If violations are not evident, make a statement such as: "The residuals appear to be randomly distributed, with no bends (linearity), no denser portions above or below (or to one side of) the zero line (normality), and no fanning out to one side (homogeneity of variance)."
- State this in your own words, otherwise you risk being accused of plagiarism.
Summary
- Factors affecting regression (adequacy of the fit):
- Outliers.
- Restricted range.
- Heterogeneous subsamples (‘clumps’).
- Assumptions:
- Independence (design issue).
- Linearity (‘bends’).
- Bivariate Normality (uneven densities).
- Homoscedasticity (uneven spreads).