Errors of Fit and Outlier Detection

Errors of Fit and Residuals

  • Errors of fit are called residuals.

  • Residuals measure how far a data point falls from the value the model predicts for it: the vertical distance between the observed Y and the predicted Y.

  • A larger distance indicates a greater error.

Outliers

  • Outliers are data points that may belong to a different population than the one being studied.

  • Identifying and addressing outliers ensures accurate assessment of the population of interest.

  • Outlier-detection methods should be applied consistently to all datasets, following a pre-defined protocol.

Calculating Outlying Statistics

  • Many methods exist to calculate outlying statistics, but the focus will be on four key measures.

  • Leverage Statistic: Measures how far away a data point falls on the horizontal (X) axis from the mean of X.

  • Z Residual Statistic: Measures how far away a data point falls on the vertical (Y) axis from the predicted Y.

  • Covariance Ratio: Measures how far away a score falls from the diagonal (the line of best fit).

  • Mahalanobis Distance Statistic: Considers all three previous measures in one formula and is the primary method used.
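
As a concrete illustration of the first measure, leverage in a simple regression depends only on how far each X value sits from the mean of X. A minimal sketch on hypothetical predictor values (the data and variable names below are illustrative, not from the lecture):

```python
# Leverage in simple regression: h_i = 1/n + (x_i - mean_x)^2 / SS_x
# (hypothetical predictor values; 12 sits far from the mean of X)
xs = [1, 2, 3, 4, 5, 12]
n = len(xs)
mean_x = sum(xs) / n
ss_x = sum((x - mean_x) ** 2 for x in xs)
leverage = [1 / n + (x - mean_x) ** 2 / ss_x for x in xs]
print(leverage)  # the last case has by far the highest leverage
```

Leverages always sum to the number of estimated coefficients (here 2: intercept and slope), so a single case carrying most of that total stands out immediately.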

Mahalanobis Distance

  • The Mahalanobis distance statistic will be the primary tool for identifying outliers.

  • It is a well-researched and respected measure.

  • Tutorials will demonstrate how to run a regression using software to calculate the Mahalanobis distance.

Standardized Residuals

  • Standardizing residuals provides more meaningful information.

  • A Z score indicates how many standard deviations a score is above or below the mean.

  • For example, a Z score of 1 means the score is one standard deviation above the mean, performing better than approximately 84% of the population.
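
The percentile for a given Z score comes from the standard normal cumulative distribution, which can be checked with the standard library alone (the helper name below is ours):

```python
import math

def normal_cdf(z):
    # Standard normal cumulative distribution via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

print(normal_cdf(1.0))   # ~0.841: roughly 84% of scores fall below z = 1
print(normal_cdf(1.96))  # ~0.975: leaves 2.5% in the upper tail
```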

Using 1.96 or 2 as a Standard Deviation

  • Cutoffs of 1.96 (or, roughly, 2) standard deviations can be used to judge how far values fall from the rest of the population.

  • This relates to a 95% confidence interval: about 95% of normally distributed scores fall within ±1.96 standard deviations of the mean.

  • Standardizing helps identify when scores may belong to other populations.

Formula for Standardized Residuals

  • The formula for standardized residuals is: (Y − Predicted Y) / (Standard Deviation of the Errors)
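
A minimal sketch of that formula on hypothetical data, fitting a simple least-squares line in plain Python (values and names are illustrative):

```python
# Hypothetical toy data for a simple regression of ys on xs
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]
n = len(xs)

mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

predicted = [intercept + slope * x for x in xs]
residuals = [y - p for y, p in zip(ys, predicted)]

# Standard deviation of the errors (n - 2: intercept and slope are estimated)
sd_err = (sum(r ** 2 for r in residuals) / (n - 2)) ** 0.5

z_residuals = [r / sd_err for r in residuals]  # (Y - predicted Y) / SD of errors
```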

Scatter Plot of Residual Scores

  • A scatter plot of residual scores (Z residual vs. Z predicted scores) is the first step in identifying outliers.

  • A box is drawn around the data, typically at 1.96 (approximately 2) standard deviations from the mean in both the X and Y axes.

  • This corresponds to 2.5% in each tail (an overall alpha of 5%), marking extreme scores.
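
Flagging cases outside that box is a simple comparison on the standardized scores; the z values below are hypothetical:

```python
# Hypothetical standardized predicted and residual scores for six cases
z_pred  = [-0.8, 0.1, 1.2, -1.5, 0.4, 2.6]
z_resid = [ 0.5, -0.3, 1.1, 0.9, -2.4, 2.2]

CUTOFF = 1.96  # the "box" drawn at ~2 standard deviations on each axis

outside_box = [i for i, (zp, zr) in enumerate(zip(z_pred, z_resid))
               if abs(zp) > CUTOFF or abs(zr) > CUTOFF]
print(outside_box)  # indices of cases falling outside the box
```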

Visual Inspection and Prima Facie Case

  • Visual detection is required before applying statistical formulas to identify outliers.

  • This provides a prima facie case to avoid the appearance of fraudulent data modification.

  • If scores fall clearly outside the box, it supports the use of a statistical test to assess for outlying scores.

Regression without Outliers

  • Identifying outliers allows for re-running the regression without these points.

  • This may provide a better indication of the real relationship within the population of interest.

Steps in Residual Plot

  • The first step is the residual plot.

  • Drawing bounds at approximately ±1.96 on the horizontal and vertical axes indicates whether to proceed with outlier removal using a specific formula.

Four Approaches

  • Leverage

  • Z Residuals

  • Covariance Ratio

  • Mahalanobis Distance

Using Mahalanobis Distance Calculations

  • Running Mahalanobis distance calculations results in two new columns in the data: Mahalanobis distance for each score and the p-value of that score.

  • Data with p-values less than 0.001 may be excluded.
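
A self-contained sketch of those two columns for two hypothetical predictors. With two predictors, the squared Mahalanobis distance is compared against a chi-square distribution with 2 degrees of freedom, whose upper-tail probability has the closed form exp(−x/2); the data below are invented so that one case is extreme:

```python
import math

# Hypothetical predictors: 40 clustered cases plus one extreme case
x1 = [2.0 + 0.1 * (i % 5) for i in range(40)] + [9.0]
x2 = [5.0 + 0.1 * ((i * 3) % 7) for i in range(40)] + [1.0]
n = len(x1)

m1, m2 = sum(x1) / n, sum(x2) / n  # the centroid

# Sample covariance matrix (2 x 2) and its inverse
s11 = sum((a - m1) ** 2 for a in x1) / (n - 1)
s22 = sum((b - m2) ** 2 for b in x2) / (n - 1)
s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2)) / (n - 1)
det = s11 * s22 - s12 ** 2
inv00, inv01, inv11 = s22 / det, -s12 / det, s11 / det

def mahalanobis_sq(a, b):
    # Squared distance from the centroid, scaled by the covariance structure
    d1, d2 = a - m1, b - m2
    return d1 * d1 * inv00 + 2 * d1 * d2 * inv01 + d2 * d2 * inv11

d2s = [mahalanobis_sq(a, b) for a, b in zip(x1, x2)]       # first new column
pvals = [math.exp(-d / 2) for d in d2s]                    # second new column
flagged = [i for i, p in enumerate(pvals) if p < 0.001]    # exclusion rule
print(flagged)  # only the extreme case is flagged
```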

Identifying the Basis of Outliers

  • Scores are assessed to determine if they are outliers based on residual, leverage, or covariance ratio.

Concepts Behind Mahalanobis Distance

  • Mahalanobis distance is an offshoot of the Euclidean approach.

  • The Euclidean approach identifies the centroid (the region marked with an X): the point at the joint mean of the X and Y variables.

  • Euclidean distance measures how far a score falls from the centroid, treating all directions equally.

  • Mahalanobis distance also takes the regression line into account, measuring how far a score falls relative to the spread of the data along and across that line.
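
The contrast can be illustrated with a hypothetical covariance matrix for two strongly correlated variables: two displacements with the same Euclidean distance from the centroid get very different Mahalanobis distances depending on whether they fall along or across the correlation (all numbers below are invented for illustration):

```python
import math

# Hypothetical 2x2 covariance for strongly correlated variables (r = 0.9)
s11, s22, s12 = 1.0, 1.0, 0.9
det = s11 * s22 - s12 ** 2
inv00, inv01, inv11 = s22 / det, -s12 / det, s11 / det

def mahalanobis(d1, d2):
    # Distance from the centroid, scaled by the covariance structure
    return math.sqrt(d1 * d1 * inv00 + 2 * d1 * d2 * inv01 + d2 * d2 * inv11)

def euclidean(d1, d2):
    return math.sqrt(d1 ** 2 + d2 ** 2)

along = (2.0, 2.0)    # displacement along the correlation direction
across = (2.0, -2.0)  # same Euclidean distance, across the correlation

print(euclidean(*along), euclidean(*across))      # identical for both points
print(mahalanobis(*along), mahalanobis(*across))  # much larger across the line
```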

Multiple Predictors

  • Mahalanobis distance is especially useful with multiple predictors (multiple X's).

  • The centroid represents the simultaneous consideration of all X's.

Effect on Significance

  • Removing outliers reduces the sample size, which can make it harder to achieve statistical significance due to reduced power.

  • However, maintaining good scientific practices and ensuring data accuracy are more important than achieving significance.

Impact of Removing a Case

  • Removing a case can increase or decrease the R^2 and impact the F statistic.

  • The impact depends on the nature of the outlier (leverage, standardized residual).

  • Removing a case with high leverage but a low standardized residual, for example, could decrease the R^2.

  • Each case is unique.

Main Goal

  • The main goal is to clean the data properly so that the analysis is based on accurate data.

Effects on the F Statistic

  • Removing a case impacts R^2, which then impacts F. The F statistic helps identify if the regression is significant.

  • With a single predictor, the F statistic equals the square of the t statistic (F = t^2).

  • R^2 tells us the magnitude of the effect size.

  • Removing cases reduces the sample size and degrees of freedom, which can lower the F value even when R^2 improves.

Conclusion

  • Outliers are complex: removing them can increase or decrease both R^2 and significance.

  • Maintaining good scientific methods ensures that conclusions are valid and reliable.

  • Mahalanobis distance builds on Euclidean approaches but uses the line of best fit in conjunction with the centroid.