Errors of Fit and Outlier Detection
Errors of Fit and Residuals
Errors of fit are called residuals.
A residual is the distance between a data point and its predicted value on the regression line, i.e., how far the observed Y falls from the predicted Y.
A larger distance indicates a greater error of fit.
Outliers
Outliers are data points that appear to belong to a different population than the one being studied.
Identifying and addressing outliers ensures an accurate assessment of the population of interest.
The methods are applied consistently to all datasets using a pre-defined protocol.
Calculating Outlying Statistics
Many methods exist to calculate outlying statistics, but the focus will be on four key measures.
Leverage Statistic: Measures how far away a data point falls on the horizontal (X) axis from the mean of X.
Z Residual Statistic: Measures how far away a data point falls on the vertical (Y) axis from the predicted Y.
Covariance Ratio: Measures how far away a score falls from the diagonal.
Mahalanobis Distance Statistic: Considers all three previous measures in one formula and is the primary method used.
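Two of the measures above, leverage and Z residuals, can be computed directly from the regression. A minimal sketch with numpy, assuming a simple linear regression; the data and variable names here are illustrative, not from the lecture materials:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=30)
y = 2.0 * x + rng.normal(size=30)

# Design matrix with an intercept column.
X = np.column_stack([np.ones_like(x), x])

# Hat matrix H = X (X'X)^-1 X'; its diagonal gives each case's leverage,
# i.e., how far that case sits from the mean of X.
H = X @ np.linalg.inv(X.T @ X) @ X.T
leverage = np.diag(H)

# Fitted values and residuals.
beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta

# Standardized (Z) residuals: each residual divided by the SD of the errors.
z_resid = resid / resid.std(ddof=2)

print(leverage[:3], z_resid[:3])
```

The leverages always sum to the number of estimated coefficients (here 2), which is a quick sanity check on the hat-matrix calculation.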
Mahalanobis Distance
The Mahalanobis distance statistic will be the primary tool for identifying outliers.
It is a well-researched and respected measure.
Tutorials will demonstrate how to run a regression using software to calculate the Mahalanobis distance.
Standardized Residuals
Standardizing residuals provides more meaningful information.
A Z score indicates how many standard deviations a score is above or below the mean.
For example, a Z score of 1 means the score is one standard deviation above the mean, higher than approximately 84% of the population.
Using 1.96 or 2 as a Standard Deviation
Values of 1.96 (often rounded to 2) can be used as a cutoff to judge how far scores fall from the rest of the population.
This cutoff corresponds to a 95% confidence interval.
Standardizing helps identify when scores may belong to other populations.
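The percentages above come directly from the standard normal distribution. A quick check using scipy's normal CDF (assuming scipy is available; the values themselves are standard):

```python
from scipy.stats import norm

# Share of a normal population falling below a given Z score.
print(round(norm.cdf(1.0), 4))   # Z = 1 exceeds about 84% of the population
print(round(norm.cdf(1.96), 3))  # only 2.5% of scores lie beyond +1.96
```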
Formula for Standardized Residuals
The formula for standardized residuals is: (Y - Predicted Y) / (Standard Deviation of the Errors)
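The formula translates directly into code. A minimal sketch, with illustrative data:

```python
import numpy as np

y = np.array([3.1, 4.9, 7.2, 8.8, 11.1])       # observed Y
y_pred = np.array([3.0, 5.0, 7.0, 9.0, 11.0])  # predicted Y

errors = y - y_pred
# (Y - Predicted Y) / (Standard Deviation of the Errors)
z_resid = errors / errors.std(ddof=1)
print(z_resid)
```

By construction the standardized residuals have a standard deviation of 1, so values near ±2 mark unusually large errors of fit.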
Scatter Plot of Residual Scores
A scatter plot of residual scores (Z residual vs. Z predicted scores) is the first step in identifying outliers.
A box is drawn around the data, typically at 1.96 (approximately 2) standard deviations from the mean in both the X and Y axes.
This corresponds to 2.5% in each tail (an alpha of 5% two-tailed), marking extreme scores.
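The "box" test amounts to flagging any case whose standardized predicted score or standardized residual exceeds ±1.96 in absolute value. A numeric sketch (plotting omitted; the scores below are illustrative):

```python
import numpy as np

# Standardized predicted scores (X axis) and standardized residuals (Y axis).
z_pred = np.array([-0.4, 0.1, 2.3, -1.0, 0.8])
z_resid = np.array([0.2, -2.5, 0.3, 1.1, -0.6])

# A case is outside the box if either coordinate passes the 1.96 bound.
outside_box = (np.abs(z_pred) > 1.96) | (np.abs(z_resid) > 1.96)
print(outside_box)  # → [False  True  True False False]
```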
Visual Inspection and Prima Facie Case
Visual inspection is performed before applying statistical formulas to identify outliers.
This provides a prima facie case and avoids any appearance of fraudulent data manipulation.
If scores fall clearly outside the box, that supports running a statistical test to assess for outlying scores.
Regression without Outliers
Identifying outliers allows for re-running the regression without these points.
This may provide a better indication of the real relationship within the population of interest.
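A small sketch of the re-run, assuming a simple regression; the data, the residual cutoff of 2.5, and the helper function are illustrative:

```python
import numpy as np

def r_squared(x, y):
    # R^2 for a straight-line fit: 1 - (residual variance / total variance).
    beta = np.polyfit(x, y, 1)
    resid = y - np.polyval(beta, x)
    return 1 - resid.var() / y.var()

x = np.array([1., 2., 3., 4., 5., 6.])
y = np.array([1.1, 2.0, 2.9, 4.2, 5.0, 0.0])  # the last case looks aberrant

# Keep only cases whose residual from the full fit is within a cutoff.
resid = y - np.polyval(np.polyfit(x, y, 1), x)
keep = np.abs(resid) < 2.5

print(r_squared(x, y), r_squared(x[keep], y[keep]))
```

With the aberrant case removed, the refit tracks the remaining points closely and R^2 rises sharply, which is the "better indication of the real relationship" the notes describe.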
Steps in Residual Plot
The first step is the residual plot.
Bounding around 1.96 on the horizontal and vertical axes indicates whether to proceed with outlier removal using a specific formula.
Four Approaches
Leverage
Z Residuals
Covariance Ratio
Mahalanobis Distance
Using Mahalanobis Distance Calculations
Running Mahalanobis distance calculations results in two new columns in the data: Mahalanobis distance for each score and the p-value of that score.
Data with p-values less than 0.001 may be excluded.
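The two new columns can be reproduced by hand. A sketch, assuming numpy and scipy are available; the data, the planted outlier, and the chi-square approximation for the p-value are illustrative rather than the lecture's exact software output:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(1)
data = rng.normal(size=(50, 2))   # columns: X and Y scores
data[0] = [6.0, -6.0]             # plant an obvious outlier

# Mahalanobis distance of each case from the centroid.
center = data.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(data, rowvar=False))
diff = data - center
d2 = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)  # squared distances

# Under normality, d2 is roughly chi-square with df = number of variables,
# which yields a p-value per case.
p = chi2.sf(d2, df=data.shape[1])
exclude = p < 0.001
print(exclude.sum())
```

Each row thus gets a distance and a p-value, and rows with p below 0.001 are candidates for exclusion.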
Identifying the Basis of Outliers
Scores are assessed to determine if they are outliers based on residual, leverage, or covariance ratio.
Concepts Behind Mahalanobis Distance
Mahalanobis distance is an offshoot of the Euclidean approach.
The Euclidean approach identifies the centroid (the region marked with an X), which represents the center of the data considering the X and Y variables together.
Euclidean distance measures how far a score falls from the centroid in any direction.
Mahalanobis distance additionally considers the regression line, measuring how far a score falls relative to that line.
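This difference can be made concrete: when X and Y are strongly related, two points equally far from the centroid in Euclidean terms get very different Mahalanobis distances depending on whether they sit along or off the trend. A sketch with an illustrative covariance matrix:

```python
import numpy as np

cov = np.array([[1.0, 0.9],
                [0.9, 1.0]])   # strong positive X-Y relationship
cov_inv = np.linalg.inv(cov)
center = np.zeros(2)

on_trend = np.array([2.0, 2.0])    # along the regression direction
off_trend = np.array([2.0, -2.0])  # same Euclidean distance, off the line

def mahal(p):
    d = p - center
    return float(np.sqrt(d @ cov_inv @ d))

print(mahal(on_trend), mahal(off_trend))
```

Both points are the same Euclidean distance from the centroid, yet the off-trend point's Mahalanobis distance is several times larger, which is exactly why the measure is better at flagging scores that violate the fitted relationship.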
Multiple Predictors
Mahalanobis distance is especially useful with multiple predictors (multiple X's).
The centroid represents the simultaneous consideration of all X's.
Effect on Significance
Removing outliers reduces the sample size, which can make it harder to achieve statistical significance due to reduced power.
However, maintaining good scientific practices and ensuring data accuracy are more important than achieving significance.
Impact of Removing a Case
Removing a case can increase or decrease the R^2 and impact the F statistic.
The impact depends on the nature of the outlier (leverage, standardized residual).
Removing a case with high leverage but a low standardized residual could decrease the R^2.
Each case is unique.
Main Goal
The main goal is to clean the data properly so that the analysis rests on accurate data.
Effects of F statistic
Removing a case impacts R^2, which in turn impacts F. The F statistic helps identify whether the regression is significant.
In simple regression, the F statistic equals the squared t statistic.
R^2 tells us the magnitude of the effect size.
Removing cases also reduces the sample size, which can lower the F value even when the change in R^2 is small.
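The F-to-t^2 relationship and the role of sample size can both be seen in the formula F = [R^2 / 1] / [(1 - R^2) / (n - 2)] for a single predictor. A sketch using scipy's linregress on illustrative data:

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(2)
n = 25
x = rng.normal(size=n)
y = 1.5 * x + rng.normal(size=n)

fit = linregress(x, y)
t = fit.slope / fit.stderr       # t statistic of the slope
r2 = fit.rvalue ** 2
F = r2 / (1 - r2) * (n - 2)      # overall F for a one-predictor regression

print(round(F, 4), round(t * t, 4))  # the two values coincide
```

The (n - 2) term also shows why dropping cases pulls F down: with fewer cases, the same R^2 produces a smaller F.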
Conclusion
Outliers are complex: removing them can increase or decrease R^2 and significance.
Maintaining good scientific methods ensures conclusions are valid and reliable.
Mahalanobis distance builds on Euclidean approaches but uses the line of best fit in conjunction with the centroid.