Errors of Fit and Outlier Detection
Errors of Fit and Residuals
Errors of fit are called residuals.
A residual is the distance between a data point and its predicted value on the regression line, i.e., how far the observed Y falls from the predicted Y.
A larger distance indicates a greater error of fit.
Outliers
Outliers are data points that appear to belong to a different population than the one being studied.
Identifying and addressing outliers ensures an accurate assessment of the population of interest.
The methods are applied consistently to all datasets using a pre-defined protocol.
Calculating Outlying Statistics
Many methods exist to calculate outlying statistics, but the focus will be on four key measures.
Leverage Statistic: Measures how far away a data point falls on the horizontal (X) axis from the mean of X.
Z Residual Statistic: Measures how far away a data point falls on the vertical (Y) axis from the predicted Y.
Covariance Ratio: Measures how far away a score falls from the diagonal.
Mahalanobis Distance Statistic: Considers all three previous measures in one formula and is the primary method used.
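Two of the measures above, leverage and Z residuals, can be computed directly from the regression. A minimal sketch with numpy, assuming a simple linear regression; the data and variable names here are illustrative, not from the lecture materials:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=30)
y = 2.0 * x + rng.normal(size=30)

# Design matrix with an intercept column.
X = np.column_stack([np.ones_like(x), x])

# Hat matrix H = X (X'X)^-1 X'; its diagonal gives each case's leverage,
# i.e., how far that case sits from the mean of X.
H = X @ np.linalg.inv(X.T @ X) @ X.T
leverage = np.diag(H)

# Fitted values and residuals.
beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta

# Standardized (Z) residuals: each residual divided by the SD of the errors.
z_resid = resid / resid.std(ddof=2)

print(leverage[:3], z_resid[:3])
```

The leverages always sum to the number of estimated coefficients (here 2), which is a quick sanity check on the hat-matrix calculation.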
Mahalanobis Distance
The Mahalanobis distance statistic will be the primary tool for identifying outliers.
It is a well-researched and respected measure.
Tutorials will demonstrate how to run a regression using software to calculate the Mahalanobis distance.
Standardized Residuals
Standardizing residuals provides more meaningful information.
A Z score indicates how many standard deviations a score is above or below the mean.
For example, a Z score of 1 means the score is one standard deviation above the mean, higher than approximately 84% of the population.
Using 1.96 or 2 as a Standard Deviation
Values of 1.96 (often rounded to 2) can be used as a cutoff to judge how far scores fall from the rest of the population.
This cutoff corresponds to a 95% confidence interval.
Standardizing helps identify when scores may belong to other populations.
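The percentages above come directly from the standard normal distribution. A quick check using scipy's normal CDF (assuming scipy is available; the values themselves are standard):

```python
from scipy.stats import norm

# Share of a normal population falling below a given Z score.
print(round(norm.cdf(1.0), 4))   # Z = 1 exceeds about 84% of the population
print(round(norm.cdf(1.96), 3))  # only 2.5% of scores lie beyond +1.96
```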
Formula for Standardized Residuals
The formula for standardized residuals is: (Y - Predicted Y) / (Standard Deviation of the Errors)
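The formula translates directly into code. A minimal sketch, with illustrative data:

```python
import numpy as np

y = np.array([3.1, 4.9, 7.2, 8.8, 11.1])       # observed Y
y_pred = np.array([3.0, 5.0, 7.0, 9.0, 11.0])  # predicted Y

errors = y - y_pred
# (Y - Predicted Y) / (Standard Deviation of the Errors)
z_resid = errors / errors.std(ddof=1)
print(z_resid)
```

By construction the standardized residuals have a standard deviation of 1, so values near ±2 mark unusually large errors of fit.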
Scatter Plot of Residual Scores
A scatter plot of residual scores (Z residual vs. Z predicted scores) is the first step in identifying outliers.
A box is drawn around the data, typically at 1.96 (approximately 2) standard deviations from the mean in both the X and Y axes.
This corresponds to 2.5% in each tail (an alpha of 5% two-tailed), marking extreme scores.
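The "box" test amounts to flagging any case whose standardized predicted score or standardized residual exceeds ±1.96 in absolute value. A numeric sketch (plotting omitted; the scores below are illustrative):

```python
import numpy as np

# Standardized predicted scores (X axis) and standardized residuals (Y axis).
z_pred = np.array([-0.4, 0.1, 2.3, -1.0, 0.8])
z_resid = np.array([0.2, -2.5, 0.3, 1.1, -0.6])

# A case is outside the box if either coordinate passes the 1.96 bound.
outside_box = (np.abs(z_pred) > 1.96) | (np.abs(z_resid) > 1.96)
print(outside_box)  # → [False  True  True False False]
```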
Visual Inspection and Prima Facie Case
Visual inspection is performed before applying statistical formulas to identify outliers.
This provides a prima facie case and avoids any appearance of fraudulent data manipulation.
If scores fall clearly outside the box, that supports running a statistical test to assess for outlying scores.
Regression without Outliers
Identifying outliers allows for re-running the regression without these points.
This may provide a better indication of the real relationship within the population of interest.
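A small sketch of the re-run, assuming a simple regression; the data, the residual cutoff of 2.5, and the helper function are illustrative:

```python
import numpy as np

def r_squared(x, y):
    # R^2 for a straight-line fit: 1 - (residual variance / total variance).
    beta = np.polyfit(x, y, 1)
    resid = y - np.polyval(beta, x)
    return 1 - resid.var() / y.var()

x = np.array([1., 2., 3., 4., 5., 6.])
y = np.array([1.1, 2.0, 2.9, 4.2, 5.0, 0.0])  # the last case looks aberrant

# Keep only cases whose residual from the full fit is within a cutoff.
resid = y - np.polyval(np.polyfit(x, y, 1), x)
keep = np.abs(resid) < 2.5

print(r_squared(x, y), r_squared(x[keep], y[keep]))
```

With the aberrant case removed, the refit tracks the remaining points closely and R^2 rises sharply, which is the "better indication of the real relationship" the notes describe.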
Steps in Residual Plot
The first step is the residual plot.
Bounding around 1.96 on the horizontal and vertical axes indicates whether to proceed with outlier removal using a specific formula.
Four Approaches
Leverage
Z Residuals
Covariance Ratio
Mahalanobis Distance
Using Mahalanobis Distance Calculations
Running Mahalanobis distance calculations results in two new columns in the data: Mahalanobis distance for each score and the p-value of that score.
Data with p-values less than 0.001 may be excluded.
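The two new columns can be reproduced by hand. A sketch, assuming numpy and scipy are available; the data, the planted outlier, and the chi-square approximation for the p-value are illustrative rather than the lecture's exact software output:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(1)
data = rng.normal(size=(50, 2))   # columns: X and Y scores
data[0] = [6.0, -6.0]             # plant an obvious outlier

# Mahalanobis distance of each case from the centroid.
center = data.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(data, rowvar=False))
diff = data - center
d2 = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)  # squared distances

# Under normality, d2 is roughly chi-square with df = number of variables,
# which yields a p-value per case.
p = chi2.sf(d2, df=data.shape[1])
exclude = p < 0.001
print(exclude.sum())
```

Each row thus gets a distance and a p-value, and rows with p below 0.001 are candidates for exclusion.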
Identifying the Basis of Outliers
Scores are assessed to determine if they are outliers based on residual, leverage, or covariance ratio.
Concepts Behind Mahalanobis Distance
Mahalanobis distance is an offshoot of the Euclidean approach.
The Euclidean approach identifies the centroid (the region marked with an X), which represents the center of the data considering the X and Y variables together.
Euclidean distance measures how far a score falls from the centroid in any direction.
Mahalanobis distance additionally considers the regression line, measuring how far a score falls relative to that line.
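This difference can be made concrete: when X and Y are strongly related, two points equally far from the centroid in Euclidean terms get very different Mahalanobis distances depending on whether they sit along or off the trend. A sketch with an illustrative covariance matrix:

```python
import numpy as np

cov = np.array([[1.0, 0.9],
                [0.9, 1.0]])   # strong positive X-Y relationship
cov_inv = np.linalg.inv(cov)
center = np.zeros(2)

on_trend = np.array([2.0, 2.0])    # along the regression direction
off_trend = np.array([2.0, -2.0])  # same Euclidean distance, off the line

def mahal(p):
    d = p - center
    return float(np.sqrt(d @ cov_inv @ d))

print(mahal(on_trend), mahal(off_trend))
```

Both points are the same Euclidean distance from the centroid, yet the off-trend point's Mahalanobis distance is several times larger, which is exactly why the measure is better at flagging scores that violate the fitted relationship.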
Multiple Predictors
Mahalanobis distance is especially useful with multiple predictors (multiple X's).
The centroid represents the simultaneous consideration of all X's.
Effect on Significance
Removing outliers reduces the sample size, which can make it harder to achieve statistical significance due to reduced power.
However, maintaining good scientific practices and ensuring data accuracy are more important than achieving significance.
Impact of Removing a Case
Removing a case can increase or decrease the R^2 and impact the F statistic.
The impact depends on the nature of the outlier (leverage, standardized residual).
Removing a case with high leverage but a low standardized residual could decrease the R^2.
Each case is unique.
Main Goal
The main goal is to clean the data properly so that the analysis rests on accurate data.
Effects of F statistic
Removing a case impacts R^2, which in turn impacts F. The F statistic helps identify whether the regression is significant.
In simple regression, the F statistic equals the squared t statistic.
R^2 tells us the magnitude of the effect size.
Removing cases also reduces the sample size, which can lower the F value even when the change in R^2 is small.
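The F-to-t^2 relationship and the role of sample size can both be seen in the formula F = [R^2 / 1] / [(1 - R^2) / (n - 2)] for a single predictor. A sketch using scipy's linregress on illustrative data:

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(2)
n = 25
x = rng.normal(size=n)
y = 1.5 * x + rng.normal(size=n)

fit = linregress(x, y)
t = fit.slope / fit.stderr       # t statistic of the slope
r2 = fit.rvalue ** 2
F = r2 / (1 - r2) * (n - 2)      # overall F for a one-predictor regression

print(round(F, 4), round(t * t, 4))  # the two values coincide
```

The (n - 2) term also shows why dropping cases pulls F down: with fewer cases, the same R^2 produces a smaller F.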
Conclusion
Outliers are complex: removing them can increase or decrease R^2 and significance.
Maintaining good scientific methods ensures conclusions are valid and reliable.
Mahalanobis distance builds on Euclidean approaches but uses the line of best fit in conjunction with the centroid.