Outlier Analysis Notes
Outliers: Detection, Effects, and Solutions
Definition of Outliers
- Outliers are unusual values in a dataset that can distort statistical analyses and violate their assumptions.
- Analysts must decide how to handle outliers.
- Removing outliers is only appropriate for specific reasons.
- Outliers can provide valuable information about the subject area and data collection process.
- It's crucial to understand how outliers occur and whether they might reoccur as a normal part of the process or study area.
- Resisting the temptation to remove outliers can be hard: outliers increase the variability in the data, which reduces statistical power.
- Excluding outliers can artificially cause results to become statistically significant.
Finding Outliers
- Outliers are data points far from other data points, representing unusual values.
- They can cause statistical tests to miss significant findings or distort real results.
- No strict statistical rules exist for definitively identifying outliers.
- Identifying outliers depends on subject-area knowledge and an understanding of the data collection process.
- While there's no solid mathematical definition, guidelines and statistical tests can help identify potential outlier candidates.
Impact of Outliers
- Outliers are notably different values that can cause problems in statistical procedures.
- A single outlier can significantly affect results.
Example Dataset
- A dataset of 15 height measurements of human males is used, including one outlier.
- The table below shows the mean height and standard deviation WITH and WITHOUT the outlier:
| | WITH Outlier | WITHOUT Outlier | Difference (in m) |
|---|---|---|---|
| Mean | 2.40 | 1.80 | 0.60 |
| Standard Deviation | 2.33 | 0.14 | 2.19 |
- A single value changes the mean height by 0.6m (about 2 feet) and the standard deviation by 2.19m (about 7 feet).
- Hypothesis tests using the mean with the outlier will be inaccurate.
- The larger standard deviation will severely reduce statistical power.
- Potential outliers should be identified before performing statistical analyses.
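The effect of a single extreme value on the mean and standard deviation can be checked with a short script. This is a minimal sketch using made-up heights, not the dataset behind the table; the value `10.8` is an assumed stand-in for an implausible entry, and only Python's standard `statistics` module is used.

```python
import statistics

# Hypothetical heights in metres; illustrative values only,
# not the data behind the table above.
heights = [1.62, 1.68, 1.71, 1.74, 1.76, 1.78, 1.80,
           1.81, 1.83, 1.85, 1.88, 1.91, 1.95, 2.01]
outlier = 10.8  # assumed implausible entry, e.g. the result of a typo

def summarize(values):
    """Return (mean, sample standard deviation) for a list of numbers."""
    return statistics.mean(values), statistics.stdev(values)

mean_without, sd_without = summarize(heights)
mean_with, sd_with = summarize(heights + [outlier])

print(f"without outlier: mean={mean_without:.2f} m, sd={sd_without:.2f} m")
print(f"with outlier:    mean={mean_with:.2f} m, sd={sd_with:.2f} m")
print(f"difference:      mean={mean_with - mean_without:.2f} m, "
      f"sd={sd_with - sd_without:.2f} m")
```

Running the summary both ways before any hypothesis test makes it obvious when one observation is driving the results.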
Ways to Find Outliers
- Various methods exist for finding values that are unusual compared to the rest of the dataset.
Sorting Your Datasheet
- Sorting the datasheet for each variable helps highlight unusual values.
- It allows quick identification of unusually high or low values, though it doesn't quantify how unusual they are.
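Outside a spreadsheet, the same check amounts to sorting one variable and looking at both ends. A small sketch, where the helper name `extremes` and the data are hypothetical:

```python
def extremes(values, k=3):
    """Return the k smallest and k largest values after sorting."""
    ordered = sorted(values)
    return ordered[:k], ordered[-k:]

# Hypothetical measurements for one variable (assumed data).
measurements = [1.78, 1.81, 1.75, 10.8, 1.92, 1.69, 1.84, 1.80]
lowest, highest = extremes(measurements)
print("lowest: ", lowest)
print("highest:", highest)
```

As with sorting a datasheet, this only surfaces candidates; judging whether a value is truly unusual still requires context.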
Graphing Your Data
- Boxplots, histograms, and scatterplots can highlight outliers.
Boxplots
- Boxplots mark outliers explicitly with asterisks or other symbols; a point is flagged when it falls outside fences computed from the interquartile range.
- Boxplots can also be used to find outliers when you have groups in your data.
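The fence calculation behind a boxplot can be sketched as follows. This assumes the common convention of placing fences 1.5 interquartile ranges beyond the first and third quartiles; the multiplier `k`, the quartile method, and the grouped data are assumptions, and plotting software may compute quartiles slightly differently.

```python
import statistics

def iqr_fences(values, k=1.5):
    """Return (lower_fence, upper_fence) from the quartiles and the IQR."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def flag_outliers(values, k=1.5):
    """Return the values that fall outside the fences."""
    lower, upper = iqr_fences(values, k)
    return [v for v in values if v < lower or v > upper]

# Hypothetical groups (assumed data) to show per-group flagging.
groups = {
    "A": [4.1, 4.4, 4.6, 4.8, 5.0, 5.1, 5.3, 9.7],
    "B": [6.0, 6.2, 6.3, 6.5, 6.6, 6.8, 7.0, 7.1],
}
for name, values in groups.items():
    print(name, flag_outliers(values))
```

Computing fences separately for each group mirrors what a grouped boxplot does: a value is judged against its own group rather than the pooled data.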
Histograms
- Histograms emphasize the existence of outliers by displaying isolated bars.
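A rough text histogram shows what an isolated bar looks like. This is only a sketch: the function name, the bin width, and the data are assumptions, and a plotting library would normally be used instead.

```python
from collections import Counter

def text_histogram(values, bin_width):
    """Print one row per bin (empty bins included) and return the rows."""
    counts = Counter(int(v // bin_width) for v in values)
    rows = []
    for b in range(min(counts), max(counts) + 1):
        left = b * bin_width           # left edge of the bin
        count = counts.get(b, 0)
        rows.append((left, count))
        print(f"{left:6.1f} | {'#' * count}")
    return rows

# Hypothetical data: most values cluster between 10 and 25, one sits near 48.
data = [12, 14, 15, 15, 16, 17, 18, 19, 21, 22, 48]
rows = text_histogram(data, bin_width=5)
```

The run of empty bins between the main cluster and the last bar is the visual cue a histogram provides.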
Scatterplots
- Scatterplots can detect outliers in a multivariate setting.
- An observation can be an outlier because it doesn’t fit the model, even if its individual input or output values are not unusual.
- This type of outlier can be a problem in regression analysis.
- Multivariate regression has numerous types of outliers.
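A point that does not fit the model can be found by looking at residuals rather than at the raw values. The sketch below fits a simple least-squares line by hand and reports the observation with the largest residual; the data are invented so that one point has an ordinary x and an ordinary y but breaks the trend.

```python
import statistics

def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def largest_residual(xs, ys):
    """Return (index, residual) of the point farthest from the fitted line."""
    intercept, slope = fit_line(xs, ys)
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    index = max(range(len(residuals)), key=lambda i: abs(residuals[i]))
    return index, residuals[index]

# Hypothetical data: y is roughly 2 * x, except at x = 5 where y = 3.0.
# Neither 5 nor 3.0 is extreme on its own axis.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [2.1, 3.9, 6.2, 8.1, 3.0, 12.2, 13.8, 16.1, 18.0, 19.9]
index, residual = largest_residual(xs, ys)
print(f"point {index}: x={xs[index]}, y={ys[index]}, residual={residual:.2f}")
```

Sorting either variable alone would not reveal this observation, which is why a scatterplot or a residual check is needed.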
Causes of Outliers
- Outliers can arise from various sources, including errors and natural variation.
Data Entry and Measurement Errors
- Errors can occur during measurement and data entry.
- Typos can produce weird values.
- If an outlier value is determined to be an error, correct it if possible.
- If correction is not possible, the data point must be deleted because it’s known to be incorrect.
Sampling Problems
- Inferential statistics use samples to draw conclusions about a specific population.
- Studies should carefully define a population and draw a random sample from it.
- A study might accidentally obtain an item or person that falls outside the target population.
- Unusual events or characteristics can occur that deviate from the defined population.
- The experimenter might measure the item or subject under abnormal conditions.
Examples of Sampling Problems
- A study assessing product strength defines the population as the output of the standard manufacturing process. Abnormal conditions (e.g., power failure, machine setting drifting) can cause outliers.
- These outliers can be legitimately removed because they do not reflect the target population.
- In a bone density study, a subject with diabetes was identified as an outlier and excluded because the study aimed to model bone density growth in pre-adolescent girls without health conditions affecting bone growth.
- If it can be established that an item or person does not represent the target population, the data point can be removed, provided there is a specific cause or reason.
Natural Variation
- Natural variation can also produce outliers, which is not necessarily a problem.
- All data distributions have a spread of values, and extreme values can occur with lower probabilities.
- Large sample sizes are more likely to contain unusual values.
- In a normal distribution, approximately 1 in 370 observations falls at least three standard deviations away from the mean.
- The process or population being studied might naturally produce weird values, which are a normal part of the data distribution.
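The tail probability of a normal distribution, and how quickly the chance of seeing an extreme value grows with sample size, can be computed directly. This sketch assumes independent draws from an exactly normal distribution, which real data only approximate; the function names are illustrative.

```python
import math

def two_sided_tail(z):
    """P(|Z| >= z) for a standard normal variable Z."""
    return math.erfc(z / math.sqrt(2))

def prob_at_least_one(n, z=3.0):
    """Chance that n independent normal draws include at least one value
    z or more standard deviations from the mean."""
    return 1 - (1 - two_sided_tail(z)) ** n

p = two_sided_tail(3.0)
print(f"P(|Z| >= 3) = {p:.5f}, about 1 in {1 / p:.0f}")
for n in (30, 300, 3000):
    print(f"n={n}: P(at least one) = {prob_at_least_one(n):.3f}")
```

With a few thousand observations, at least one such value is nearly guaranteed, so its presence alone says nothing about data quality.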
Example of Natural Variation Causing an Outlier
- A model using historical U.S. Presidential approval ratings to predict historian ranks is affected by President Truman, who had an extremely low approval rating but a relatively good historian rank.
- Removing this data point improves the model fit but is not justifiable because it reflects the potential surprises and uncertainty inherent in the political system.
- It’s bad practice to remove data points simply to produce a better fitting model or statistically significant results.
- If an extreme value is a legitimate observation and a natural part of the population, it should be included in the dataset.
Guidelines for Dealing with Outliers
- Sometimes it’s best to keep outliers in the data because they can capture valuable information.
- Excluding extreme values solely due to their extremeness can distort results by removing information about inherent variability.
- Evaluate whether an outlier appropriately reflects the target population, subject area, research question, and research methodology.
- Consider whether anything unusual happened during measurement or if there is anything substantially different about the observation.
- Check for measurement or data entry errors.
Actions Based on Outlier Type
- Measurement or data entry error: Correct if possible; otherwise, remove the observation.
- Not part of the population: Remove the outlier.
- Natural part of the population: Do not remove.
- When removing outliers, document the excluded data points and explain the reasoning.
- Be able to attribute a specific cause for removing outliers.
- Alternatively, perform the analysis with and without the outliers to discuss the differences, especially when unsure about removing an outlier or when there is disagreement within a group.
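Documenting exclusions and comparing results with and without them can be handled in one step. The following is a sketch under stated assumptions: the function name, the strength values, and the exclusion reason are all hypothetical, and a real analysis would compare whatever statistics or models the study actually uses.

```python
import statistics

def compare_with_without(values, exclusions):
    """Summarize data with and without documented exclusions.

    `exclusions` maps an index in `values` to the reason for removal,
    so every excluded point carries an explicit justification.
    """
    kept = [v for i, v in enumerate(values) if i not in exclusions]
    return {
        "with": (statistics.mean(values), statistics.stdev(values)),
        "without": (statistics.mean(kept), statistics.stdev(kept)),
        "excluded": [(i, values[i], reason) for i, reason in exclusions.items()],
    }

# Hypothetical strength measurements; index 5 is assumed to have been
# produced during a power failure.
strengths = [50.2, 49.8, 51.1, 50.5, 49.6, 31.4, 50.9, 50.0]
report = compare_with_without(strengths, {5: "power failure during production"})

for label in ("with", "without"):
    mean, sd = report[label]
    print(f"{label:8s} mean={mean:.2f} sd={sd:.2f}")
for index, value, reason in report["excluded"]:
    print(f"excluded index {index} (value {value}): {reason}")
```

Requiring a reason for every excluded index keeps the removal tied to a specific cause rather than to the value's extremeness.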