Outlier Analysis Notes

Outliers: Detection, Effects, and Solutions

Definition of Outliers

  • Outliers are unusual values in a dataset that can distort statistical analyses and violate their assumptions.
  • Analysts must decide how to handle outliers.
  • Removing outliers is only appropriate for specific reasons.
  • Outliers can provide valuable information about the subject area and data collection process.
  • It's crucial to understand how outliers occur and whether they might reoccur as a normal part of the process or study area.
  • Resisting the temptation to remove outliers inappropriately can be challenging because they increase variability and reduce statistical power.
  • Excluding outliers can artificially cause results to become statistically significant.

Finding Outliers

  • Outliers are data points far from other data points, representing unusual values.
  • They can cause statistical tests to miss significant findings or distort real results.
  • No strict statistical rules exist for definitively identifying outliers.
  • Identifying outliers depends on subject-area knowledge and an understanding of the data collection process.
  • While there's no solid mathematical definition, guidelines and statistical tests can help identify potential outlier candidates.

Impact of Outliers

  • Outliers are notably different values that can cause problems in statistical procedures.
  • A single outlier can significantly affect results.

Example Dataset

  • A dataset of 15 height measurements of human males is used, including one outlier.
  • The table below shows the mean height and standard deviation WITH and WITHOUT the outlier:
                          WITH Outlier    WITHOUT Outlier    Difference (in m)
    Mean                  2.40            1.80               0.60
    Standard Deviation    2.33            0.14               2.16
  • A single value changes
    • the mean height by 0.6 m (about 2 feet)
    • the standard deviation by 2.16 m (about 7 feet).
  • Hypothesis tests using the mean with the outlier will be inaccurate.
  • The larger standard deviation will severely reduce statistical power.
  • Potential outliers should be identified before performing statistical analyses.
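
A minimal sketch of this kind of comparison, using only Python's standard library. The heights below are hypothetical values chosen for illustration (they are not the original 15 measurements), and `suspect_value` stands in for a single implausible entry.

```python
import statistics

# Hypothetical heights in metres; NOT the dataset summarized in the table.
typical_heights = [1.60, 1.65, 1.70, 1.72, 1.75, 1.78, 1.80,
                   1.80, 1.82, 1.85, 1.88, 1.90, 1.95, 2.00]
suspect_value = 10.8  # hypothetical implausible entry, e.g. a data entry error

with_outlier = typical_heights + [suspect_value]


def summarize(values):
    """Return (mean, sample standard deviation) for a list of numbers."""
    return statistics.mean(values), statistics.stdev(values)


mean_with, sd_with = summarize(with_outlier)
mean_without, sd_without = summarize(typical_heights)

print(f"mean: {mean_with:.2f} with, {mean_without:.2f} without")
print(f"sd:   {sd_with:.2f} with, {sd_without:.2f} without")
```

Even in this small made-up sample, one value moves both summary statistics far more than the other 14 values combined.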

Ways to Find Outliers

  • Various methods exist for finding values that are unusual compared to the rest of the dataset.

Sorting Your Datasheet

  • Sorting the datasheet for each variable helps highlight unusual values.
  • Sorting makes unusually high or low values quick to spot, but it does not quantify how unusual they are.
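
The sorting approach can be sketched in a few lines of Python. The records and column names (`height_m`, `weight_kg`) are hypothetical, and `extremes` is an illustrative helper, not part of any library.

```python
# Hypothetical records; the column names are illustrative only.
records = [
    {"id": 1, "height_m": 1.78, "weight_kg": 72.0},
    {"id": 2, "height_m": 1.82, "weight_kg": 80.5},
    {"id": 3, "height_m": 1.69, "weight_kg": 65.2},
    {"id": 4, "height_m": 10.8, "weight_kg": 77.1},
    {"id": 5, "height_m": 1.75, "weight_kg": 7.9},
    {"id": 6, "height_m": 1.91, "weight_kg": 88.4},
]


def extremes(rows, column, k=3):
    """Sort rows by one column and return (k lowest rows, k highest rows)."""
    ordered = sorted(rows, key=lambda row: row[column])
    return ordered[:k], ordered[-k:]


# Sort once per variable and inspect both ends of each ordering.
for column in ("height_m", "weight_kg"):
    lowest, highest = extremes(records, column, k=2)
    print(column, "lowest:", [(r["id"], r[column]) for r in lowest])
    print(column, "highest:", [(r["id"], r[column]) for r in highest])
```

The output only puts candidate values at the ends of the list; deciding whether they are outliers still requires judgment.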

Graphing Your Data

  • Boxplots, histograms, and scatterplots can highlight outliers.
Boxplots
  • Boxplots explicitly indicate outliers using asterisks or other symbols, based on the interquartile method with fences.
  • Boxplots can also be used to find outliers when you have groups in your data.
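
A sketch of the fence calculation behind boxplot outlier markers. It assumes the conventional multiplier of 1.5 times the interquartile range, which these notes do not specify, and it uses `statistics.quantiles` with its default method; other software may compute quartiles slightly differently. The heights are hypothetical.

```python
import statistics


def iqr_fences(values, k=1.5):
    """Return (lower fence, upper fence) as Q1 - k*IQR and Q3 + k*IQR."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers(values, k=1.5):
    """Return the values that fall outside the fences."""
    lower, upper = iqr_fences(values, k)
    return [v for v in values if v < lower or v > upper]


# Hypothetical heights in metres with one implausible entry.
heights = [1.60, 1.65, 1.70, 1.72, 1.75, 1.78, 1.80, 1.80,
           1.82, 1.85, 1.88, 1.90, 1.95, 2.00, 10.8]

print("fences:", iqr_fences(heights))
print("flagged:", flag_outliers(heights))
```

For grouped data, the same function can be applied to each group's values separately.
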
Histograms
  • Histograms emphasize the existence of outliers by displaying isolated bars.
Scatterplots
  • Scatterplots can detect outliers in a multivariate setting.
  • An observation can be an outlier because it doesn’t fit the model, even if its individual input or output values are not unusual.
  • This type of outlier can be a problem in regression analysis.
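  • Multiple regression has several types of unusual observations, including points with large residuals, high leverage, or high influence on the fitted coefficients.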
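
To illustrate an observation that does not fit the model even though neither of its values is unusual on its own, here is a rough sketch with a simple least-squares line. The data are hypothetical, and the cutoff of 2 residual standard deviations is an assumed rule of thumb, not a formal test.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def flag_poor_fits(xs, ys, threshold=2.0):
    """Return (x, y) points whose residual exceeds threshold * residual SD."""
    slope, intercept = fit_line(xs, ys)
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    # Residual standard deviation with n - 2 degrees of freedom.
    resid_sd = math.sqrt(sum(r * r for r in residuals) / (len(xs) - 2))
    return [(x, y) for x, y, r in zip(xs, ys, residuals)
            if abs(r) > threshold * resid_sd]


# Hypothetical data: y is roughly 2x, plus one point (5, 16.0) that sits
# inside the range of both x and y but far from the line.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 5]
ys = [2.1, 3.9, 6.2, 8.0, 9.9, 12.1, 14.0, 15.8, 18.1, 20.0, 16.0]

print(flag_poor_fits(xs, ys))
```

Sorting either column alone would not reveal this point, because both 5 and 16.0 lie well within the observed ranges.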

Causes of Outliers

  • Outliers can arise from various sources, including errors and natural variation.

Data Entry and Measurement Errors

  • Errors can occur during measurement and data entry.
  • Typos can produce implausible values, such as a misplaced decimal point or an extra digit.
  • If an outlier value is determined to be an error, correct it if possible.
  • If correction is not possible, the data point must be deleted because it’s known to be incorrect.

Sampling Problems

  • Inferential statistics use samples to draw conclusions about a specific population.
  • Studies should carefully define a population and draw a random sample from it.
  • A study might accidentally obtain an item or person that is not from the target population.
  • Unusual events or characteristics can make an observation deviate from the defined population.
  • The experimenter might measure the item or subject under abnormal conditions.
Examples of Sampling Problems
  • A study assessing product strength defines the population as the output of the standard manufacturing process. Abnormal conditions (e.g., power failure, machine setting drifting) can cause outliers.
  • These outliers can be legitimately removed because they do not reflect the target population.
  • In a bone density study, a subject with diabetes was identified as an outlier and excluded because the study aimed to model bone density growth in pre-adolescent girls without health conditions affecting bone growth.
  • If it can be established that an item or person does not represent the target population, the data point can be removed, provided there is a specific cause or reason.

Natural Variation

  • Natural variation can also produce outliers, which is not necessarily a problem.
  • All data distributions have a spread of values, and extreme values can occur with lower probabilities.
  • Large sample sizes are more likely to contain unusual values.
  • In a normal distribution, approximately 1 in 370 observations falls at least three standard deviations away from the mean.
  • The process or population being studied might naturally produce unusual values, which are a normal part of the data distribution.
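
The tail probability for a normal distribution, and why larger samples are more likely to contain extreme values, can be checked with a short calculation. This sketch assumes the data are exactly normally distributed; the function names are illustrative.

```python
import math


def two_sided_tail_probability(k):
    """P(|Z| >= k) for a standard normal variable Z."""
    return math.erfc(k / math.sqrt(2))


def expected_extreme_count(sample_size, k=3.0):
    """Expected number of observations at least k SDs from the mean."""
    return sample_size * two_sided_tail_probability(k)


p = two_sided_tail_probability(3.0)
print(f"P(|Z| >= 3) = {p:.5f}, about 1 in {1 / p:.0f}")

for n in (100, 1_000, 10_000):
    print(f"n = {n:>6}: expected count beyond 3 SD = "
          f"{expected_extreme_count(n):.2f}")
```
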
Example of Natural Variation Causing an Outlier
  • A model using historical U.S. Presidential approval ratings to predict historian ranks is affected by President Truman, who had an extremely low approval rating but a relatively good historian rank.
  • Removing this data point improves the model fit but is not justifiable because it reflects the potential surprises and uncertainty inherent in the political system.
  • It’s bad practice to remove data points simply to produce a better fitting model or statistically significant results.
  • If an extreme value is a legitimate observation and a natural part of the population, it should be included in the dataset.

Guidelines for Dealing with Outliers

  • Sometimes it’s best to keep outliers in the data because they can capture valuable information.
  • Excluding extreme values solely due to their extremeness can distort results by removing information about inherent variability.
  • Evaluate whether an outlier appropriately reflects the target population, subject area, research question, and research methodology.
  • Consider whether anything unusual happened during measurement or if there is anything substantially different about the observation.
  • Check for measurement or data entry errors.

Actions Based on Outlier Type

  • Measurement or data entry error: Correct if possible; otherwise, remove the observation.
  • Not part of the population: Remove the outlier.
  • Natural part of the population: Do not remove.
  • When removing outliers, document the excluded data points and explain the reasoning.
  • Be able to attribute a specific cause to each outlier you remove.
  • Alternatively, perform the analysis with and without the outliers and discuss the differences, especially when unsure about removing an outlier or when there is disagreement within a group.
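
One possible way to keep exclusions documented while running the analysis both ways is sketched below. The measurements, the exclusion reason, and the helper names are all hypothetical; a real analysis would substitute its own statistics for the summary shown here.

```python
import math
import statistics


def summarize(values):
    """Basic summary: n, mean, sample SD, and standard error of the mean."""
    n = len(values)
    sd = statistics.stdev(values)
    return {"n": n, "mean": statistics.mean(values),
            "sd": sd, "se": sd / math.sqrt(n)}


def sensitivity_report(values, exclusions):
    """Summarize the data with and without documented exclusions.

    `exclusions` maps an index in `values` to the stated reason for removal.
    """
    kept = [v for i, v in enumerate(values) if i not in exclusions]
    return {
        "with_outliers": summarize(values),
        "without_outliers": summarize(kept),
        "excluded": [(i, values[i], reason)
                     for i, reason in sorted(exclusions.items())],
    }


# Hypothetical strength measurements; index 5 is the suspect observation.
strengths = [50.1, 49.8, 50.4, 49.9, 50.2, 31.0, 50.0, 49.7]
report = sensitivity_report(
    strengths, {5: "power failure during production (hypothetical reason)"})

for key in ("with_outliers", "without_outliers"):
    print(key, {k: round(v, 2) for k, v in report[key].items()})
print("excluded:", report["excluded"])
```

Keeping the reason next to each excluded value makes it straightforward to report both sets of results and justify the removal.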