
Diagnosing and Cleaning Data Notes 9/2

Data Preparation

  • Data collection: systematic gathering of relevant data from various sources.
  • Data wrangling (data preprocessing): cleaning, transforming, and integrating data to facilitate effective analysis and storytelling.
  • Data management: storage of data to make it available for analysis.

Data Wrangling: Clean, Transform, and Integrate

  • CLEANING DATA: fix or remove incomplete, abnormal, or inconsistent data to improve quality for accurate analysis and storytelling.
  • TRANSFORMING DATA: organize and format data into a structure useful for efficient analysis.
  • INTEGRATING DATA: merge multiple data sources to provide comprehensive insights beyond any single source.
  • This module focuses primarily on CLEANING DATA.

Why diagnose and clean data?

  • Cleaning/diagnosing data is one of the most important and time-consuming steps in analysis.
  • Garbage In, Garbage Out: flawed data → flawed results.
  • Diagnosing and cleaning data:
    • Ensures data accuracy and reliability for analysis
    • Prevents errors and biases that can lead to incorrect conclusions

Cleaning Data with Spreadsheet Functions

  • Use spreadsheet functions to streamline cleaning processes.
  • Sorting data: organizing information in a specific order for better analysis.
  • Filtering data: focus on specific criteria or remove unwanted data entries.
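
  The same sort-and-filter ideas carry over to code. A minimal pandas sketch (the DataFrame df and the columns "state" and "age" are hypothetical examples, not from these notes):

      import pandas as pd

      df = pd.DataFrame({
          "state": ["OH", "TX", "OH", "TX"],
          "age":   [21, 47, 23, 19],
      })

      # Sorting: organize rows in a specific order for inspection
      df_sorted = df.sort_values("age", ascending=False)

      # Filtering: keep only rows that meet specific criteria
      adults_in_tx = df[(df["age"] >= 21) & (df["state"] == "TX")]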

What is data cleaning?

  • Data cleaning: fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset.
  • TYPES OF ERRORS DIAGNOSIS AND CLEANING WE WILL DO:
    • Identifying and imputing missing values
    • Identifying and imputing extreme outlier values and errors
    • Identifying the shape of the variable's distribution
    • Correcting misspellings
    • Deleting duplicate records and/or records with missing dependent data
  • Data Cleaning is iterative and often occurs multiple times throughout the data science lifecycle.

Understanding Variable Distribution

  • Shapes: distribution shapes (e.g., symmetric, skewed) affect interpretation and modeling.
  • Symmetry vs Skewness: determines appropriate measures of center and spread and whether transformation is needed for relationships/models.

Quantitative Distribution Shapes

  • Understanding distribution shape helps interpret data, choose appropriate measures, and decide on transformations when modeling.
  • Key distinctions: symmetry vs skewness.
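
  One quick numeric check of symmetry vs. skewness is sample skewness: values near 0 suggest a roughly symmetric distribution, while large positive or negative values suggest right or left skew. A minimal pandas sketch (the toy data and the ±1 cutoff are illustrative rules of thumb, not from these notes):

      import pandas as pd

      incomes = pd.Series([30, 32, 35, 36, 38, 40, 41, 43, 120])  # toy data

      skew = incomes.skew()  # sample skewness statistic
      if abs(skew) < 1:
          print(f"skew={skew:.2f}: roughly symmetric; mean may be appropriate")
      else:
          print(f"skew={skew:.2f}: skewed; prefer median-based summaries")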

Describing the distribution of a quantitative variable

  • A histogram displays the distribution of a quantitative variable.
  • For a quantitative variable, a mode refers to a peak in the distribution (a central range of frequent values), not a single value; modality describes how many such peaks the distribution has (e.g., unimodal, bimodal).
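
  A minimal sketch of drawing a histogram in Python (pandas + matplotlib; the variable hours_sleep is a hypothetical example):

      import matplotlib.pyplot as plt
      import pandas as pd

      hours_sleep = pd.Series([6, 7, 7, 8, 5, 6, 7, 9, 8, 7, 6])  # toy data

      # Plot the distribution of a quantitative variable
      hours_sleep.plot(kind="hist", bins=5, edgecolor="black",
                       title="Distribution of hours of sleep")
      plt.xlabel("Hours of sleep")
      plt.show()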

Categorical Distribution Shapes

  • Understanding category distribution helps identify:
    • The mode (most frequent category) for imputation
    • Low-frequency/underrepresented categories that may need handling in downstream analyses

Missing Values

  • Missing values are expected data points that are absent.
  • Causes: data entry errors, equipment malfunctions, survey non-response, etc.
  • If not addressed, can bias results and produce inaccurate conclusions.
  • Missing values appear as blank cells or coded values such as 999 or NaN.
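
  A minimal pandas sketch of diagnosing missing values, including converting a sentinel code like 999 into a true NaN so it is counted as missing (column names are hypothetical):

      import numpy as np
      import pandas as pd

      df = pd.DataFrame({
          "age":   [21, 999, 34, None],
          "state": ["OH", "TX", None, "OH"],
      })

      # Treat the sentinel code 999 as a missing value
      df["age"] = df["age"].replace(999, np.nan)

      # Count missing values per column
      print(df.isna().sum())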

Strategies for Handling Missing Values

  • Deletion: remove an observational unit when a value is missing for a variable of interest; not the first choice, because the unit's other variables may still contain valuable information. Used only for missing values in the dependent variable.
  • Imputation with Internal Data: use data within the dataset to estimate missing values; univariate strategies for independent variables.
  • Imputation with External Data: use values from another dataset; requires credible relevance of external sources.

Deletion: Why not just delete?

  • Delete records ONLY when the missing value is in the dependent/target variable, since the target of interest should never be imputed.
  • Deletion can discard a lot of data and bias results if used inappropriately.
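
  A sketch of this deletion rule in pandas: drop a record only when the dependent/target variable is missing, and keep records whose independent variables can be imputed (the target column "price" is hypothetical):

      import pandas as pd

      df = pd.DataFrame({
          "sqft":  [900, 1200, 1500],
          "price": [100000, None, 180000],  # dependent variable
      })

      # Delete only the rows where the target itself is missing
      df = df.dropna(subset=["price"])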

Imputation with Internal Data

  • Univariate strategies (using the variable itself; see the sketch after this list):
    • Mean-Based Imputation: imputing with the average; appropriate for quantitative data with a symmetric distribution.
    • Median-Based Imputation: imputing with the central value; suitable for symmetric or skewed distributions; recommended for quantitative data in this course (Excel: =MEDIAN()).
    • Mode-Based Imputation: imputing with the most frequent value; for categorical variables (Excel: frequency table / PivotTable).
  • Bivariate strategies: use relationships with other variables (e.g., regression-based models, KNN) to inform imputations; often beyond basic scope but more reliable in practice.
  • Note: mean imputation can be inappropriate if distributions are not symmetric (example: ages with a few nontraditional students).
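
  A minimal pandas sketch of the univariate strategies above (column names are hypothetical; per these notes, median for quantitative and mode for categorical variables):

      import pandas as pd

      df = pd.DataFrame({
          "age":   [19, 21, None, 23, 47],
          "major": ["Bio", None, "Bio", "CS", "Bio"],
      })

      # Quantitative: median-based imputation (robust to skew)
      df["age"] = df["age"].fillna(df["age"].median())

      # Categorical: mode-based imputation (most frequent category)
      df["major"] = df["major"].fillna(df["major"].mode().iloc[0])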

Missing Value Scenarios

  • If a categorical variable has many missing values, reassign them to a new category, "missing" (see the sketch after this list).
  • If a quantitative variable has >30% missing, consider dropping the variable unless there is a reason to keep it.
  • Large number of missing values scenarios:
    • MCAR (Missing Completely At Random): missingness is random with no relation to other variables; use median-based univariate imputation or advanced ML techniques.
    • MAR (Missing At Random): missingness related to other variables but not the variable itself; use information from other variables for imputation.
    • MNAR (Missing Not At Random): missingness depends on the missing value itself (e.g., people with lower values are less likely to report them).
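
  A sketch of the first two rules above in pandas (column names and data are illustrative):

      import pandas as pd

      df = pd.DataFrame({
          "pet":   ["dog", None, "cat", None, None],
          "score": [None, None, 3.0, None, 2.0],
      })

      # Categorical with many missing values: make "missing" its own category
      df["pet"] = df["pet"].fillna("missing")

      # Quantitative with >30% missing: consider dropping the variable
      if df["score"].isna().mean() > 0.30:
          df = df.drop(columns=["score"])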

Missing at Random vs Missing Not at Random

  • Missing at Random (MAR): missingness related to observed variables but not the missing variable itself (e.g., weight missing more often by gender).
  • Missing Not at Random (MNAR): missingness related to the missing value or reason (e.g., lower education reporting less education).

Missing at Random

  • If data are MAR but no usable relationship to other variables can be identified, impute as:
    • Categorical: MODE (most frequent category).
    • Quantitative: MEDIAN (robust to non-normality).

Outliers

  • An outlier is an extremely high or low data point relative to the rest of the data.
  • Types:
    • Data entry error: implausible value from data collection.
    • Anomaly: could be real but highly unlikely compared to the distribution.
  • Correcting Outliers:
    • Deletion: not usually first choice due to potential data loss.
    • Winsorizing Imputation: cap extreme values at a threshold, often a percentile (e.g., the 5th/95th percentiles); see the sketch after this list.
    • Median/Mean-Based Imputation: use central values for true errors.
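
  A minimal sketch of winsorizing in pandas: cap values below the 5th percentile and above the 95th percentile at those thresholds (the data and percentile choices are illustrative):

      import pandas as pd

      s = pd.Series([3, 5, 6, 6, 7, 7, 8, 9, 60])  # one extreme value

      lo, hi = s.quantile(0.05), s.quantile(0.95)
      s_winsorized = s.clip(lower=lo, upper=hi)  # cap extremes at the thresholds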

Visually Identifying Outliers

  • Histograms and Boxplots help identify potential outliers and anomalies.
  • Example indicators: impossible values (e.g., extreme sleep hours) vs plausible extremes.

Mathematically Identifying Outliers

  • IQR Method (recommended in this course; see the sketch after this list):
    • IQR = Q3 - Q1
    • Lower boundary: LB = Q1 - 1.5 × IQR
    • Upper boundary: UB = Q3 + 1.5 × IQR
  • Z-score Method (not used in this course):
    • z = (X - μ) / σ
    • Outlier if |z| > 3, but only appropriate for roughly normal distributions.
  • The IQR method works well regardless of symmetry/skewness.
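
  A sketch of the IQR method in pandas, computing the boundaries defined above and flagging values outside them (toy data):

      import pandas as pd

      s = pd.Series([3, 5, 6, 6, 7, 7, 8, 9, 60])

      q1, q3 = s.quantile(0.25), s.quantile(0.75)
      iqr = q3 - q1
      lb, ub = q1 - 1.5 * iqr, q3 + 1.5 * iqr

      outliers = s[(s < lb) | (s > ub)]  # potential outliers to investigate
      print(f"LB={lb}, UB={ub}, flagged: {list(outliers)}")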

Correcting Outliers

  • Start with a trim: address obvious, extreme outliers first without imputing all at once.
  • Rationale: avoid overcorrecting too early; can impute later if needed.

Additional Outlier Methods

  • Deletion: use sparingly, to avoid losing the other valuable information in the record.
  • Winsorizing Imputation: cap upper/lower extremes at chosen thresholds (often percentiles, e.g., above the 95th percentile or below the 5th percentile).
  • Median-Based Imputation: for true data-entry errors, replace with central value.

Misspellings and Categorical Errors

  • Misspelled categories act as new, separate categories and inflate dimensionality.
  • Fix misspellings to ensure consistency in analyses.

Spelling Errors

  • Use sort and filter to locate misspelled categories and correct spellings.
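
  In code, a frequency table of the raw categories plays the role of sort-and-filter: misspellings show up as rare, near-duplicate categories. A pandas sketch (the category values and the correction mapping are illustrative):

      import pandas as pd

      s = pd.Series(["Ohio", "ohio", "Ohoi", "Texas", "Ohio"])

      # Inspect category counts; rare variants often signal misspellings
      print(s.value_counts())

      # Normalize case/whitespace, then map known misspellings to the correct label
      s_clean = s.str.strip().str.title().replace({"Ohoi": "Ohio"})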

Case Study: Dewey Defeats Truman

  • 1948 U.S. presidential election: the Chicago Tribune published the wrong winner ("Dewey Defeats Truman"); diagnosing the anomalous polling data could have revealed the error before publication.
  • Highlights importance of thorough data cleaning in analysis.