Diagnosing and Cleaning Data Notes 9/2
Data Preparation
- Data collection: systematic gathering of relevant data from various sources.
- Data wrangling (data preprocessing): cleaning, transforming, and integrating data to facilitate effective analysis and storytelling.
- Data management: storage of data to make it available for analysis.
- CLEANING DATA: fix or remove incomplete, abnormal, or inconsistent data to improve quality for accurate analysis and storytelling.
- TRANSFORMING DATA: organize and format data into a structure useful for efficient analysis.
- INTEGRATING DATA: merge multiple data sources to provide comprehensive insights beyond any single source.
- This module focuses primarily on CLEANING THE DATA.
Why diagnose and clean data?
- Cleaning/diagnosing data is one of the most important and time-consuming steps in analysis.
- Garbage In, Garbage Out: flawed data → flawed results.
- Diagnosing and cleaning data:
- Ensures data accuracy and reliability for analysis
- Prevents errors and biases that can lead to incorrect conclusions
Cleaning Data with Spreadsheet Functions
- Use spreadsheet functions to streamline cleaning processes.
- Sorting data: organizing information in a specific order for better analysis.
- Filtering data: focus on specific criteria or remove unwanted data entries.
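The notes above describe sorting and filtering in a spreadsheet; the same two operations look like this as a minimal pandas sketch (the data and column names are hypothetical, not from the course):

```python
import pandas as pd

# Hypothetical survey data; column names are illustrative only.
df = pd.DataFrame({
    "age": [19, 22, 21, 35, 20],
    "hours_sleep": [7, 6, 25, 8, 7],  # 25 is a likely data-entry error
})

# Sorting: order rows by a column (like Excel's Sort).
df_sorted = df.sort_values("age", ascending=True)

# Filtering: keep only rows meeting a criterion (like Excel's Filter).
plausible = df[df["hours_sleep"] <= 24]

print(df_sorted)
print(plausible)
```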
What is data cleaning?
- Data cleaning: fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset.
- TYPES OF ERROR DIAGNOSIS AND CLEANING WE WILL DO:
- Identifying and imputing missing values
- Identifying and imputing extreme outlier values and errors
- Identifying the shape of the variable's distribution
- Correcting misspellings
- Deleting duplicate records and/or records with missing dependent data
- Data Cleaning is iterative and often occurs multiple times throughout the data science lifecycle.
Understanding Variable Distribution
- Shapes: distribution shapes (e.g., symmetric, skewed) affect interpretation and modeling.
- Symmetry vs Skewness: determines appropriate measures of center and spread and whether transformation is needed for relationships/models.
Quantitative Distribution Shapes
- Understanding distribution shape helps interpret data, choose appropriate measures, and decide on transformations when modeling.
- Key distinctions: symmetry vs skewness.
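One quick way to act on the symmetry-vs-skewness distinction is to check sample skewness before choosing a measure of center; a minimal pandas sketch (the ±0.5 cutoff is a common rule of thumb, not a course rule, and the data are hypothetical):

```python
import pandas as pd

# Hypothetical ages; a few large values create right skew.
ages = pd.Series([18, 19, 19, 20, 20, 21, 22, 45, 52])

# Sample skewness: near 0 suggests symmetry; large |value| suggests skew.
skew = ages.skew()

# Rule of thumb: mean for roughly symmetric data, median otherwise.
center = ages.mean() if abs(skew) < 0.5 else ages.median()
print(f"skewness={skew:.2f}, chosen center={center}")
```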
Describing the distribution of a quantitative variable
- A histogram displays the distribution of a quantitative variable.
- Describing a distribution means characterizing its shape, center (location), and spread of values, not reporting a single value.
Categorical Distribution Shapes
- Understanding category distribution helps identify:
- The mode (most frequent category) for imputation
- Low-frequency/underrepresented categories that may need handling in downstream analyses
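A minimal pandas sketch of both checks, finding the mode and flagging low-frequency categories (the data are hypothetical):

```python
import pandas as pd

# Hypothetical categorical variable with one rare category.
majors = pd.Series(["Biology", "Biology", "Chemistry", "Biology",
                    "Chemistry", "Physics"])

counts = majors.value_counts()            # frequency table (like a Pivot Table)
mode_category = counts.idxmax()           # most frequent category
rare = counts[counts < 2].index.tolist()  # low-frequency categories

print(counts)
print("mode:", mode_category, "| rare:", rare)
```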
Missing Values
- Missing values are expected data points that are absent.
- Causes: data entry errors, equipment malfunctions, survey non-response, etc.
- If not addressed, can bias results and produce inaccurate conclusions.
- Missing values appear as blank cells or coded values such as 999 or \text{NaN}.
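Before any imputation, coded values must be converted into true missing values so they are not treated as real data; a minimal pandas sketch, assuming 999 is the dataset's missing-value code:

```python
import numpy as np
import pandas as pd

# Hypothetical column where 999 was used as a missing-value code.
income = pd.Series([52000, 999, 61000, 999, 48000])

income = income.replace(999, np.nan)  # recode sentinel values as true NaN
print(income.isna().sum(), "missing values")
```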
Strategies for Handling Missing Values
- Deletion: remove an observational unit when a value is missing for a variable of interest; not the first choice, because the unit's other variables may still hold valuable information. Reserved for missing values in the dependent variable.
- Imputation with Internal Data: use data within the dataset to estimate missing values; univariate strategies for independent variables.
- Imputation with External Data: use values from another dataset; requires credible relevance of external sources.
Deletion: Why not just delete?
- We should ONLY delete missing values in the dependent/target variable to avoid imputing the target of interest.
- Deletion can discard a lot of data and bias results if used inappropriately.
Imputation with Internal Data
- Univariate strategies (using the variable itself):
- Mean-Based Imputation: imputing with the average; appropriate for quantitative data with a symmetric distribution.
- Median-Based Imputation: imputing with the central value; suitable for symmetric or skewed distributions; recommended for quantitative data in this course (Excel: =MEDIAN()); see the sketch after this list.
- Mode-Based Imputation: imputing with the most frequent value; for categorical variables (Excel: Frequency Table / Pivot Table).
- Bivariate strategies: use relationships with other variables (e.g., regression-based models, KNN) to inform imputations; often beyond basic scope but more reliable in practice.
- Note: mean imputation can be inappropriate if distributions are not symmetric (example: ages with a few nontraditional students).
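A minimal pandas sketch of the two univariate strategies recommended above, median for quantitative variables and mode for categorical ones (the data and column names are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":   [19, 21, np.nan, 22, 20],             # quantitative
    "major": ["Bio", "Bio", None, "Chem", "Bio"],  # categorical
})

# Median-based imputation for a quantitative variable
# (the course's recommended default; robust to skew).
df["age"] = df["age"].fillna(df["age"].median())

# Mode-based imputation for a categorical variable.
df["major"] = df["major"].fillna(df["major"].mode()[0])

print(df)
```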
Missing Value Scenarios
- If a categorical variable has many missing values, reassign them to a new category "missing".
- If a quantitative variable has more than 30% missing values, consider dropping the variable unless there is a reason to keep it (see the sketch after this list).
- Large number of missing values scenarios:
- MCAR (Missing Completely At Random): missingness is random with no relation to other variables; use median-based univariate imputation or advanced ML techniques.
- MAR (Missing At Random): missingness related to other variables but not the variable itself; use information from other variables for imputation.
- MNAR (Missing Not At Random): missingness is related to the value of the missing variable itself.
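A minimal pandas sketch of the first two rules above, reassigning categorical missing values to a "Missing" category and dropping a quantitative variable with more than 30% missing (the data are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "region": ["N", None, "S", None, "E"],           # categorical
    "score":  [np.nan, np.nan, 71, np.nan, np.nan],  # mostly missing
})

# Categorical with many missing values: make "Missing" its own category.
df["region"] = df["region"].fillna("Missing")

# Quantitative with >30% missing: candidate for dropping.
frac_missing = df["score"].isna().mean()
if frac_missing > 0.30:
    df = df.drop(columns=["score"])

print(df)
```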
Missing at Random vs Missing Not at Random
- Missing at Random (MAR): missingness related to observed variables but not the missing variable itself (e.g., weight missing more often by gender).
- Missing Not at Random (MNAR): missingness related to the missing value or reason (e.g., lower education reporting less education).
Missing at Random
- If data are MAR but no usable relationship with other variables can be identified, impute with:
- Categorical: MODE (most frequent category).
- Quantitative: MEDIAN (robust to non-normality).
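When the missingness does relate to an observed variable, as in the MAR weight/gender example above, that relationship can inform the imputation; a minimal group-wise median sketch in pandas (the data are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical MAR scenario: weight is missing more often for one gender,
# so impute with the median weight *within* each gender group.
df = pd.DataFrame({
    "gender": ["F", "F", "M", "M", "F"],
    "weight": [135, np.nan, 180, 175, 140],
})

df["weight"] = df["weight"].fillna(
    df.groupby("gender")["weight"].transform("median")
)
print(df)
```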
Outliers
- An outlier is an extremely high or low data point relative to the rest of the data.
- Types:
- Data entry error: implausible value from data collection.
- Anomaly: could be real but highly unlikely compared to the distribution.
- Correcting Outliers:
- Deletion: not usually first choice due to potential data loss.
- Winsorizing Imputation: cap extreme values at a threshold (often a percentile, e.g., 95th/5th percentiles).
- Median/Mean-Based Imputation: use central values for true errors.
Visually Identifying Outliers
- Histograms and Boxplots help identify potential outliers and anomalies.
- Example indicators: impossible values (e.g., more than 24 hours of sleep in a day) vs. plausible extremes.
Mathematically Identifying Outliers
- IQR Method (recommended in this course; see the sketch after this list):
- \text{IQR} = Q_3 - Q_1
- Lower boundary: \text{LB} = Q_1 - 1.5\times\text{IQR}
- Upper boundary: \text{UB} = Q_3 + 1.5\times\text{IQR}
- Z-score Method (not used in this course):
- z = \frac{X - \mu}{\sigma}
- Outlier if |z| > 3, but only appropriate for roughly normal distributions.
- The IQR method works well regardless of symmetry/skewness.
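A minimal pandas sketch of the IQR boundaries defined above (the data are hypothetical):

```python
import pandas as pd

hours_sleep = pd.Series([6, 7, 7, 8, 8, 9, 25])  # 25 is suspicious

q1, q3 = hours_sleep.quantile(0.25), hours_sleep.quantile(0.75)
iqr = q3 - q1
lb, ub = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag any value outside [LB, UB] as a potential outlier.
outliers = hours_sleep[(hours_sleep < lb) | (hours_sleep > ub)]
print(f"LB={lb}, UB={ub}, outliers={outliers.tolist()}")
```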
Correcting Outliers
- Start with a trim: address obvious, extreme outliers first without imputing all at once.
- Rationale: avoid overcorrecting too early; can impute later if needed.
Additional Outlier Methods
- Deletion: generally avoided, since it discards the record's other valuable information.
- Winsorizing Imputation: cap upper/lower extremes at chosen thresholds, often percentiles (e.g., above the 95th or below the 5th percentile); see the sketch below.
- Median-Based Imputation: for true data-entry errors, replace with central value.
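A minimal pandas sketch of winsorizing at the 5th/95th percentiles (the data and units are hypothetical):

```python
import pandas as pd

incomes = pd.Series([30, 35, 38, 40, 42, 45, 400])  # $000s; 400 is extreme

# Winsorizing: cap values beyond the 5th/95th percentiles at those thresholds.
lower, upper = incomes.quantile(0.05), incomes.quantile(0.95)
winsorized = incomes.clip(lower=lower, upper=upper)

print(winsorized.tolist())
```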
Misspellings and Categorical Errors
- Misspelled categories act as new, separate categories and inflate dimensionality.
- Fix misspellings to ensure consistency in analyses.
Spelling Errors
- Use sort and filter to locate misspelled categories and correct spellings.
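Outside a spreadsheet, the same locate-and-correct step can be done by normalizing text and mapping known misspellings to the correct label; a minimal pandas sketch (the categories and mapping table are hypothetical):

```python
import pandas as pd

# Hypothetical categories with spelling/case/whitespace inconsistencies.
colors = pd.Series(["Blue", "blue ", "Bleu", "Green", "green"])

# Normalize whitespace and case, then map known misspellings
# to the correct label.
cleaned = colors.str.strip().str.lower().replace({"bleu": "blue"})

print(cleaned.value_counts())  # misspellings no longer inflate categories
```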
Case Study: Dewey Defeats Truman
- 1948 U.S. presidential election: the Chicago Daily Tribune went to press with the headline "Dewey Defeats Truman" based on early returns and flawed polling; Truman actually won. Diagnosing the anomalous data could have revealed the error before publication.
- Highlights importance of thorough data cleaning in analysis.