Diagnosing and Cleaning Data Notes 9/2
Data Preparation
- Data collection: systematic gathering of relevant data from various sources.
- Data wrangling (data preprocessing): cleaning, transforming, and integrating data to facilitate effective analysis and storytelling.
- Data management: storage of data to make it available for analysis.
- CLEANING DATA: fix or remove incomplete, abnormal, or inconsistent data to improve quality for accurate analysis and storytelling.
- TRANSFORMING DATA: organize and format data into a structure useful for efficient analysis.
- INTEGRATING DATA: merge multiple data sources to provide comprehensive insights beyond any single source.
- This module focuses primarily on CLEANING THE DATA.
Why diagnose and clean data?
- Cleaning/diagnosing data is one of the most important and time-consuming steps in analysis.
- Garbage In, Garbage Out: flawed data → flawed results.
- Diagnosing and cleaning data:
- Ensures data accuracy and reliability for analysis
- Prevents errors and biases that can lead to incorrect conclusions
Cleaning Data with Spreadsheet Functions
- Use spreadsheet functions to streamline cleaning processes.
- Sorting data: organizing information in a specific order for better analysis.
- Filtering data: focus on specific criteria or remove unwanted data entries.
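The notes above describe sorting and filtering in a spreadsheet; the same two operations look like this as a minimal pandas sketch (the data and column names are hypothetical, not from the course):

```python
import pandas as pd

# Hypothetical survey data; column names are illustrative only.
df = pd.DataFrame({
    "age": [19, 22, 21, 35, 20],
    "hours_sleep": [7, 6, 25, 8, 7],  # 25 is a likely data-entry error
})

# Sorting: order rows by a column (like Excel's Sort).
df_sorted = df.sort_values("age", ascending=True)

# Filtering: keep only rows meeting a criterion (like Excel's Filter).
plausible = df[df["hours_sleep"] <= 24]

print(df_sorted)
print(plausible)
```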
What is data cleaning?
- Data cleaning: fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset.
- TYPES OF ERROR DIAGNOSIS AND CLEANING WE WILL DO:
- Identifying and imputing missing values
- Identifying and imputing extreme outlier values and errors
- Identifying the shape of the variable's distribution
- Correcting misspellings
- Deleting duplicate records and/or records with missing dependent data
- Data Cleaning is iterative and often occurs multiple times throughout the data science lifecycle.
Understanding Variable Distribution
- Shapes: distribution shapes (e.g., symmetric, skewed) affect interpretation and modeling.
- Symmetry vs Skewness: determines appropriate measures of center and spread and whether transformation is needed for relationships/models.
Quantitative Distribution Shapes
- Understanding distribution shape helps interpret data, choose appropriate measures, and decide on transformations when modeling.
- Key distinctions: symmetry vs skewness.
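One quick way to act on the symmetry-vs-skewness distinction is to check sample skewness before choosing a measure of center; a minimal pandas sketch (the ±0.5 cutoff is a common rule of thumb, not a course rule, and the data are hypothetical):

```python
import pandas as pd

# Hypothetical ages; a few large values create right skew.
ages = pd.Series([18, 19, 19, 20, 20, 21, 22, 45, 52])

# Sample skewness: near 0 suggests symmetry; large |value| suggests skew.
skew = ages.skew()

# Rule of thumb: mean for roughly symmetric data, median otherwise.
center = ages.mean() if abs(skew) < 0.5 else ages.median()
print(f"skewness={skew:.2f}, chosen center={center}")
```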
Describing the distribution of a quantitative variable
- A histogram displays the distribution of a quantitative variable.
- Describing a distribution means characterizing its shape, center (location), and spread of values, not reporting a single value.
Categorical Distribution Shapes
- Understanding category distribution helps identify:
- The mode (most frequent category) for imputation
- Low-frequency/underrepresented categories that may need handling in downstream analyses
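A minimal pandas sketch of both checks, finding the mode and flagging low-frequency categories (the data are hypothetical):

```python
import pandas as pd

# Hypothetical categorical variable with one rare category.
majors = pd.Series(["Biology", "Biology", "Chemistry", "Biology",
                    "Chemistry", "Physics"])

counts = majors.value_counts()            # frequency table (like a Pivot Table)
mode_category = counts.idxmax()           # most frequent category
rare = counts[counts < 2].index.tolist()  # low-frequency categories

print(counts)
print("mode:", mode_category, "| rare:", rare)
```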
Missing Values
- Missing values are expected data points that are absent.
- Causes: data entry errors, equipment malfunctions, survey non-response, etc.
- If not addressed, can bias results and produce inaccurate conclusions.
- Missing values appear as blank cells or coded values such as 999 or \text{NaN}.
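Before any imputation, coded values must be converted into true missing values so they are not treated as real data; a minimal pandas sketch, assuming 999 is the dataset's missing-value code:

```python
import numpy as np
import pandas as pd

# Hypothetical column where 999 was used as a missing-value code.
income = pd.Series([52000, 999, 61000, 999, 48000])

income = income.replace(999, np.nan)  # recode sentinel values as true NaN
print(income.isna().sum(), "missing values")
```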
Strategies for Handling Missing Values
- Deletion: remove an observational unit when a value is missing for a variable of interest; not the first choice, because the unit's other variables may still hold valuable information. Reserved for missing values in the dependent variable.
- Imputation with Internal Data: use data within the dataset to estimate missing values; univariate strategies for independent variables.
- Imputation with External Data: use values from another dataset; requires credible relevance of external sources.
Deletion: Why not just delete?
- We should ONLY delete missing values in the dependent/target variable to avoid imputing the target of interest.
- Deletion can discard a lot of data and bias results if used inappropriately.
Imputation with Internal Data
- Univariate strategies (using the variable itself):
- Mean-Based Imputation: imputing with the average; appropriate for quantitative data with a symmetric distribution.
- Median-Based Imputation: imputing with the central value; suitable for symmetric or skewed distributions; recommended for quantitative data in this course (Excel: =MEDIAN()); see the sketch after this list.
- Mode-Based Imputation: imputing with the most frequent value; for categorical variables (Excel: Frequency Table / Pivot Table).
- Bivariate strategies: use relationships with other variables (e.g., regression-based models, KNN) to inform imputations; often beyond basic scope but more reliable in practice.
- Note: mean imputation can be inappropriate if distributions are not symmetric (example: ages with a few nontraditional students).
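A minimal pandas sketch of the two univariate strategies recommended above, median for quantitative variables and mode for categorical ones (the data and column names are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":   [19, 21, np.nan, 22, 20],             # quantitative
    "major": ["Bio", "Bio", None, "Chem", "Bio"],  # categorical
})

# Median-based imputation for a quantitative variable
# (the course's recommended default; robust to skew).
df["age"] = df["age"].fillna(df["age"].median())

# Mode-based imputation for a categorical variable.
df["major"] = df["major"].fillna(df["major"].mode()[0])

print(df)
```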
Missing Value Scenarios
- If a categorical variable has many missing values, reassign them to a new category "missing".
- If a quantitative variable has more than 30% missing values, consider dropping the variable unless there is a reason to keep it (see the sketch after this list).
- Large number of missing values scenarios:
- MCAR (Missing Completely At Random): missingness is random with no relation to other variables; use median-based univariate imputation or advanced ML techniques.
- MAR (Missing At Random): missingness related to other variables but not the variable itself; use information from other variables for imputation.
- MNAR (Missing Not At Random): missingness is related to the value of the missing variable itself.
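A minimal pandas sketch of the first two rules above, reassigning categorical missing values to a "Missing" category and dropping a quantitative variable with more than 30% missing (the data are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "region": ["N", None, "S", None, "E"],           # categorical
    "score":  [np.nan, np.nan, 71, np.nan, np.nan],  # mostly missing
})

# Categorical with many missing values: make "Missing" its own category.
df["region"] = df["region"].fillna("Missing")

# Quantitative with >30% missing: candidate for dropping.
frac_missing = df["score"].isna().mean()
if frac_missing > 0.30:
    df = df.drop(columns=["score"])

print(df)
```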
Missing at Random vs Missing Not at Random
- Missing at Random (MAR): missingness related to observed variables but not the missing variable itself (e.g., weight missing more often by gender).
- Missing Not at Random (MNAR): missingness related to the missing value or reason (e.g., lower education reporting less education).
Missing at Random
- If data are MAR but no usable relationship with other variables can be identified, impute with:
- Categorical: MODE (most frequent category).
- Quantitative: MEDIAN (robust to non-normality).
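When the missingness does relate to an observed variable, as in the MAR weight/gender example above, that relationship can inform the imputation; a minimal group-wise median sketch in pandas (the data are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical MAR scenario: weight is missing more often for one gender,
# so impute with the median weight *within* each gender group.
df = pd.DataFrame({
    "gender": ["F", "F", "M", "M", "F"],
    "weight": [135, np.nan, 180, 175, 140],
})

df["weight"] = df["weight"].fillna(
    df.groupby("gender")["weight"].transform("median")
)
print(df)
```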
Outliers
- An outlier is an extremely high or low data point relative to the rest of the data.
- Types:
- Data entry error: implausible value from data collection.
- Anomaly: could be real but highly unlikely compared to the distribution.
- Correcting Outliers:
- Deletion: not usually first choice due to potential data loss.
- Winsorizing Imputation: cap extreme values at a threshold (often a percentile, e.g., 95th/5th percentiles).
- Median/Mean-Based Imputation: use central values for true errors.
Visually Identifying Outliers
- Histograms and Boxplots help identify potential outliers and anomalies.
- Example indicators: impossible values (e.g., more than 24 hours of sleep in a day) vs. plausible extremes.
Mathematically Identifying Outliers
- IQR Method (recommended in this course; see the sketch after this list):
- \text{IQR} = Q_3 - Q_1
- Lower boundary: \text{LB} = Q_1 - 1.5\times\text{IQR}
- Upper boundary: \text{UB} = Q_3 + 1.5\times\text{IQR}
- Z-score Method (not used in this course):
- z = \frac{X - \mu}{\sigma}
- Outlier if |z| > 3, but only appropriate for roughly normal distributions.
- The IQR method works well regardless of symmetry/skewness.
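A minimal pandas sketch of the IQR boundaries defined above (the data are hypothetical):

```python
import pandas as pd

hours_sleep = pd.Series([6, 7, 7, 8, 8, 9, 25])  # 25 is suspicious

q1, q3 = hours_sleep.quantile(0.25), hours_sleep.quantile(0.75)
iqr = q3 - q1
lb, ub = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag any value outside [LB, UB] as a potential outlier.
outliers = hours_sleep[(hours_sleep < lb) | (hours_sleep > ub)]
print(f"LB={lb}, UB={ub}, outliers={outliers.tolist()}")
```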
Correcting Outliers
- Start with a trim: address obvious, extreme outliers first without imputing all at once.
- Rationale: avoid overcorrecting too early; can impute later if needed.
Additional Outlier Methods
- Deletion: generally avoided, since it discards the record's other valuable information.
- Winsorizing Imputation: cap upper/lower extremes at chosen thresholds, often percentiles (e.g., above the 95th or below the 5th percentile); see the sketch below.
- Median-Based Imputation: for true data-entry errors, replace with central value.
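A minimal pandas sketch of winsorizing at the 5th/95th percentiles (the data and units are hypothetical):

```python
import pandas as pd

incomes = pd.Series([30, 35, 38, 40, 42, 45, 400])  # $000s; 400 is extreme

# Winsorizing: cap values beyond the 5th/95th percentiles at those thresholds.
lower, upper = incomes.quantile(0.05), incomes.quantile(0.95)
winsorized = incomes.clip(lower=lower, upper=upper)

print(winsorized.tolist())
```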
Misspellings and Categorical Errors
- Misspelled categories act as new, separate categories and inflate dimensionality.
- Fix misspellings to ensure consistency in analyses.
Spelling Errors
- Use sort and filter to locate misspelled categories and correct spellings.
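Outside a spreadsheet, the same locate-and-correct step can be done by normalizing text and mapping known misspellings to the correct label; a minimal pandas sketch (the categories and mapping table are hypothetical):

```python
import pandas as pd

# Hypothetical categories with spelling/case/whitespace inconsistencies.
colors = pd.Series(["Blue", "blue ", "Bleu", "Green", "green"])

# Normalize whitespace and case, then map known misspellings
# to the correct label.
cleaned = colors.str.strip().str.lower().replace({"bleu": "blue"})

print(cleaned.value_counts())  # misspellings no longer inflate categories
```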
Case Study: Dewey Defeats Truman
- 1948 U.S. presidential election: the Chicago Daily Tribune went to press with the headline "Dewey Defeats Truman" based on early returns and flawed polling; Truman actually won. Diagnosing the anomalous data could have revealed the error before publication.
- Highlights importance of thorough data cleaning in analysis.