Lecture Note: Data Cleaning and Transformation
Bias and Variance Trade-off
- Previously, we discussed bias and variance, and regularization as a method to balance model complexity by adjusting lambda.
- Regularization appears in many methods beyond linear models, e.g., tree-based models for regression and classification, where tree complexity must be limited.
Regularization and Lambda Selection
- Regularization involves a lambda parameter that controls the penalty strength for both L1 (Lasso) and L2 (Ridge) regularization.
- Cross-validation can be employed to select the lambda hyperparameter. Grid search allows testing different lambda values, comparing models, and selecting the best one using cross-validation techniques (e.g., k-fold cross-validation).
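A minimal sketch of this workflow, assuming scikit-learn (which exposes lambda as `alpha`) and synthetic data for illustration:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Synthetic regression data, purely for illustration
X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

# Candidate lambda values (scikit-learn calls the penalty strength `alpha`)
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}

# 5-fold cross-validation over the grid; the best alpha minimizes CV error
search = GridSearchCV(Lasso(max_iter=10_000), param_grid, cv=5,
                      scoring="neg_mean_squared_error")
search.fit(X, y)
print("Best lambda (alpha):", search.best_params_["alpha"])
```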
Data Cleaning and Transformation
- The goal is to transform data-related tasks into a more enjoyable process, focusing on cleaning, transformation, and visualization.
- Data integration is assumed, meaning data is accessible in a unified way.
- Data cleaning and preprocessing are akin to alchemy, with various approaches.
Data Accuracy: Detection and Correction
The objective is to identify and rectify inaccuracies in the data before modeling.
Example dataset: A table of individuals with weight, height, and age.
- Issues: Negative weight (impossible), excessively high weight for a six-month-old baby.
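A minimal pandas sketch of such a sanity check; the column names, values, and plausibility thresholds are hypothetical:

```python
import pandas as pd

# Hypothetical table of individuals; column names are assumptions
df = pd.DataFrame({
    "weight_kg": [72.0, -5.0, 80.0, 7.5],
    "height_cm": [175.0, 168.0, 62.0, 68.0],
    "age_years": [34.0, 41.0, 0.5, 0.6],
})

# Boolean masks flag physically impossible or implausible records
negative_weight = df["weight_kg"] < 0
implausible_baby = (df["age_years"] < 1) & (df["weight_kg"] > 15)
print(df[negative_weight | implausible_baby])
```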
Missing Values
- Virtually all datasets contain missing values, often represented as "NaN" (Not a Number) in data frames.
- Methods to identify them in pandas:
  - `df.isna().sum()` returns the number of NaN values per column.
  - `df['column'].isna()` returns a boolean mask of NaN values in a column.
- Often, missing values are instead represented by specific sentinel values (e.g., a huge number or a string), necessitating careful examination of downloaded data.
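A small sketch of converting such sentinel representations to proper NaNs; the sentinel codes `-999` and `"unknown"` and the data frame are assumptions for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [52000, -999, 61000],
                   "city": ["Oslo", "unknown", "Bergen"]})

# Replace the assumed sentinel codes with NaN so pandas treats them as missing
df = df.replace({-999: np.nan, "unknown": np.nan})
print(df.isna().sum())

# Alternatively, declare the sentinels already when reading the file:
# pd.read_csv("data.csv", na_values=[-999, "unknown"])
```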
Reasons for Missing Values
- Human error, survey interruptions, and sensor failures are common causes.
- Categorizing missing values is important for addressing them appropriately.
- Types: Missing Completely at Random (MCAR), Missing at Random (MAR), and Missing Not at Random (MNAR).
Types of Missing Values
Missing Completely at Random (MCAR)
- The likelihood of a value being missing is independent of both the observed data and the missing value itself.
- Example: Air quality sensor failing randomly.
- Survey example: Respondent skipping a question randomly, unrelated to the question's content or the respondent.
Missing at Random (MAR)
- The probability of a value being missing is related to observed data but not to the missing value itself.
- Example: An air quality sensor failing when wind speed is high (missingness depends on the observed wind speed, not on the air quality value itself).
- Survey example: Younger respondents being less likely to answer an income question; the missingness depends on age (observed), not on the income value itself.
Missing Not at Random (MNAR)
- The probability of a value being missing is related to the missing value itself.
- Example: Air quality sensor failing when air quality is very poor.
- Survey example: High-income individuals less likely to answer an income question.
Approaches to Handling Missing Values
Principles:
- Preserve as much data as possible to improve model performance.
- Minimize bias.
Considerations:
- Reasons for missing values.
- Data type.
Four Approaches:
- Keep missing values as is.
- Remove rows with missing values.
- Remove columns with missing values.
- Impute missing values.
Approach 1: Keep Missing Values
- Applicable when sharing data with third parties, as the recipient may have their own strategies for handling missing values.
- Useful when downstream methods handle missing values explicitly (e.g., k-NN imputation applied later in the pipeline).
Approach 2: Remove Rows
- Also known as removing observations.
- Should be avoided for MAR and MNAR data, as it can introduce bias. For example, removing everyone who did not answer an income question, when the missingness is related to income itself, will bias the model.
- Acceptable as a last resort for MCAR data, but only after other options have been exhausted (a pandas sketch of row and column removal follows Approach 3).
Approach 3: Remove Columns
- Involves removing features (columns) with missing values.
- Problematic if the project focuses on the removed feature.
- A practical, yet debatable, threshold is 25% missing values. If a column has more than 25% missing data, it may be considered for removal.
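A minimal pandas sketch of Approaches 2 and 3 on a hypothetical data frame, using the 25% threshold mentioned above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [34, 41, 29, 55],
    "income": [np.nan, np.nan, 61000, np.nan],  # 75% missing
})

# Approach 2: drop every row that contains at least one missing value
df_rows_removed = df.dropna(axis=0)

# Approach 3: drop columns whose missing fraction exceeds the 25% threshold
keep = df.columns[df.isna().mean() <= 0.25]
df_cols_removed = df[keep]
```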
Approach 4: Impute Data
Involves estimating missing values based on other available values.
Methods:
- Regression analysis: Estimate income based on neighborhood.
- Statistical imputation: Using mean, median, or mode to fill missing values.
- Time series interpolation: Using moving averages for missing values in time series data.
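A sketch of two of these methods, statistical imputation and time series interpolation, assuming scikit-learn and pandas with hypothetical data:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"income": [52000.0, np.nan, 61000.0, 58000.0],
                   "city": ["Oslo", "Bergen", np.nan, "Oslo"]})

# Statistical imputation: mean for a numeric column, mode for a categorical one
df["income"] = SimpleImputer(strategy="mean").fit_transform(df[["income"]]).ravel()
df["city"] = SimpleImputer(strategy="most_frequent").fit_transform(df[["city"]]).ravel()

# Time series interpolation: fill gaps from neighboring points
ts = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])
ts_filled = ts.interpolate()
```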
Outliers
- Outliers are extreme values in the dataset that require handling based on the model used.
- Outliers can result from errors, represent actual extreme values, or indicate fraud.
Detection
Visual inspection of data distributions is one method.
A common method uses quartiles to identify values far from the median: the interquartile range (IQR) is defined as IQR = Q3 - Q1, where Q1 and Q3 are the first quartile (25th percentile) and third quartile (75th percentile), respectively.
- Lower bound = Q1 - k \times IQR
- Upper bound = Q3 + k \times IQR
- where k is a parameter (typically 1.5); values outside these bounds are flagged as outliers.
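A direct translation of these formulas into pandas; the sample values are hypothetical:

```python
import pandas as pd

values = pd.Series([10, 12, 11, 13, 12, 14, 11, 120])  # 120 looks extreme

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
k = 1.5
lower, upper = q1 - k * iqr, q3 + k * iqr

# Values outside the IQR-based bounds are flagged as outliers
outliers = values[(values < lower) | (values > upper)]
print(outliers)  # flags 120
```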
Handling Outliers
Approaches:
- Do nothing → some models (e.g., neural networks) can handle outliers.
- Use upper and lower caps → replace outliers with maximum and minimum allowed values (see the sketch after this list).
- Log transformation → useful when data has a long-tailed distribution.
- Remove outliers → use with caution, as it can introduce bias.
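A sketch of the capping and log-transformation approaches on the same hypothetical values, reusing the IQR bounds from above:

```python
import numpy as np
import pandas as pd

values = pd.Series([10, 12, 11, 13, 12, 14, 11, 120])

# Capping: clip values to the IQR-based bounds instead of removing them
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
capped = values.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

# Log transformation: compresses the long tail (log1p also handles zeros)
logged = np.log1p(values)
```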
Errors
- Errors are inaccurate data values, such as negative weight, which are not representative of the phenomenon being measured.
Types of Errors
- Random errors: Unavoidable and often handled by models like linear regression.
- Systematic errors: Problematic as they occur under specific situations and can introduce bias.
Standard Data Transformation
- Data transformation involves changing data formats to enable the use of specific models. Scaling brings all features to a comparable range, which helps many models.
- Example: Transforming categorical data for regression analysis.
Common Transformation Techniques
Standardization → scaling data to have a mean of zero and unit variance for each column.
- x_{standardized} = \frac{x - \mu}{\sigma}, where \mu is the mean and \sigma is the standard deviation.
Normalization → scaling data to a range between zero and one.
- x_{normalized} = \frac{x - x_{min}}{x_{max} - x_{min}}
Log transformation → applying a logarithm to the data, useful for reducing skewness.
- Improves linear regression when data is long-tailed (e.g., income).
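A sketch of these three transformations, assuming scikit-learn for the scalers; the income-like values are hypothetical:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical long-tailed income column (one feature, four samples)
X = np.array([[50_000.0], [62_000.0], [58_000.0], [1_200_000.0]])

X_std = StandardScaler().fit_transform(X)   # mean 0, unit variance per column
X_norm = MinMaxScaler().fit_transform(X)    # scaled to [0, 1] per column
X_log = np.log(X)                           # reduces the right skew
```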
Categorical Features
Transformation requirements:
- Categorical features (e.g., scenery or city names) need to be transformed into numerical features for use in many machine learning models.
Methods:
- One-hot encoding → creating a binary column for each category.
- Ordering → ranking categories based on an inherent order, where applicable.
- Attribute creation → creating a numerical attribute based on domain knowledge.
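A sketch of the first two methods in pandas; the categories and the ordering map are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({"city": ["Oslo", "Bergen", "Oslo"],
                   "size": ["small", "large", "medium"]})

# One-hot encoding: one binary column per category
onehot = pd.get_dummies(df["city"], prefix="city")

# Ordering: map an inherently ordered category to integer ranks
size_order = {"small": 0, "medium": 1, "large": 2}
df["size_rank"] = df["size"].map(size_order)
```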
Numerical Features into Categories
- The inverse process, transforming numerical features into categorical ones, can also be useful depending on the application.
- Example: transform weight and height into BMI, and then bin BMI into categories; these categories have a natural order.
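A sketch of this numerical-to-categorical transformation; the rows are hypothetical and the bins follow the standard WHO BMI cut-offs:

```python
import pandas as pd

df = pd.DataFrame({"weight_kg": [58.0, 82.0, 104.0],
                   "height_m": [1.70, 1.78, 1.75]})

# Derive BMI, then bin it into ordered categories
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2
df["bmi_category"] = pd.cut(
    df["bmi"],
    bins=[0, 18.5, 25, 30, float("inf")],
    labels=["underweight", "normal", "overweight", "obese"],
)
```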
Smoothing Transformations
- Moving averages: Averaging values over a previous time window to reduce noise, commonly used in time series data.
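A minimal sketch of a moving average in pandas, using synthetic data:

```python
import numpy as np
import pandas as pd

# Synthetic noisy series; a 7-step moving average smooths out the noise
rng = np.random.default_rng(0)
ts = pd.Series(rng.normal(size=30)).cumsum()
smoothed = ts.rolling(window=7).mean()
```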
Data Visualization Tips
- Helps in understanding and communicating data insights.
- Avoid ineffective visualization types such as pie charts.
- Use bar charts and heatmaps to improve readability and increase insight extraction.