ACCT 331 LECTURE 12
Exploratory Data Analysis (EDA) and Feature Selection
Introduction to Exploratory Data Analysis (EDA)
Definition: EDA involves analyzing datasets to summarize their main characteristics, often using visual methods. It is critical for finding patterns and understanding relationships that might not be obvious from summary statistics alone.
Purpose: EDA builds a better understanding of the data, making it easier to discover patterns, spot anomalies, test existing hypotheses, and even form new hypotheses from unexpected insights.
Nature: It's more akin to statistical analysis and data visualization than machine learning per se, making data relationships easily discernible for humans.
Importance: It is considered a very critical step, often described as the "heart" of the data analysis process, leading to insights that help refine algorithms and testing strategies.
Key Concepts in EDA and Machine Learning
Early Steps: EDA follows data dictionary creation and schema definition, moving into data preparation, cleaning, coding, feature engineering, and class labeling.
Outlier Detection: A key aspect of EDA is the identification and investigation of outliers, which can represent genuinely unusual but important data points, or simply bad data requiring correction.
Feature Engineering and Selection: EDA helps in determining correlations and feature importance, aiding in deciding which features to include or exclude from a model. This is crucial because having more features is not necessarily beneficial.
Hyperparameter Considerations: Insights from EDA can help in thinking about hyperparameters, which are parameters set by the data scientist/operator, not learned by the model.
Model Selection: EDA aids in identifying key features, understanding data characteristics, and verifying assumptions, potentially leading to new hypotheses based on visualizations.
Handling Missing Data: EDA is particularly helpful for deciding whether to impute or delete missing values, applying statistical techniques to guide the choice.
Common Visualization Techniques in EDA
Histograms
Definition: A graphical representation of the distribution of data, displaying data by grouping it into "bins" or intervals and counting observations within each bin.
Example: Instead of charting individual ages (18, 19, 20…), one might bin them into ranges (such as five- or ten-year intervals) to make the distribution more meaningful.
Use Cases: Understanding data distribution, identifying patterns, and detecting outliers.
Insights: Helps visualize central tendency and spread, allowing for further investigation of outliers (e.g., if they are significant or bad data).
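As a sketch of the binning idea above, the counts a histogram displays can be computed directly with NumPy. The ages and bin edges below are made up for illustration; note the outlier (95) that lands alone in the top bins and would merit further investigation.

```python
import numpy as np

# Hypothetical ages; binning reveals the distribution's shape, central
# tendency, and outliers (the 95-year-old here).
ages = np.array([18, 19, 20, 21, 21, 22, 22, 23, 24, 25,
                 26, 28, 30, 31, 35, 40, 44, 52, 60, 95])

# Group observations into bins and count how many fall in each.
# (NumPy's last bin includes its right edge.)
counts, edges = np.histogram(ages, bins=[18, 30, 45, 60, 100])
for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"{lo}-{hi}: {c}")
```

A plotting library (e.g., matplotlib's `plt.hist`) draws the same counts as bars; the underlying computation is identical.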
Scatter Plots
A very common method for visualizing data and exploring relationships between two variables.
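A minimal sketch of what a scatter plot reveals, using synthetic data (the variables "ad spend" and "units sold" are invented for illustration). The Pearson correlation coefficient quantifies the linear relationship the plot would make visible.

```python
import numpy as np

# Hypothetical data: advertising spend vs. units sold, with a roughly
# linear positive relationship plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 100, size=50)          # ad spend
y = 3.0 * x + rng.normal(0, 10, size=50)  # units sold

# With matplotlib installed, the plot itself would be:
#   import matplotlib.pyplot as plt
#   plt.scatter(x, y); plt.xlabel("ad spend"); plt.ylabel("units sold")

# The Pearson correlation quantifies what the plot shows visually.
r = np.corrcoef(x, y)[0, 1]
print(f"correlation: {r:.2f}")
```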
Correlation Matrix
Definition: A table that displays the pairwise correlations between multiple variables.
Correlation Range: Values range from -1 to +1.
-1: Perfect negative correlation (as one variable increases, the other decreases).
+1: Perfect positive correlation (as one variable increases, the other also increases).
0: No linear correlation.
Use Case: Helps discern relationships between features (e.g., Feature 1 is negatively correlated with Feature 3, or Feature 2 is correlated with multiple others), guiding moves in model development.
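With pandas, the full pairwise table comes from a single `DataFrame.corr()` call. The tiny dataset below is constructed so the relationships are exact: feature2 is a positive multiple of feature1, and feature3 moves in the opposite direction.

```python
import pandas as pd

# Hypothetical feature table; corr() computes pairwise Pearson correlations.
df = pd.DataFrame({
    "feature1": [1, 2, 3, 4, 5],
    "feature2": [2, 4, 6, 8, 10],   # perfectly positively correlated with feature1
    "feature3": [10, 8, 6, 4, 2],   # perfectly negatively correlated with feature1
})
corr = df.corr()
print(corr.round(2))
```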
Pair Plots (Scatter Plot Matrix)
Definition: Displays pairwise relationships between features, similar to a scatter plot matrix.
Key Difference: Unlike a simple scatter plot matrix, pair plots often include a bar chart or histogram that illustrates the distribution of each individual feature along the diagonal.
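A sketch of the pair-plot structure using an invented two-feature dataset. The plotting call itself is shown in comments (it requires matplotlib); the loop computes what the diagonal cells display, namely each feature's histogram.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset. With matplotlib installed, a pair plot is one call:
#   pd.plotting.scatter_matrix(df, diagonal="hist")
# (seaborn's pairplot(df) is similar): off-diagonal cells are pairwise
# scatter plots; the diagonal holds each individual feature's histogram.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "height_cm": rng.normal(170, 10, 200),
    "weight_kg": rng.normal(70, 12, 200),
})

# What the diagonal cells show: one histogram per feature.
for col in df.columns:
    counts, edges = np.histogram(df[col], bins=10)
    print(col, "histogram counts:", counts.tolist())
```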
Crosstabs (Contingency Tables)
Definition: A simple table used to look at frequency distributions and relationships between categorical variables.
Use Cases: Identifying patterns and performing various statistical tests, especially with categorical data.
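A minimal crosstab sketch with invented categorical data (purchase channel vs. product size). `pd.crosstab` builds the contingency table of joint frequencies; such a table is also the input to tests like chi-squared.

```python
import pandas as pd

# Hypothetical categorical data.
df = pd.DataFrame({
    "channel": ["online", "store", "online", "online", "store", "store"],
    "size":    ["M",      "M",     "L",      "L",      "S",     "M"],
})

# Rows: channel categories; columns: size categories; cells: counts.
table = pd.crosstab(df["channel"], df["size"])
print(table)
```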
Real-World Example: H&M Project
Context: A project for H&M (specifically the Swedish parent company) to improve product planning.
Business Objective: To predict online and in-store sales more accurately for optimal product inventory (right size, right color, right time) by analyzing demand and sales.
Product Identification: SKUs (Stock Keeping Units) are used instead of just products (e.g., a green T-shirt comes in S, M, L, XL, each being a distinct SKU).
Project Goals: Understand the impact on weekly/monthly sales, product features, online purchase behavior trends, identify features contributing to sales, and use insights for model selection.
Data Volume:
SKUs
million online observations
million store observations/transactions
million total sales
Data Dictionary: A crucial first step, as previously discussed, to understand data details (e.g., clarifying reasons for discounts, which was relevant for understanding employee vs. customer promotions).
Insights from EDA:
Average Monthly Sales of T-shirts: Bar charts revealed seasonality, with significant sales peaks around December holidays for in-store sales, which was consistent with expectations.
Online Sales Distribution by Size: Analyzing T-shirt sales by size revealed a generally normal distribution, but with unexpected peaks in demand for very large sizes (XXL, etc.). This indicated that H&M was assuming a normal distribution and thus under-merchandising these outlier sizes (not buying enough small and large sizes).
Business Impact: This finding shifted predictive models from simple linear relationships to more complex ones, emphasizing the value of visualizing data to uncover non-obvious patterns.
Out-of-Stock Ratio: A proxy metric created to quantify lost revenue from unmet demand when products were not in stock. The chart showed the distribution of such occurrences, often extending several standard deviations from the mean, highlighting significant missed sales opportunities.
Key Features Identified for Future Exploration: Previous week/month sales, wide enhancement correlations, time-to-date, seasonality, demand, holiday, product attributes, and product lifecycle.
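One way such a proxy metric could be computed, as a hedged sketch: for each SKU, the share of days with demand on which the item was out of stock. The column names (`sku`, `demanded`, `in_stock`) and the data are hypothetical, not H&M's actual schema.

```python
import pandas as pd

# Hypothetical daily records: was the SKU demanded, and was it in stock?
days = pd.DataFrame({
    "sku":      ["T1", "T1", "T1", "T1", "T2", "T2", "T2", "T2"],
    "demanded": [True, True, True, True, True, False, True, True],
    "in_stock": [True, False, False, True, True, True, True, True],
})

# A sale is "missed" when the item was demanded but not in stock.
missed = days["demanded"] & ~days["in_stock"]

# Out-of-stock ratio per SKU: missed-demand days / total demand days.
oos_ratio = missed.groupby(days["sku"]).sum() / days.groupby("sku")["demanded"].sum()
print(oos_ratio)
```

Here T1 was unavailable on half the days it was demanded, while T2 never missed a sale; plotting the distribution of this ratio across all SKUs is what surfaced the missed-revenue pattern.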
Features and Dimensionality
What are Features?
Definition: Features represent different attributes or variables within the data.
Relationship to Dimensions: Each feature directly corresponds to a dimension in the data space. More features mean more dimensions.
Example: For an image classifier, each pixel value is a feature, so an image with N pixels has N features and exists in an N-dimensional space.
High Dimensional Space
Concept: In machine learning, we often operate with many features (e.g., hundreds or thousands), leading to high-dimensional spaces.
Human Visualization Challenge: Humans can easily visualize 2D and 3D, but struggle with 4D or higher, making the concept difficult to teach or illustrate directly.
Analogy (from Video):
People as High-Dimensional: Think of people (e.g., famous scientists) described by dimensions like birth year, birth location, field of study. Machine learning sees these as numbers and places related points closer together in high-dimensional space.
Words as High-Dimensional: Words can be represented as data points with many dimensions (often hundreds). Techniques like t-SNE can cluster words based on their meaning, even without explicitly telling the computer their meaning, by analyzing usage patterns in millions of sentences (e.g., clusters for numbers, months, cities, musical terms).
Images as High-Dimensional: Handwritten digits (0-9) are treated as images where each pixel is a dimension (e.g., a 28×28 image has 784 dimensions). t-SNE can cluster these images in high-dimensional space, learning their meaning and grouping similar digits.
The Curse of Dimensionality
Problem: As the number of features (dimensions) increases, the space "explodes" in size. Data points become increasingly sparse and separated within this vast space.
Consequence: Algorithms find it much harder to discover meaningful relationships and patterns across widely separated data points, leading to decreased model performance and increased computational cost.
Example: A grayscale image that is N pixels on each side exists in an N²-dimensional space, so even modestly sized images produce thousands or millions of dimensions.
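The sparsity effect can be demonstrated numerically. In the sketch below (synthetic uniform data), distances from one point to all others are measured as the dimension grows: the relative gap between the nearest and farthest neighbor collapses, which is exactly why "closeness" loses meaning in high dimensions.

```python
import numpy as np

# Draw 100 random points and measure distances from the first point to
# the rest. As dimensionality grows, distances concentrate: the relative
# spread between nearest and farthest neighbor shrinks.
rng = np.random.default_rng(0)
spreads = []
for d in (2, 100, 10_000):
    pts = rng.uniform(size=(100, d))
    dists = np.linalg.norm(pts[1:] - pts[0], axis=1)
    spread = (dists.max() - dists.min()) / dists.min()
    spreads.append(spread)
    print(f"dim={d:6d}  (max-min)/min distance = {spread:.2f}")
```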
Dimensionality Reduction
Purpose: To address the curse of dimensionality by shrinking the feature space and focusing on the most important features.
Benefit: Helps algorithms work more effectively by reducing sparsity and noise.
Method Example: Principal Component Analysis (PCA) is a common technique used for dimensionality reduction.
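A minimal PCA sketch using only NumPy (scikit-learn's `sklearn.decomposition.PCA` wraps the same idea with more options): center the data, take the SVD, and project onto the top components. The dataset is synthetic, built so three features are really driven by one underlying factor.

```python
import numpy as np

rng = np.random.default_rng(0)
t = rng.normal(size=200)
# Three features that all depend on the single hidden factor t.
X = np.column_stack([t,
                     2 * t + 0.01 * rng.normal(size=200),
                     -t + 0.01 * rng.normal(size=200)])

Xc = X - X.mean(axis=0)              # center each feature
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = S**2 / np.sum(S**2)      # variance share per component
X_reduced = Xc @ Vt[:1].T            # project onto the top component
print(f"variance explained by 1 component: {explained[0]:.4f}")
```

Because one factor generates all three features, a single principal component captures nearly all the variance, shrinking a 3-dimensional space to 1 dimension with almost no information loss.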
Feature Selection
Goal: To identify and select the most important features from a dataset.
Reasons for Importance:
Higher dimensionality causes problems.
Some features may not be correlated to the target variable.
Some features may be redundant (highly correlated with each other).
Some features may correlate with the target by coincidence (spurious correlation) while being genuinely irrelevant.
Benefits:
Simplified models with better performance.
Improved interpretability (easier to understand how the model works).
Easier to debug models.
Methodologies for Feature Selection
Filter Methods:
Approach: Evaluate features based on their characteristics independent of the machine learning model.
Nature: Act as a pre-processing step, without involving the algorithm in the selection phase (e.g., using correlation coefficients, chi-squared tests).
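A minimal filter-method sketch on synthetic data: rank features by absolute correlation with the target and keep those above a threshold, with no model involved. The feature names and the 0.5 cutoff are illustrative choices.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "relevant": rng.normal(size=n),
    "noise":    rng.normal(size=n),   # unrelated to the target
})
df["target"] = 2 * df["relevant"] + 0.1 * rng.normal(size=n)

# Filter step: score each feature by |correlation with target|,
# then keep features above a chosen threshold.
scores = df.drop(columns="target").corrwith(df["target"]).abs()
selected = scores[scores > 0.5].index.tolist()
print(scores.round(3))
print("selected:", selected)
```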
Search Methods (also called Wrapper Methods):
Approach: Involve the machine learning algorithm directly. Subsets of features are created, fed into the model, and the model's performance is evaluated to determine which features to keep.
Challenge: Can be computationally expensive, especially with complex algorithms or deep learning, as the model must be run multiple times.
Types:
Forward Selection:
Starts with an empty set of features.
Adds one feature at a time that provides the most significant improvement in model performance.
Continues adding features until a stopping point (e.g., no further significant improvement, or a set number of features).
Backward Selection:
Starts with all available features.
Removes one feature at a time that leads to the least significant decrease (or an increase) in model performance.
Continues removing features until an optimal subset is found.
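The forward-selection loop above can be sketched compactly. This version scores feature subsets with in-sample R² from a plain least-squares fit (in practice you would evaluate on held-out data), and stops when no candidate improves R² by at least 0.01; both choices are illustrative. Backward selection is the mirror image: start with all features and repeatedly drop the one whose removal hurts least.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300
X = rng.normal(size=(n, 5))
# Only features 0 and 2 actually drive the target.
y = 3 * X[:, 0] - 2 * X[:, 2] + 0.1 * rng.normal(size=n)

def r2(cols):
    """In-sample R^2 of a least-squares fit on the given feature columns."""
    A = np.column_stack([np.ones(n), X[:, cols]]) if cols else np.ones((n, 1))
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

selected, remaining = [], list(range(5))
while remaining:
    # Try adding each remaining feature; keep the best one.
    best = max(remaining, key=lambda j: r2(selected + [j]))
    if r2(selected + [best]) - r2(selected) < 0.01:   # stopping rule
        break
    selected.append(best)
    remaining.remove(best)
print("selected features:", selected)
```

The loop first picks feature 0 (the strongest signal), then feature 2, then stops because the three noise features add nothing, illustrating both the greedy add-one-at-a-time search and why it is expensive: each step refits the model once per candidate feature.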
Embedded Methods:
Approach: Feature selection is built directly into the model construction process. The algorithm itself performs feature selection during training.
Examples:
Lasso Regression (L1 Regularization): A type of linear regression that penalizes the absolute size of the regression coefficients. It can shrink some coefficients exactly to zero, effectively eliminating the corresponding features from the model.
Ridge Regression (L2 Regularization): Another type of linear regression that penalizes the squared size of the regression coefficients. It shrinks coefficients but typically does not set them exactly to zero, so it doesn't perform feature elimination in the same way as Lasso.
Process: These methods penalize coefficients during training; if a coefficient is driven to zero, the corresponding feature is eliminated. Interpreting feature importance becomes more complex, since regularization affects all coefficients' contributions simultaneously.
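The shrinkage behavior can be seen directly with Ridge, which has the closed form w = (XᵀX + αI)⁻¹Xᵀy. The sketch below (synthetic data, illustrative α values) shows all coefficients shrinking toward zero as the penalty α grows, without reaching exactly zero. Lasso's L1 penalty has no closed form (iterative solvers such as `sklearn.linear_model.Lasso` are used) but, unlike Ridge, it can set coefficients exactly to zero and thereby eliminate features.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 3))
y = 5 * X[:, 0] + 0.05 * rng.normal(size=n)   # only feature 0 matters

def ridge(alpha):
    """Closed-form Ridge coefficients: (X^T X + alpha*I)^-1 X^T y."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(3), X.T @ y)

for alpha in (0.0, 10.0, 1000.0):
    print(f"alpha={alpha:7.1f}  coefs={np.round(ridge(alpha), 3)}")
```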