ACCT 331 Lecture 11

Agents and Data Processing Overview

  • Agents: Autonomous entities capable of accessing diverse data sources and executing tasks independently.

  • Maturity Model for Organizations Deploying Agents:

    • 0: Very basic deployment.

    • 5: Extremely advanced deployment.

    • Most organizations are currently at status 1 or 2.

  • Trust in Agents: Noted to be declining, an interesting trend.

  • Internship Recommendation (Bert): A large hedge fund with 20+ global offices, managing $28 billion. Strongly recommended for summer internships.

Weekly Topics and Data Engineering Focus

  • This Week's Focus: Data dictionary, schema, data preparation, exploratory data analysis (EDA), feature engineering, dimensionality, and feature selection.

  • Reading Assignments: Relatively light, covering concepts discussed previously in more depth.

  • Process Flow: Moving into more detailed data engineering, including cleaning, prepping, outlier treatment, and feature engineering.

  • Iterative Process: Data preparation and feature engineering are described as iterative processes.

Data Dictionary

  • Purpose: Provides documentation about data features within a dataframe, describing their meaning, usage, and standard practices.

  • Ensures Consistency: Facilitates understanding and standardization of data, crucial for communication among stakeholders.

  • Reference: Serves as a key reference for developers, data analysts, and other individuals working with the data.

  • Example: An inventory dataset for an online store with features like Item ID, Product Name, Category, Price, etc. A data dictionary explains what each of these features represents.
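A minimal sketch of how such a data dictionary might be kept in code. The feature names and descriptions are hypothetical, standing in for the lecture's online-store example:

```python
# Minimal data dictionary for a hypothetical online-store inventory dataset.
# Feature names and descriptions are illustrative, not from a real dataset.
data_dictionary = {
    "item_id":      {"type": "int",   "description": "Unique identifier for each inventory item"},
    "product_name": {"type": "str",   "description": "Display name of the product"},
    "category":     {"type": "str",   "description": "Merchandising category, e.g. 'Apparel'"},
    "price":        {"type": "float", "description": "Current selling price in USD"},
}

def describe(feature):
    """Look up a feature's documented meaning."""
    entry = data_dictionary.get(feature)
    return entry["description"] if entry else "undocumented feature"

print(describe("price"))  # Current selling price in USD
```

Keeping the dictionary next to the code makes it easy to version-control alongside the pipeline it documents.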

Schema

  • Complements Data Dictionary: Defines the logical structure and relationships within and between datasets.

  • Blueprint: Acts as a blueprint for how data is stored and accessed, similar to an architect's plan for a house.

  • Guidance: Guides operations on data (e.g., if a feature is an integer, arithmetic operations can be performed; if a string, text-based operations like search).

  • Ensures Data Integrity: Helps maintain quality, optimize storage, and improve collaboration among developers.
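One lightweight way to express and check a schema in pandas, as a sketch with made-up columns; a production system might use a database schema or a dedicated validation library instead:

```python
import pandas as pd

# Expected schema: column name -> pandas dtype (columns are illustrative)
expected_schema = {"item_id": "int64", "price": "float64", "category": "object"}

df = pd.DataFrame({
    "item_id": [1, 2, 3],
    "price": [9.99, 4.50, 12.00],
    "category": ["toys", "food", "toys"],
})

# Validate that each column exists and has the expected dtype
for col, dtype in expected_schema.items():
    assert col in df.columns, f"missing column: {col}"
    assert str(df[col].dtype) == dtype, f"{col} is {df[col].dtype}, expected {dtype}"

# The schema tells us which operations are valid:
df["price_with_tax"] = df["price"] * 1.08      # arithmetic on a numeric column
mask = df["category"].str.contains("toy")      # text search on a string column
```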

Documentation Beyond Dictionary and Schema

  • Importance: Essential to document what operations are performed on the data and why, typically within the code (e.g., Python environment).

  • Good Practice: Include notes and explanations in code. Without proper documentation, understanding the data pipeline in the future becomes challenging, especially for new team members.

  • Example Rationale: Explaining why customer data is joined or why an operation is needed.

  • Capstone Project Relevance: Students will create data dictionaries, schemas, and document workflows and code for their capstone projects.

Real-World Example: Demand Function for Optimal Product Margins

  • Objective: To recommend the optimal timing and market penetration percentage for new products to achieve the highest profit margins.

  • Dataset Size: Approximately 1.1 million transactions, 1,100 new and continuing products, various departments, and 520 stores.

  • Data Frames: Five data frames are used:

    • Inventory data by week: 9.6 million data points, 7 features.

    • Sales data: 1.5 million data points, 8 features.

    • Product data.

    • Store data.

    • Calendar data.

  • Data Dictionary Application Example (Sales & Inventory Data Frames):

    • Sales Features: Date (represented by a number), Store Number, Product SKU.

      • Notes: Link to calendar file for date conversion, product SKU doesn't track style/color, requiring matching with product data.

    • Inventory Features: Week ending (inventory taken on Saturdays), Style/Color (specific notes on tracking inventory, pricing analysis level).

Data Preparation Workflow in the Example

  • Schema Mapping: Visual representation of how different data frames (Store, Sales, Product, Inventory, Calendar) are linked for analysis (e.g., mapping Sales to Inventory via Store Number and Product SKU).

  • Merging Datasets: Necessity to merge various datasets (e.g., sales data with product numbers).

  • Creating Markdown Data: Generating new tables from existing markdown (price-reduction) information to analyze demand and marketing.

  • Handling Inconsistencies: Discrepancies in color and style across datasets required identifying the best source for this data.

  • Temporal Discrepancies: Inventory data recorded weekly, while sales data recorded daily, requiring decisions on the analysis timeframe (weekly, monthly, etc.).
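A sketch of the merge-and-align steps above in pandas. All table and column names here are hypothetical stand-ins for the lecture's data frames, and the daily-to-weekly aggregation illustrates one way to resolve the temporal discrepancy:

```python
import pandas as pd

# Hypothetical slices of the sales, product, and calendar data frames
sales = pd.DataFrame({
    "date_num": [101, 101, 102],
    "store_number": [1, 2, 1],
    "product_sku": ["A1", "B2", "A1"],
    "units_sold": [5, 3, 7],
})
products = pd.DataFrame({
    "product_sku": ["A1", "B2"],
    "style": ["basic", "deluxe"],
    "color": ["red", "blue"],
})
calendar = pd.DataFrame({
    "date_num": [101, 102],
    "week_ending": ["2024-01-06", "2024-01-13"],
})

# Merge sales with product attributes, then convert numeric dates via the calendar file
merged = (sales.merge(products, on="product_sku", how="left")
               .merge(calendar, on="date_num", how="left"))

# Aggregate daily sales up to weekly so they line up with the weekly inventory data
weekly = merged.groupby(["week_ending", "product_sku"], as_index=False)["units_sold"].sum()
```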

Data Cleaning and Correction

  • Importance: Addresses issues like human errors (misspellings), improper formatting, missing values, and irrelevant data.

  • Inconsistencies: Standardizing varied entries (e.g., 'apples', 'Apples', 'APPLE') into a consistent format.

  • Missing Data: A common problem; needs careful handling as removal is often not recommended. Imputation (filling missing values) is preferred.

  • Outliers: Extreme values that can significantly impact algorithms (e.g., linear regression). Need to be traced; removal or retention depends on whether they represent legitimate but rare events or errors.

    • Some algorithms (e.g., decision trees) are more robust to outliers.
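The cleaning steps above, sketched in pandas on toy data. The median imputation and 1.5×IQR outlier rule shown are common choices, not the only ones:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "product": ["apples", "Apples", "APPLE", "pears"],
    "price": [1.20, np.nan, 1.25, 2.00],
})

# Standardize inconsistent text entries into one canonical form
df["product"] = df["product"].str.lower().replace({"apple": "apples"})

# Impute the missing price with the median rather than dropping the row
df["price"] = df["price"].fillna(df["price"].median())

# Flag (don't automatically drop) outliers with the 1.5 * IQR rule;
# whether to keep them depends on tracing their cause
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
outlier = (df["price"] < q1 - 1.5 * iqr) | (df["price"] > q3 + 1.5 * iqr)
```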

Encoding Categorical Data

  • Necessity: Computers understand only numbers; categorical data (e.g., 'high', 'medium', 'low') must be converted to numeric representations.

  • One-Hot Encoding and Dummy Variables: Techniques used for this conversion.

    • For categorical data with order (e.g., 'Bachelor's', 'Master's'), direct numeric assignment might retain order.

    • One-hot encoding creates new binary features for each category, which can increase dimensionality.
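Both encoding approaches, sketched in pandas with invented column names:

```python
import pandas as pd

df = pd.DataFrame({
    "priority": ["high", "low", "medium", "low"],
    "education": ["Bachelor's", "Master's", "Bachelor's", "Master's"],
})

# Ordinal category: a direct numeric assignment preserves the order
order = {"low": 0, "medium": 1, "high": 2}
df["priority_num"] = df["priority"].map(order)

# Nominal category: one-hot encoding creates one binary column per level
dummies = pd.get_dummies(df["education"], prefix="edu")
df = pd.concat([df, dummies], axis=1)
```

Note how the one-hot step adds a column per category, which is where the dimensionality increase comes from.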

Feature Engineering: Adding Value as a Professional

  • Definition: Creating new features or transforming existing ones to improve model performance.

  • Benefits: Enhances model accuracy and efficiency, reduces overfitting, and improves model interpretability.

  • Overfitting: Occurs when an algorithm fits training data too closely, failing to generalize well to new, unseen data.

    • Feature selection can mitigate overfitting by reducing the model's opportunities to fit noise.

Feature Creation
  • Concept: Combining multiple features or creating new ones from existing data to derive more useful metrics.

  • Examples:

    • Combining sales price and square footage of a house into a new, more informative feature.

    • For customer purchases prediction: Creating features like 'average purchase value', 'total expenditures', or 'number of purchases'.
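Both examples above can be sketched in pandas on hypothetical toy data:

```python
import pandas as pd

# Customer-purchase example: derive per-customer features from raw transactions
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "amount": [20.0, 40.0, 10.0, 10.0, 10.0],
})
features = transactions.groupby("customer_id")["amount"].agg(
    total_expenditure="sum",
    number_of_purchases="count",
    average_purchase_value="mean",
).reset_index()

# Housing example: combine price and square footage into a more informative ratio
houses = pd.DataFrame({"price": [300_000, 450_000], "sqft": [1_500, 1_800]})
houses["price_per_sqft"] = houses["price"] / houses["sqft"]
```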

Feature Transformation
  • Concept: Modifying original features to make them more suitable for model processing, often to achieve a distribution desirable for many algorithms (e.g., normal distribution).

  • Methods:

    • Scaling: Adjusts the magnitude of data. Preserves data structure but changes range.

      • Normalization (Min-Max Scaling): Converts numbers to a specific range (e.g., [0, 1] or [-1, 1]). Important when features need to contribute equally to a model (e.g., age vs. salary data). Formula: $X_{norm} = \frac{X - X_{min}}{X_{max} - X_{min}}$ (conceptually squishes the data into the target range).

      • Standardization: Transforms data to have a mean of 0 and a standard deviation of 1. More concerned with distribution than range.

    • Log Transformation: Replaces values with their logarithm, compressing large numbers and stretching small ones. Helps transform highly skewed data into a more normal-like distribution.

    • Binning: Divides continuous features into discrete intervals or bins.

      • Example: Grouping ages into 'young', 'middle-aged', 'senior'. Reduces complexity and enables finding patterns.

    • Principal Component Analysis (PCA): A dimensionality reduction technique that transforms features into a smaller set of uncorrelated components.

      • Example: Reducing 'height' and 'weight' (often highly correlated) into a single principal component that captures most of their variance.

    • Rank Transformation: Assigns ranks to feature values instead of using raw values.

      • Purpose: Reduces the influence of outliers or extreme values and simplifies relationships between variables (e.g., ranking student scores instead of using raw scores).
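Most of the transformations above can be sketched with pandas and NumPy on toy data. PCA is omitted here since it typically comes from a library such as scikit-learn:

```python
import numpy as np
import pandas as pd

ages = pd.Series([18, 25, 40, 62, 90], dtype=float)

# Min-max normalization: squish values into [0, 1]
norm = (ages - ages.min()) / (ages.max() - ages.min())

# Standardization: rescale to mean 0 and standard deviation 1
std = (ages - ages.mean()) / ages.std()

# Log transformation: compress a skewed variable (log1p handles zeros safely)
incomes = pd.Series([20_000, 35_000, 50_000, 1_000_000], dtype=float)
log_incomes = np.log1p(incomes)

# Binning: continuous ages into labeled intervals
bins = pd.cut(ages, bins=[0, 30, 60, 120],
              labels=["young", "middle-aged", "senior"])

# Rank transformation: positions instead of raw values, muting the outlier income
ranks = incomes.rank()
```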

Feature Selection

  • Goal: Choosing only the most relevant features to improve model performance and reduce overfitting.

Special Cases / Future Topics

  • Imbalanced Datasets: Briefly touched upon with a fraud detection example, where 'no fraud' cases might be 'undersampled' to balance the dataset.

  • Next Class: Will delve into Exploratory Data Analysis (EDA).
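The undersampling idea mentioned for imbalanced datasets, sketched with toy fraud data (the class sizes are invented for illustration):

```python
import pandas as pd

# Hypothetical imbalanced fraud data: 5 fraud rows vs. 95 legitimate rows
df = pd.DataFrame({"is_fraud": [1] * 5 + [0] * 95})

fraud = df[df["is_fraud"] == 1]
no_fraud = df[df["is_fraud"] == 0]

# Undersample the majority 'no fraud' class down to the minority class size
no_fraud_sampled = no_fraud.sample(n=len(fraud), random_state=0)
balanced = pd.concat([fraud, no_fraud_sampled])
```

Undersampling discards majority-class information, so oversampling the minority class is a common alternative.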
