Exploring and Pre-Processing Data
ADM3308: Business Data Mining - Exploring and Pre-Processing Data
Course Overview
Course Code: ADM3308
Focus Area: Business Data Mining
Institution: Telfer School of Management, University of Ottawa
Variable Types
The course discusses different types of variables that are essential for data mining.
Categorical Variables (Nominal):
No inherent order
Example: colors, customer types
Ordered Variables (Ordinal):
Can be rated in order but do not quantify the differences
Example: grades, seniority levels
Interval Variables:
Differences are meaningful but no true zero exists
Example: days of the year
True Numeric Variables (Continuous):
True measurement with a meaningful zero point
Example: height, weight
Structure of Data
Rows:
Called records, data points, instances, or cases (e.g., individual customers)
Columns:
Also called fields, attributes, features or dimensions (e.g., Age, Income)
Data Mining Process Model (CRISP-DM)
Business Understanding
Data Understanding
Data Preparation
Modeling
Evaluation
Deployment
Variable Handling
Numeric Data:
Most algorithms can handle numeric inputs with occasional need for binning into categories.
Categorical Data:
Used as is for Naïve Bayes; converted to binary numbers for other algorithms.
Unique Value Columns
Use of columns such as Customer ID, Telephone number, Address, and Zip code is limited in mining.
They may not carry valuable algorithmic information but sometimes can yield geographical insights.
Derived Variables
Involves results of calculations such as total sales (e.g., Total Sales = Unit Price × Quantity).
Unary and Almost-Unary Columns
Unary Columns: Columns with only one unique value; should be ignored as they provide no information.
Almost-Unary Columns:
Generally lack diversity (95-99% same value); ignore unless understanding their significance is necessary.
Basic Statistical Measures
Discrete Values:
Represented through plots: line graphs, bar charts, scatterplots, and distributions like histograms.
Continuous Values:
Mean, median, mode, range, variance, standard deviation; visualized using boxplots and correlation measures.
Key Statistical Definitions
Mean: Average calculated as ext{Ave} = \frac{x1 + x2 + … + x_n}{n}
Median: Midpoint value in an ordered list (50% above, 50% below).
Mode: Most frequently occurring value in a dataset.
Range: Difference between minimum and maximum observations, calculated as \text{Range} = \text{Max} - \text{Min}.
Variance and Standard Deviation
Variance: Measure of data dispersion around the mean.
Standard Deviation: Square root of the variance, conveying how data points cluster around the mean.
Boxplots
Effective for visualizing outliers alongside overall distribution of data.
Outliers defined as those exceeding Q3 + 1.5(Q3 - Q_1).
Correlation
Measures the relationship between two variables:
Correlation coefficient r must lie in the range of [-1, +1].
Positive correlation implies that as one variable increases, so does the other. Negative correlation indicates they move in opposite directions.
Data Quality Assessment
Addresses issues like missing values, erroneous values, and inconsistencies across different datasets.
Categories include:
Missing Values: Represented by spaces or specific indicators such as "?" or "UNKNOWN" in categorical values.
Erroneous Values: Default values or invalid entries, such as negative ages or income entries.
Inconsistent Values: Variations due to cross-departmental entries, leading to discrepancies.
Data Pre-Processing Techniques
Steps necessary to prepare data include handling missing values, normalizing, binning data, and removing outliers.
Handling Missing Values:
Options: deleting records, averaging values, or utilizing k-nearest neighbors imputation methods.
Binning Data:
Two methods: equal-width (equal intervals) and equal height (equal number of cases)
Example provided demonstrates both methods using a dataset.
Normalization:
Transform data between defined ranges, using formulas such as:
For normalization: x_n = \frac{X - \text{Min}}{\text{Max} - \text{Min}}.
Standardization:
Standardizing using: z_s = \frac{X - \text{Ave}}{\text{STD}}.
Outlier Detection
Outliers: Values that significantly differ from the rest of the data, defined by being beyond \pm3\sigma (or \pm5\sigma).
Distinction made between outliers and anomalies based on context of data analysis (e.g., fraud detection).
Balancing Data
Critical when dealing with imbalanced datasets, where one class could dominate predictions.
Techniques include under-sampling the majority, over-sampling the minority, or utilizing SMOTE techniques.
References
Data Mining Techniques for Marketing: Linoff & Berry, 2011
Machine Learning for Business Analytics: Shmueli et al., 2023