Exploring and Pre-Processing Data

ADM3308: Business Data Mining - Exploring and Pre-Processing Data

Course Overview

Course Code: ADM3308
Focus Area: Business Data Mining
Institution: Telfer School of Management, University of Ottawa

Variable Types

The course discusses different types of variables that are essential for data mining.
- Categorical Variables (Nominal):
  - No inherent order
  - Example: colors, customer types
- Ordered Variables (Ordinal):
  - Can be ranked, but the differences between ranks are not quantified
  - Example: grades, seniority levels
- Interval Variables:
  - Differences are meaningful, but no true zero exists
  - Example: days of the year
- True Numeric Variables (Continuous):
  - True measurement with a meaningful zero point
  - Example: height, weight

Structure of Data

Rows:
- Called records, data points, instances, or cases (e.g., individual customers)
Columns:
- Also called fields, attributes, features or dimensions (e.g., Age, Income)

Data Mining Process Model (CRISP-DM)

The six phases, in order:
- Business Understanding
- Data Understanding
- Data Preparation
- Modeling
- Evaluation
- Deployment

Variable Handling

Numeric Data:
- Most algorithms accept numeric inputs directly; occasionally the values need to be binned into categories.
Categorical Data:
- Used as is by Naïve Bayes; converted to binary (0/1) dummy variables for most other algorithms.
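
The conversion of a categorical column into binary columns can be sketched as follows. This is a minimal illustration in plain Python; the `one_hot` helper, the `is_` column prefix, and the sample customer types are assumptions made for the example, not part of the course material.

```python
def one_hot(values):
    """Convert a list of category labels into binary (0/1) indicator columns."""
    categories = sorted(set(values))  # fixed, reproducible column order
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]


# Hypothetical customer-type column
customer_types = ["retail", "wholesale", "retail", "online"]
encoded = one_hot(customer_types)

for label, row in zip(customer_types, encoded):
    print(label, row)
```

Each record ends up with exactly one indicator set to 1, so the original category can always be recovered from the binary columns.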

Unique Value Columns

Columns such as Customer ID, telephone number, address, and zip code are of limited use in mining.
Because their values are unique (or nearly unique) to each record, they carry little information for algorithms, although address or zip code can sometimes yield geographical insights.

Derived Variables

Derived variables are the results of calculations on other columns, such as total sales (e.g., Total Sales = Unit Price × Quantity).
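
As a small sketch of the total sales example, a derived column can be added to each record as shown here. The field names (`unit_price`, `quantity`, `total_sales`) and the order values are hypothetical.

```python
# Hypothetical order records
orders = [
    {"unit_price": 4.50, "quantity": 10},
    {"unit_price": 12.00, "quantity": 3},
]

# Derived variable: Total Sales = Unit Price x Quantity
for order in orders:
    order["total_sales"] = order["unit_price"] * order["quantity"]

print([order["total_sales"] for order in orders])  # [45.0, 36.0]
```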

Unary and Almost-Unary Columns

Unary Columns: Columns with only one unique value; they should be ignored because they provide no information.
Almost-Unary Columns:
- Nearly all records (95-99%) share the same value; ignore them unless there is a specific reason to understand the rare values.
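
One possible way to flag such columns is to measure the share of rows taken by the most common value. The sketch below assumes a 95% cut-off (the lower end of the range quoted above); the function names and sample columns are made up for illustration.

```python
from collections import Counter


def dominant_share(column):
    """Return the fraction of rows taken by the most common value."""
    counts = Counter(column)
    return counts.most_common(1)[0][1] / len(column)


def classify_column(column, threshold=0.95):
    """Label a column as unary, almost-unary, or informative (threshold is an assumption)."""
    share = dominant_share(column)
    if share == 1.0:
        return "unary"
    if share >= threshold:
        return "almost-unary"
    return "informative"


# Hypothetical columns
country = ["CA"] * 100
newsletter = ["no"] * 97 + ["yes"] * 3
region = ["east"] * 40 + ["west"] * 60

for name, col in [("country", country), ("newsletter", newsletter), ("region", region)]:
    print(name, classify_column(col))
```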

Basic Statistical Measures

Discrete Values:
- Visualized with plots such as line graphs, bar charts, and scatterplots, and with distributions such as histograms.
Continuous Values:
- Summarized by mean, median, mode, range, variance, and standard deviation; explored with boxplots and correlation measures.

Key Statistical Definitions

Mean: Average, calculated as \text{Ave} = \frac{x_1 + x_2 + \dots + x_n}{n}
Median: Midpoint value in an ordered list (50% above, 50% below).
Mode: Most frequently occurring value in a dataset.
Range: Difference between minimum and maximum observations, calculated as \text{Range} = \text{Max} - \text{Min}.
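
The four measures above can be checked on a small sample with Python's standard `statistics` module. The income values (in thousands) are invented for the example.

```python
import statistics

# Hypothetical incomes, in thousands
incomes = [42, 55, 55, 61, 87]

mean = statistics.mean(incomes)            # (42 + 55 + 55 + 61 + 87) / 5 = 60
median = statistics.median(incomes)        # middle value of the ordered list = 55
mode = statistics.mode(incomes)            # most frequent value = 55
value_range = max(incomes) - min(incomes)  # 87 - 42 = 45

print(mean, median, mode, value_range)
```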

Variance and Standard Deviation

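Variance: Measure of data dispersion around the mean, computed as the average of the squared deviations from the mean.
Standard Deviation: Square root of the variance, expressed in the same units as the data; it conveys how tightly data points cluster around the mean.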
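
A short sketch of both measures using the standard `statistics` module. The notes do not say whether the population form (divide by n) or the sample form (divide by n - 1) is intended, so both are shown; the data values are made up.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # mean = 5, squared deviations sum to 32

pop_variance = statistics.pvariance(data)    # 32 / 8 = 4
pop_std = statistics.pstdev(data)            # sqrt(4) = 2.0
sample_variance = statistics.variance(data)  # 32 / 7
sample_std = statistics.stdev(data)          # sqrt(32 / 7)

print(pop_variance, pop_std)
print(sample_variance, sample_std)
```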

Boxplots

Effective for visualizing outliers alongside the overall distribution of the data.
- Outliers are values above Q_3 + 1.5(Q_3 - Q_1) or below Q_1 - 1.5(Q_3 - Q_1), where Q_1 and Q_3 are the first and third quartiles.
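
The boxplot rule can be sketched as below. Quartile conventions differ between tools; this example assumes the "inclusive" method of `statistics.quantiles`, so the fences may differ slightly from those produced by other software. The helper name and data are illustrative only.

```python
import statistics


def iqr_fences(values, k=1.5):
    """Return the (lower, upper) boxplot fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


data = [10, 12, 13, 14, 15, 16, 18, 19, 50]
lower, upper = iqr_fences(data)  # Q1 = 13, Q3 = 18, IQR = 5
outliers = [v for v in data if v < lower or v > upper]

print(lower, upper, outliers)  # 5.5 25.5 [50]
```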

Correlation

Measures the strength and direction of the linear relationship between two variables:
- The correlation coefficient r always lies in the range [-1, +1].
- Positive correlation implies that as one variable increases, so does the other. Negative correlation indicates they move in opposite directions. Values near 0 indicate little or no linear relationship.
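
A minimal sketch of the Pearson correlation coefficient, written out by hand so the calculation is visible. The notes do not name a specific coefficient, so Pearson's r is an assumption here, as are the variable names and data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)


# Hypothetical data
ad_spend = [1, 2, 3, 4, 5]
sales = [2, 4, 6, 8, 10]     # rises with ad_spend
returns = [10, 8, 6, 4, 2]   # falls as ad_spend rises

print(pearson_r(ad_spend, sales))    # perfect positive correlation
print(pearson_r(ad_spend, returns))  # perfect negative correlation
```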

Data Quality Assessment

Addresses issues like missing values, erroneous values, and inconsistencies across different datasets.
- Categories include:
  - Missing Values: Represented by spaces or specific indicators such as "?" or "UNKNOWN" in categorical values.
  - Erroneous Values: Default values or invalid entries, such as negative ages or negative incomes.
  - Inconsistent Values: Variations due to cross-departmental entries, leading to discrepancies.
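
A simple audit for the first two categories might look like the sketch below. The set of missing-value markers follows the indicators listed above; the `audit` function, the field names, and the records are hypothetical.

```python
# Markers treated as missing (blank strings, "?", "UNKNOWN")
MISSING_MARKERS = {"", "?", "UNKNOWN"}


def is_missing(value):
    """True for None, blank strings, or one of the known missing-value markers."""
    return value is None or str(value).strip().upper() in MISSING_MARKERS


def audit(records):
    """Return (row index, field, issue) tuples for missing or invalid values."""
    issues = []
    for i, rec in enumerate(records):
        for field, value in rec.items():
            if is_missing(value):
                issues.append((i, field, "missing"))
        age = rec.get("age")
        if isinstance(age, (int, float)) and age < 0:
            issues.append((i, "age", "erroneous"))  # a negative age is invalid
    return issues


# Hypothetical customer records
records = [
    {"age": 34, "segment": "retail"},
    {"age": -5, "segment": "?"},
    {"age": None, "segment": "UNKNOWN"},
    {"age": 41, "segment": " "},
]

issues = audit(records)
for issue in issues:
    print(issue)
```

Inconsistent values across departments usually need domain knowledge to reconcile, so they are not covered by this sketch.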

Data Pre-Processing Techniques

Steps necessary to prepare data include handling missing values, normalizing, binning data, and removing outliers.
- Handling Missing Values:
  - Options: deleting records, replacing missing values with the average, or imputing with k-nearest neighbors.
- Binning Data:
  - Two methods: equal-width (equal intervals) and equal-height (equal number of cases per bin)
  - Example provided demonstrates both methods using a dataset.
- Normalization:
  - Rescales values to the range [0, 1]: x_n = \frac{X - \text{Min}}{\text{Max} - \text{Min}}.
- Standardization:
  - Rescales values to zero mean and unit standard deviation: z_s = \frac{X - \text{Ave}}{\text{STD}}.
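
The techniques above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the course's own example: mean imputation is shown rather than k-nearest neighbors, STD is taken as the population standard deviation, the column is assumed to contain at least two distinct values, and all function names and the `ages` data are made up.

```python
import statistics


def impute_mean(values):
    """Replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]


def equal_width_bins(values, k):
    """Assign each value to one of k bins of equal interval width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # The maximum value is clamped into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]


def equal_height_bins(values, k):
    """Assign each value to one of k bins holding (nearly) equal numbers of cases."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins


def min_max(values):
    """Normalization: x_n = (X - Min) / (Max - Min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def z_scores(values):
    """Standardization: z_s = (X - Ave) / STD, using the population STD."""
    ave = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - ave) / std for v in values]


ages = [20, 22, 25, None, 30, 35, 60]
ages_filled = impute_mean(ages)             # None becomes 32, the observed mean
width_bins = equal_width_bins(ages_filled, 3)
height_bins = equal_height_bins(ages_filled, 3)
scaled = min_max(ages_filled)
standardized = z_scores(ages_filled)

print(ages_filled)
print(width_bins)   # [0, 0, 0, 0, 0, 1, 2]
print(height_bins)  # [0, 0, 0, 1, 1, 2, 2]
```

On this skewed sample, equal-width binning puts five of the seven cases into the first bin, while equal-height binning spreads them almost evenly. Note that this simple rank-based version of equal-height binning can split tied values across adjacent bins.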

Outlier Detection

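Outliers: Values that differ significantly from the rest of the data, commonly defined as lying more than three standard deviations from the mean (beyond \pm3\sigma), or beyond \pm5\sigma under a stricter rule.
- Outliers and anomalies are distinguished by the context of the analysis (e.g., in fraud detection, unusual values may be the cases of interest rather than noise to remove).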
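
The sigma rule can be sketched as below, assuming the mean and the population standard deviation are computed on the full column, including any outliers. The function name and transaction amounts are hypothetical.

```python
import statistics


def sigma_outliers(values, k=3.0):
    """Return the values lying more than k standard deviations from the mean."""
    ave = statistics.mean(values)
    std = statistics.pstdev(values)
    return [v for v in values if abs(v - ave) > k * std]


# Hypothetical transaction amounts: 20 ordinary values plus one extreme value
amounts = [100, 102, 98, 101, 99] * 4 + [500]

print(sigma_outliers(amounts))         # flagged at 3 sigma: [500]
print(sigma_outliers(amounts, k=5.0))  # nothing flagged at 5 sigma: []
```

Because an extreme value inflates the standard deviation it is measured against, very small samples may never produce a value beyond 3 sigma; the rule is more reliable on larger datasets.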

Balancing Data

Critical when dealing with imbalanced datasets, where one class could dominate predictions.
- Techniques include under-sampling the majority class, over-sampling the minority class, or generating synthetic minority cases with SMOTE (Synthetic Minority Over-sampling Technique).
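
The first two techniques can be sketched with simple random sampling, as shown here. SMOTE is not implemented: it creates new synthetic minority records rather than duplicating existing ones and is normally taken from a dedicated library. The function names, the `fraud` label, and the 95/5 class split are assumptions for the example.

```python
import random
from collections import Counter


def group_by_class(records, label_key):
    """Group records into lists keyed by their class label."""
    groups = {}
    for rec in records:
        groups.setdefault(rec[label_key], []).append(rec)
    return groups


def random_oversample(records, label_key, seed=0):
    """Duplicate randomly chosen records until every class matches the largest one."""
    rng = random.Random(seed)
    groups = group_by_class(records, label_key)
    target = max(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        balanced.extend(group + extra)
    return balanced


def random_undersample(records, label_key, seed=0):
    """Keep a random subset of each class, sized to match the smallest one."""
    rng = random.Random(seed)
    groups = group_by_class(records, label_key)
    target = min(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, target))
    return balanced


# Hypothetical imbalanced dataset: 95 legitimate records, 5 fraudulent
records = [{"id": i, "fraud": int(i >= 95)} for i in range(100)]

over = random_oversample(records, "fraud")
under = random_undersample(records, "fraud")

print(Counter(r["fraud"] for r in over))   # 95 records in each class
print(Counter(r["fraud"] for r in under))  # 5 records in each class
```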
