Exploring and Pre-Processing Data


Course Overview

  • Course Code: ADM3308

  • Focus Area: Business Data Mining

  • Institution: Telfer School of Management, University of Ottawa


Variable Types

  • The course discusses different types of variables that are essential for data mining.

    • Categorical Variables (Nominal):

    • No inherent order

    • Example: colors, customer types

    • Ordered Variables (Ordinal):

    • Can be rated in order but do not quantify the differences

    • Example: grades, seniority levels

    • Interval Variables:

    • Differences are meaningful but no true zero exists

    • Example: days of the year

    • True Numeric Variables (Continuous):

    • True measurement with a meaningful zero point

    • Example: height, weight


Structure of Data

  • Rows:

    • Called records, data points, instances, or cases (e.g., individual customers)

  • Columns:

    • Also called fields, attributes, features or dimensions (e.g., Age, Income)


Data Mining Process Model (CRISP-DM)

  • Business Understanding

  • Data Understanding

  • Data Preparation

  • Modeling

  • Evaluation

  • Deployment


Variable Handling

  • Numeric Data:

    • Most algorithms can handle numeric inputs with occasional need for binning into categories.

  • Categorical Data:

    • Used as-is by Naïve Bayes; converted to binary (dummy) variables for most other algorithms.
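As an illustration, the conversion of a categorical column to binary indicator columns (one-hot encoding) can be sketched in plain Python; the column values below are made up:

```python
# One-hot encode a categorical column: each distinct value becomes
# its own 0/1 indicator column (hypothetical customer types).
def one_hot(values):
    categories = sorted(set(values))
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]

rows = one_hot(["retail", "wholesale", "retail"])
print(rows[0])  # {'is_retail': 1, 'is_wholesale': 0}
```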


Unique Value Columns

  • Columns such as Customer ID, Telephone number, Address, and Zip code have limited use in mining.

  • Because they are (nearly) unique per record, they carry little algorithmic information, though Zip codes can sometimes yield geographic insights.


Derived Variables

  • Involves results of calculations such as total sales (e.g., Total Sales = Unit Price × Quantity).
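A derived-variable calculation is a simple row-wise computation; the field names below are illustrative:

```python
# Derived variable: Total Sales = Unit Price x Quantity (sample records).
records = [{"unit_price": 2.5, "quantity": 4},
           {"unit_price": 10.0, "quantity": 3}]
for r in records:
    r["total_sales"] = r["unit_price"] * r["quantity"]
print([r["total_sales"] for r in records])  # [10.0, 30.0]
```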


Unary and Almost-Unary Columns

  • Unary Columns: Columns with only one unique value; should be ignored as they provide no information.

  • Almost-Unary Columns:

    • Columns where the vast majority of values (e.g., 95-99%) are identical; usually ignored unless the rare values are known to be significant.
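A quick way to spot unary and almost-unary columns is to check the share of the most common value; the 95% threshold below is just one reasonable cut-off:

```python
from collections import Counter

# Share of the most frequent value in a column: 1.0 means unary,
# values near 1.0 (e.g., >= 0.95) flag almost-unary columns.
def dominant_share(values):
    top_count = Counter(values).most_common(1)[0][1]
    return top_count / len(values)

col = ["CA"] * 19 + ["US"]        # 95% identical
print(dominant_share(col))         # 0.95
print(dominant_share(["x"] * 5))   # 1.0 -> unary, drop it
```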


Basic Statistical Measures

  • Discrete Values:

    • Visualized with line graphs, bar charts, scatterplots, and distribution plots such as histograms.

  • Continuous Values:

    • Mean, median, mode, range, variance, standard deviation; visualized using boxplots and correlation measures.


Key Statistical Definitions

  • Mean: Average calculated as \text{Ave} = \frac{x_1 + x_2 + \dots + x_n}{n}.

  • Median: Midpoint value in an ordered list (50% above, 50% below).

  • Mode: Most frequently occurring value in a dataset.

  • Range: Difference between minimum and maximum observations, calculated as \text{Range} = \text{Max} - \text{Min}.
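These measures can be computed directly with Python's standard library; the data values are illustrative:

```python
import statistics

# Basic measures of location and spread on a small sample.
data = [2, 4, 4, 5, 7, 9]
print(statistics.mean(data))    # about 5.17
print(statistics.median(data))  # 4.5 (midpoint of 4 and 5)
print(statistics.mode(data))    # 4 (most frequent value)
print(max(data) - min(data))    # 7 (Range = Max - Min)
```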


Variance and Standard Deviation

  • Variance: Measure of data dispersion around the mean.

  • Standard Deviation: Square root of the variance, conveying how data points cluster around the mean.
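A small check of both definitions; the eight values are a textbook-style example chosen so the numbers come out whole:

```python
import statistics

# Population variance (mean of squared deviations from the mean)
# and standard deviation (its square root).
data = [2, 4, 4, 4, 5, 5, 7, 9]   # mean is 5
var = statistics.pvariance(data)
std = statistics.pstdev(data)
print(var, std)  # 4.0 2.0
```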


Boxplots

  • Effective for visualizing outliers alongside overall distribution of data.

    • Outliers defined as values above Q_3 + 1.5(Q_3 - Q_1) or below Q_1 - 1.5(Q_3 - Q_1).
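The boxplot rule can be sketched as follows; note that quartile conventions vary slightly between tools, so fence values can differ at the margins:

```python
import statistics

# Flag outliers with the boxplot rule: beyond 1.5 * IQR past Q1 or Q3.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
q1, _, q3 = statistics.quantiles(data, n=4)   # quartile cut points
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < low or x > high]
print(outliers)  # [100]
```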


Correlation

  • Measures the relationship between two variables:

    • Correlation coefficient r must lie in the range of [-1, +1].

    • Positive correlation implies that as one variable increases, so does the other. Negative correlation indicates they move in opposite directions.
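Pearson's r can be computed directly from its definition (covariance over the product of the standard deviations); the data below are illustrative:

```python
import math

# Pearson correlation coefficient r, always in [-1, +1].
def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))   # ~1.0  (perfect positive)
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))   # ~-1.0 (perfect negative)
```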


Data Quality Assessment

  • Addresses issues like missing values, erroneous values, and inconsistencies across different datasets.

    • Categories include:

    • Missing Values: Represented by spaces or specific indicators such as "?" or "UNKNOWN" in categorical values.

    • Erroneous Values: Default values or invalid entries, such as negative ages or income entries.

    • Inconsistent Values: Variations due to cross-departmental entries, leading to discrepancies.


Data Pre-Processing Techniques

  • Steps necessary to prepare data include handling missing values, normalizing, binning data, and removing outliers.

    • Handling Missing Values:

    • Options: deleting records with missing values, imputing the mean, or using k-nearest-neighbors imputation.

    • Binning Data:

    • Two methods: equal-width (bins span equal intervals) and equal-height (bins contain equal numbers of cases).

    • The lecture demonstrates both methods on a sample dataset.

    • Normalization:

    • Transform data between defined ranges, using formulas such as:

      • Min-max normalization: x_{\text{norm}} = \frac{X - \text{Min}}{\text{Max} - \text{Min}}.

    • Standardization:

    • Standardizing using: z = \frac{X - \text{Ave}}{\text{STD}}.
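A minimal stdlib sketch of these steps on made-up numbers, using mean imputation in place of the k-NN option and three equal-width bins:

```python
import statistics

raw = [20, None, 40, 60, None, 80, 100]

# 1) Missing values: impute with the mean of the observed values.
observed = [x for x in raw if x is not None]
mean = statistics.mean(observed)                 # 60.0
data = [mean if x is None else x for x in raw]

# 2) Equal-width binning into 3 bins over [min, max].
lo, hi = min(data), max(data)
width = (hi - lo) / 3
bins = [min(int((x - lo) // width), 2) for x in data]

# 3) Min-max normalization into [0, 1].
norm = [(x - lo) / (hi - lo) for x in data]

# 4) Standardization: z = (X - Ave) / STD.
ave, std = statistics.mean(data), statistics.pstdev(data)
z = [(x - ave) / std for x in data]

print(bins)                # [0, 1, 0, 1, 1, 2, 2]
print(norm[0], norm[-1])   # 0.0 1.0
print(round(z[0], 2))      # -1.67
```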


Outlier Detection

  • Outliers: Values that significantly differ from the rest of the data, defined by being beyond \pm3\sigma (or \pm5\sigma).

    • Distinction made between outliers and anomalies based on context of data analysis (e.g., fraud detection).
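The \pm3\sigma rule as a stdlib sketch; the data are illustrative, with a single extreme value among twenty ordinary ones:

```python
import statistics

# Z-score rule: flag values more than 3 standard deviations from the mean.
data = [10, 12, 11, 13, 12, 11, 10, 13, 12, 11] * 2 + [95]
ave, std = statistics.mean(data), statistics.pstdev(data)
anomalies = [x for x in data if abs(x - ave) > 3 * std]
print(anomalies)  # [95]
```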


Balancing Data

  • Critical when dealing with imbalanced datasets, where one class could dominate predictions.

    • Techniques include under-sampling the majority, over-sampling the minority, or utilizing SMOTE techniques.
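Random over-sampling of the minority class can be sketched with the standard library; SMOTE itself, which synthesizes new minority points by interpolating between neighbors, is provided by the third-party imbalanced-learn package and is not shown here:

```python
import random

# Over-sample the minority class by resampling with replacement
# until the two classes are the same size (labels: 0 = majority).
random.seed(0)
majority = [("a", 0)] * 95
minority = [("b", 1)] * 5
extra = random.choices(minority, k=len(majority) - len(minority))
balanced = majority + minority + extra

print(len(balanced), sum(label for _, label in balanced))  # 190 95
```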

