Data Mining Flashcards

Data Mining

Overview

  • The field of data mining is vast, and this introduction only scratches the surface.

  • Entire courses and textbooks are dedicated to the subject.

Learning Objectives

  • Understand the definition and tasks associated with data mining.

  • Understand the concept of overfitting and how to identify it.

  • Interpret simple classification trees.

  • Calculate sensitivity and specificity using a confusion matrix.

Topics

  • What is Data Mining?

    • Tasks

      • Prediction

      • Classification

      • Clustering

      • Market Basket Analysis

    • Overfitting

    • Data partitioning: training vs. testing data

    • Evaluation

      • Confusion matrices

      • Overall accuracy

      • Specificity and sensitivity

    • Methods

      • Logistic regression

      • Classification trees

Data Mining Defined

  • Data mining involves methods that discover patterns, trends, and relationships within data, especially those that are non-obvious and unexpected.

  • Key definitions:

    • Data lake: A repository of raw, often unstructured, data stored in its original format.

    • Data warehouse: A structured database designed for studying data patterns from multiple sources.

    • Data mart: A smaller-scale data warehouse specific to a particular part of an organization.

Data Mining vs. Related Fields

  • Statistics: Focuses on inference from a sample to the population (for example, estimating a population average) and fits a global structure to the data.

  • Machine Learning: Uses algorithms that learn directly from data, especially local patterns, often in an iterative way. Often grouped under the broader label of artificial intelligence.

  • Data Science: Encompasses statistics, machine learning, and data mining.

Applications

  • Credit scoring: Predicting repayment behaviour.

  • Future purchases: Predicting what a customer will purchase next and tailoring promotions accordingly.

  • Tax evasion: The IRS is reported to have significantly increased its detection rate of tax evasion by using predictive models.

  • Customer turnover: Telenor, a Norwegian phone company, reduced customer turnover by predicting which customers would leave and proactively reaching out to them.

  • Insurance risk: Allstate improved the accuracy of predictions for injury liability claims by incorporating more information about vehicle types.

  • User engagement: OKCupid employs statistical models to predict the types of messages or content that are most likely to receive a response.

Common Software Packages

  • Paid:

    • SQL Server Analysis Services (SSAS)

    • Microsoft Azure

    • IBM SPSS

  • Free & open source:

    • Python

    • R

    • Orange (a graphical user interface built on Python; it handles only relatively small datasets)

Types of Tasks

  • Classification:

    • Predicting the class (or category) to which a record belongs.

    • Example: Determining whether a prospect will convert to a customer.

  • Prediction:

    • Predicting the value of a continuous variable.

    • Multiple regression is the most commonly used method.

  • Cluster analysis:

    • Separating data into groups such that records are similar within a group and different across groups.

  • Market basket analysis:

    • Identifying items that are often purchased together.
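As a rough illustration of market basket analysis, the sketch below counts how often each pair of items appears in the same basket. The transactions and the helper name `pair_counts` are made up for this example; real tools use more elaborate association-rule algorithms.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions; each set holds the items in one basket.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "chips"},
    {"bread", "butter"},
    {"beer", "chips", "milk"},
]

def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        # sorted() gives every pair one canonical order, e.g. ("bread", "milk").
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts

counts = pair_counts(baskets)

# Support of a pair = share of all baskets that contain both items.
support = {pair: n / len(baskets) for pair, n in counts.items()}

for pair, n in counts.most_common(3):
    print(pair, n, support[pair])
```

Pairs with high support are candidates for "often purchased together"; with real data the number of baskets would be far larger.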

Classification in Detail

  • Classification involves determining the class to which a record belongs.

  • It can be used for any number of classes, but examples often focus on two classes.

  • Examples:

    • Predicting customer churn (whether a customer will leave in the next 6 months).

    • Determining whether a person will respond to a promotional offer.

    • Identifying fraudulent transactions.

    • Predicting business bankruptcy.

    • Forecasting supplier disruptions.

Overfitting

  • Overfitting occurs when a model fits the training data too well and does not generalize well to new data.

  • The model recreates noise and patterns specific to the training data that are not generalizable.
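One way to see overfitting is to compare a model that memorizes the training data with how it does on fresh data. The sketch below is an assumption-laden toy: the data generator, the 20% label noise, and the nearest-neighbor "memorizer" are all invented for illustration and are not part of these notes.

```python
import random

random.seed(0)

def make_data(n):
    """Hypothetical data: true class is 1 when x > 0.5, with about 20% of labels flipped."""
    data = []
    for _ in range(n):
        x = random.random()
        label = int(x > 0.5)
        if random.random() < 0.2:  # noise: flip the label
            label = 1 - label
        data.append((x, label))
    return data

train = make_data(200)
test = make_data(200)

def predict_memorizer(x):
    """Copy the label of the closest training record (this reproduces the noise too)."""
    nearest = min(train, key=lambda rec: abs(rec[0] - x))
    return nearest[1]

def accuracy(model, data):
    """Share of records whose predicted class equals the actual class."""
    return sum(model(x) == y for x, y in data) / len(data)

train_acc = accuracy(predict_memorizer, train)
test_acc = accuracy(predict_memorizer, test)
print("training accuracy:", train_acc)
print("testing accuracy: ", test_acc)
```

A large gap between training and testing accuracy is the usual sign that the model has learned noise specific to the training records.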

Data Partitioning

  • To guard against overfitting, build the model on one portion of the data and test it on another, so that performance is measured on records the model has not seen.

  • Training Data: Used to build the model (typically 70-80% of the original data).

  • Testing Data: Used to evaluate the model (typically 20-30% of the original data).
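A minimal sketch of a random 70/30 partition, using only the standard library. The function name `partition`, the seed, and the stand-in records are assumptions for illustration; data mining packages provide their own splitting utilities.

```python
import random

def partition(records, train_fraction=0.7, seed=42):
    """Shuffle a copy of the records, then split it into (training, testing)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

records = list(range(100))  # stand-in for 100 data records
train, test = partition(records, train_fraction=0.7)
print(len(train), "training records,", len(test), "testing records")
```

Shuffling before cutting matters when the original file is sorted (for example by date or by class); otherwise the two partitions may not resemble each other.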

Model Evaluation

  • Methods to evaluate the quality of a classification model:

    • Confusion matrix

    • Overall accuracy

    • Sensitivity & specificity

Confusion Matrix

  • A confusion matrix organizes the counts of records by predicted class and actual class.

  • It is used to calculate other evaluation measures.
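The sketch below builds a two-class confusion matrix by counting (actual, predicted) pairs. The ten labels are hypothetical, and coding the important class as 1 is an assumption of this example.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Return counts keyed by (actual class, predicted class)."""
    return Counter(zip(actual, predicted))

# Hypothetical labels for 10 records: 1 = important class, 0 = other class.
actual    = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 0, 1, 0, 0, 1, 0]

cm = confusion_matrix(actual, predicted)
tp = cm[(1, 1)]  # true positives:  actual 1, predicted 1
fn = cm[(1, 0)]  # false negatives: actual 1, predicted 0
fp = cm[(0, 1)]  # false positives: actual 0, predicted 1
tn = cm[(0, 0)]  # true negatives:  actual 0, predicted 0

print("            pred 0  pred 1")
print(f"actual 0    {tn:6d}  {fp:6d}")
print(f"actual 1    {fn:6d}  {tp:6d}")
```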

Overall Accuracy

  • Overall accuracy is the number of correct predictions divided by the total number of records.

  • Accuracy = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Records}} = \frac{\text{True Positives} + \text{True Negatives}}{\text{Total Number of Records}}

  • Problem: Misclassification costs are often asymmetric, so maximizing overall accuracy does not equate to minimizing costs or maximizing profit.

Sensitivity & Specificity

  • Sensitivity: How well a classifier detects members of the important class (also known as the true positive rate).

  • Sensitivity = \frac{\text{True Positives}}{\text{Actual Positives}} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}

  • Specificity: How well a classifier rules out members of the less important class (also known as the true negative rate).

  • Specificity = \frac{\text{True Negatives}}{\text{Actual Negatives}} = \frac{\text{True Negatives}}{\text{True Negatives} + \text{False Positives}}

  • The "important" class is usually the rarer one.
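The three measures can be computed directly from the four confusion-matrix counts. The sketch below uses two hypothetical models on an invented test set of 1,000 records to show why overall accuracy alone can mislead when the important class is rare; the function name `evaluate` and all counts are assumptions.

```python
def evaluate(tp, fn, fp, tn):
    """Compute the three measures from confusion-matrix counts.

    Assumes both classes are present (tp + fn > 0 and tn + fp > 0).
    """
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }

# Hypothetical test set: 1,000 records, 50 of them in the rare, important class.
# Model A never flags anyone; Model B finds 40 of the 50 but raises 100 false alarms.
model_a = evaluate(tp=0, fn=50, fp=0, tn=950)
model_b = evaluate(tp=40, fn=10, fp=100, tn=850)

for name, result in (("A", model_a), ("B", model_b)):
    print(name, {k: round(v, 3) for k, v in result.items()})
```

Model A has the higher overall accuracy yet a sensitivity of zero, so it would be useless if missing an important-class record is the costly error.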

Methods for Classification

  • Classification trees (the primary method discussed in detail)

  • Logistic regression

  • K-nearest neighbors

  • Naïve Bayes

  • Support vector machines

  • Neural networks

Classification Trees

  • Classification trees separate records into subgroups by creating splits on predictor variables, resulting in logical if/then rules.

  • Splitting can continue until 100% accuracy is achieved on the training data; however, this leads to extreme overfitting and poor performance on testing data.
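These notes do not say how a tree chooses its splits. One common criterion is Gini impurity, and the sketch below assumes it: for a single predictor, try the midpoint between each pair of consecutive distinct values and keep the threshold whose two subgroups are purest. The data and function names are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels; 0.0 means the group is pure."""
    if not labels:
        return 0.0
    p1 = sum(labels) / len(labels)
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2

def best_split(values, labels):
    """Return (threshold, weighted impurity) for the best 'value <= threshold' split."""
    distinct = sorted(set(values))
    best = None
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2  # candidate split point: midpoint of neighbors
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best is None or score < best[1]:
            best = (threshold, score)
    return best

# Made-up incomes (in $000s) and loan acceptance (1 = accepted).
income   = [40, 60, 85, 100, 121, 130, 150]
accepted = [0,  0,  0,  0,   1,   1,   1]

threshold, impurity = best_split(income, accepted)
print("split at Income <=", threshold, "with weighted impurity", impurity)
```

A full tree repeats this search over every predictor within each subgroup, which is why unrestricted splitting can eventually isolate every training record.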

Example: Predicting Loan Acceptance

  • A bank aims to predict which customers will accept a loan offer, using historical data on loan acceptance, income, education level, and family size.

  • A classification tree can be used to classify records as either non-acceptors or acceptors.

Types of Nodes in a Classification Tree

  • Decision (splitting) node: Splits data into subgroups and has successors (nodes below it).

  • Terminal node (leaf): Has no successors; contains the total count and the count of each class (e.g., [Class 0, Class 1]).

Information in Each Node

  • Nodes contain:

    • Splitting condition (if a splitting node)

    • Number of records

    • Number of records in each class [Class 0, Class 1]

  • Arrows:

    • Left is “TRUE”

    • Right is “FALSE”
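The node contents listed above map naturally onto a small data structure. The sketch below is one possible representation, not the format of any particular package; the class name `Node`, the record counts, and the choice to predict a leaf's majority class are assumptions.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Node:
    counts: Tuple[int, int]            # (Class 0, Class 1) records in this node
    feature: Optional[str] = None      # splitting variable; None for a leaf
    threshold: Optional[float] = None  # condition is: feature <= threshold
    left: Optional["Node"] = None      # followed when the condition is TRUE
    right: Optional["Node"] = None     # followed when the condition is FALSE

    @property
    def n_records(self):
        return sum(self.counts)

    def is_leaf(self):
        return self.left is None and self.right is None

def classify(node, record):
    """Walk from the root to a leaf, then predict the leaf's majority class."""
    while not node.is_leaf():
        if record[node.feature] <= node.threshold:
            node = node.left
        else:
            node = node.right
    return 0 if node.counts[0] >= node.counts[1] else 1

# One-split tree with made-up counts.
root = Node(counts=(84, 10), feature="Income", threshold=110.5,
            left=Node(counts=(80, 2)), right=Node(counts=(4, 8)))

print(root.n_records, classify(root, {"Income": 100}), classify(root, {"Income": 150}))
```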

Translation to Logical Rules (Example)

  • IF (Income ≤ 110.5) THEN Class = 0 (non-acceptor)

  • IF (Income > 110.5) AND (Education ≤ 0.5) AND (Family ≤ 2.5) THEN Class = 0 (non-acceptor)

  • IF (Income > 116.5) AND (Education > 0.5) THEN Class = 1 (acceptor)
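The three rules above can be written as ordinary if/then code. This is a sketch covering only the listed rules; the function name is hypothetical, and it returns None for records that fall into branches of the tree not shown in these notes.

```python
def classify_by_rules(income, education, family):
    """Apply the three listed rules; return None if no listed rule covers the record."""
    if income <= 110.5:
        return 0  # non-acceptor
    # From here on, income > 110.5 is already known to hold.
    if education <= 0.5 and family <= 2.5:
        return 0  # non-acceptor
    if income > 116.5 and education > 0.5:
        return 1  # acceptor
    return None   # branch not listed in the notes

print(classify_by_rules(income=100, education=1, family=4))
print(classify_by_rules(income=120, education=0, family=2))
print(classify_by_rules(income=120, education=1, family=1))
```

Each rule corresponds to one path from the root of the tree to a terminal node, with the conditions along the path joined by AND.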

Advantages & Disadvantages

  • Advantages:

    • Work with quantitative and/or categorical variables

    • Straightforward interpretation

  • Disadvantages:

    • Easy to over-fit the data

    • "Favors" predictors with many possible split points (i.e., those with many distinct values, such as variables with wide ranges)