The field of data mining is vast, and this introduction only scratches the surface.
Entire courses and textbooks are dedicated to the subject.
Learning objectives:
Understand the definition and tasks associated with data mining.
Understand the concept of overfitting and how to identify it.
Interpret simple classification trees.
Calculate sensitivity and specificity using a confusion matrix.
Outline:
What is Data Mining?
Tasks
Prediction
Classification
Clustering
Market Basket Analysis
Overfitting
Data partitioning: training vs. testing data
Evaluation
Confusion matrices
Overall accuracy
Specificity and sensitivity
Methods
Logistic regression
Classification trees
Data mining involves methods that discover patterns, trends, and relationships within data, especially those that are non-obvious and unexpected.
Key definitions:
Data lake: Unstructured data in its original format.
Data warehouse: A structured database designed for studying data patterns from multiple sources.
Data mart: A smaller-scale data warehouse specific to a particular part of an organization.
Statistics: Focuses on inference from a sample to the population and fits global structure to the data.
Machine Learning: Uses algorithms that learn directly from data, especially local patterns, often in an iterative way. Sometimes called artificial intelligence.
Data Science: Encompasses statistics, machine learning, and data mining.
Example applications:
Credit scoring: Predicting repayment behaviour.
Future purchases: Predicting what a customer will purchase next and tailoring promotions accordingly.
Tax evasion: The IRS is reported to have significantly increased its detection rate of tax evasion by using predictive models.
Customer turnover: Telenor, a Norwegian phone company, reduced customer turnover by predicting which customers would leave and proactively reaching out to them.
Insurance risk: Allstate improved the accuracy of predictions for injury liability claims by incorporating more information about vehicle types.
User engagement: OKCupid employs statistical models to predict the types of messages or content that are most likely to receive a response.
Software:
Paid:
SQL Server Analysis Services (SSAS)
Microsoft Azure
IBM SPSS
Free & open source:
Python
R
Orange (a graphical user interface for running Python workflows; handles only relatively small datasets)
Classification:
Predicting the class (or category) to which a record belongs.
Example: Determining whether a prospect will convert to a customer.
Prediction:
Predicting the value of a continuous variable.
Multiple regression is the most commonly used method.
Cluster analysis:
Separating data into groups such that records are similar within a group and different across groups (a k-means sketch follows this list).
Market basket analysis:
Identifying items that are often purchased together.
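As promised above, a minimal k-means sketch of the cluster analysis task. This assumes scikit-learn; the synthetic "blob" data and the choice of three clusters are assumptions for illustration, not part of the course material.

```python
# A minimal sketch of cluster analysis with k-means.
# The synthetic "blob" data and the choice of k=3 are assumptions.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# 300 records in 2 dimensions, drawn around 3 latent group centers.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Separate the records into 3 groups of similar records.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print(kmeans.labels_[:10])      # cluster assignment of the first 10 records
print(kmeans.cluster_centers_)  # coordinates of the 3 group centroids
```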
Classification involves determining the class to which a record belongs.
It can be used for any number of classes, but examples often focus on two classes.
Examples:
Predicting customer churn (whether a customer will leave in the next 6 months).
Determining whether a person will respond to a promotional offer.
Identifying fraudulent transactions.
Predicting business bankruptcy.
Forecasting supplier disruptions.
Overfitting occurs when a model fits the training data too well and does not generalize well to new data.
The model picks up noise and patterns specific to the training data that do not generalize.
To avoid overfitting, use different data to build and test the model.
Training Data: Used to build the model (typically 70-80% of the original data).
Testing Data: Used to evaluate the model (typically 20-30% of the original data).
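A minimal sketch of this partitioning, assuming scikit-learn and a synthetic dataset rather than any particular course data:

```python
# A minimal sketch of a 70/30 data partition with scikit-learn.
# The synthetic dataset and the exact split ratio are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# 1,000 synthetic records with 5 predictors and a binary class label.
X, y = make_classification(n_samples=1000, n_features=5, random_state=42)

# Hold out 30% of the records for testing; build the model on the other 70%.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print(len(X_train), "training records;", len(X_test), "testing records")
```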
Methods to evaluate the quality of a classification model:
Confusion matrix
Overall accuracy
Sensitivity & specificity
A confusion matrix organizes the counts of records by predicted class and actual class.
It is used to calculate other evaluation measures.
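As a small sketch (assuming scikit-learn; the actual and predicted labels below are invented for illustration):

```python
# A minimal sketch of tabulating a confusion matrix with scikit-learn.
# The actual and predicted labels below are invented for illustration.
from sklearn.metrics import confusion_matrix

y_actual    = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
y_predicted = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

# Rows are actual classes, columns are predicted classes (class 0 first):
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_actual, y_predicted))
# [[5 1]
#  [1 3]]
```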
Overall Accuracy is the number of correct predictions divided by the total number of records.
Accuracy = \frac{\text{Correct Predictions}}{\text{Total Records}} = \frac{\text{True Positives + True Negatives}}{\text{Total Records}}
Problem: Misclassification costs are often asymmetric, so maximizing overall accuracy does not equate to minimizing costs or maximizing profit.
Sensitivity & Specificity
Sensitivity: How well a classifier correctly detects the important class members (also known as the true positive rate).
Sensitivity = \frac{\text{True Positives}}{\text{Actual Positives}} = \frac{\text{True Positives}}{\text{True Positives + False Negatives}}
Specificity: How well a classifier correctly rules out the less important class members (also known as the true negative rate).
Specificity = \frac{\text{True Negatives}}{\text{Actual Negatives}} = \frac{\text{True Negatives}}{\text{True Negatives + False Positives}}
The "important" class is usually the rarer one.
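Continuing the invented labels from the confusion-matrix sketch above, all three measures fall out of the four matrix cells:

```python
# Overall accuracy, sensitivity, and specificity computed from the four
# cells of the confusion matrix (continuing the invented labels above).
from sklearn.metrics import confusion_matrix

y_actual    = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
y_predicted = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

# ravel() flattens the 2x2 matrix in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_actual, y_predicted).ravel()

accuracy    = (tp + tn) / (tp + tn + fp + fn)  # (3 + 5) / 10 = 0.80
sensitivity = tp / (tp + fn)                   # 3 / (3 + 1) = 0.75
specificity = tn / (tn + fp)                   # 5 / (5 + 1) ≈ 0.83
print(accuracy, sensitivity, specificity)
```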
Classification trees (the primary method discussed in detail)
Logistic regression (a brief fitting sketch follows this list)
K-nearest neighbors
Naïve Bayes algorithm
Support vector machines
Neural networks
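As noted above, a minimal sketch of fitting a logistic regression classifier; scikit-learn and the synthetic data are assumptions, and the course's own examples may differ:

```python
# A minimal sketch of fitting a logistic regression classifier.
# The dataset and settings are assumptions, not course specifics.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Fit on the training partition, evaluate on the held-out testing partition.
model = LogisticRegression().fit(X_train, y_train)
print("Testing accuracy:", model.score(X_test, y_test))
```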
Classification trees separate records into subgroups by creating splits on predictor variables, resulting in logical if/then rules.
Splitting can continue until 100% accuracy is achieved on the training data; however, this leads to extreme overfitting and poor performance on testing data.
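To see this concretely, a hedged sketch (again scikit-learn with synthetic data): an unrestricted tree scores 100% on the training data but lower on the testing data, while capping the depth narrows that gap.

```python
# An unrestricted tree memorizes the training data (100% training accuracy)
# but generalizes worse; capping the depth narrows the train/test gap.
# The dataset, depth cap, and split ratio are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

for depth in (None, 3):  # None = keep splitting until leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    print(f"max_depth={depth}: "
          f"train accuracy={tree.score(X_train, y_train):.2f}, "
          f"test accuracy={tree.score(X_test, y_test):.2f}")
```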
A bank aims to predict which customers will accept a loan offer, using historical data on loan acceptance, income, education level, and family size.
A classification tree can be used to classify records as either non-acceptors or acceptors.
Decision (splitting) node: Splits data into subgroups and has successors (nodes below it).
Terminal node (leaf): Contains the total count and count of each class (e.g., [Class 0, Class 1]).
Nodes contain:
Splitting condition (if a splitting node)
Number of records
Number of each class [Class0, Class1]
Arrows:
Left is “TRUE”
Right is “FALSE”
IF (Income ≤ 110.5) THEN Class = 0 (non-acceptor)
IF (Income > 110.5) AND (Education ≤ 0.5) AND (Family ≤ 2.5) THEN Class = 0 (non-acceptor)
IF (Income > 116.5) AND (Education > 0.5) THEN Class = 1 (acceptor)
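Rules like these can be read directly off a fitted tree. A minimal sketch (scikit-learn; the Income/Education/Family names echo the bank example, but the data below is synthetic, so the printed cutoffs will not match the values above):

```python
# Printing the if/then rules of a fitted classification tree.
# Feature names echo the bank example, but the data here is synthetic,
# so the cutoffs printed will not match the 110.5-style values above.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(42)
income    = rng.uniform(20, 200, 500)   # income in $000s
education = rng.integers(0, 2, 500)     # 0/1 indicator
family    = rng.integers(1, 5, 500)     # family size 1-4
X = np.column_stack([income, education, family])

# A made-up acceptance rule used only to generate the synthetic labels.
y = ((income > 110) & ((education == 1) | (family > 2))).astype(int)

tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X, y)
print(export_text(tree, feature_names=["Income", "Education", "Family"]))
```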
Advantages:
Work with quantitative and/or categorical variables
Straightforward interpretation
Disadvantages:
Easy to overfit the data
"Favors" predictors with many possible split points (i.e., those with many distinct values)