The field of data mining is vast, and this introduction only scratches the surface.
Entire courses and textbooks are dedicated to the subject.
Learning objectives:
Understand the definition and tasks associated with data mining.
Understand the concept of overfitting and how to identify it.
Interpret simple classification trees.
Calculate sensitivity and specificity using a confusion matrix.
Outline:
What is Data Mining?
Tasks
Prediction
Classification
Clustering
Market Basket Analysis
Overfitting
Data partitioning: training vs. testing data
Evaluation
Confusion matrices
Overall accuracy
Specificity and sensitivity
Methods
Logistic regression
Classification trees
Data mining involves methods that discover patterns, trends, and relationships within data, especially those that are non-obvious and unexpected.
Key definitions:
Data lake: Unstructured data in its original format.
Data warehouse: A structured database designed for studying data patterns from multiple sources.
Data mart: A smaller-scale data warehouse specific to a particular part of an organization.
Statistics: Focuses on inference from a sample to the population and fits global structure to the data.
Machine Learning: Uses algorithms that learn directly from data, especially local patterns, often in an iterative way. Sometimes called artificial intelligence.
Data Science: Encompasses statistics, machine learning, and data mining.
Example applications:
Credit scoring: Predicting repayment behaviour.
Future purchases: Predicting what a customer will purchase next and tailoring promotions accordingly.
Tax evasion: The IRS is reported to have significantly increased its detection rate of tax evasion by using predictive models.
Customer turnover: Telenor, a Norwegian phone company, reduced customer turnover by predicting which customers would leave and proactively reaching out to them.
Insurance risk: Allstate improved the accuracy of predictions for injury liability claims by incorporating more information about vehicle types.
User engagement: OKCupid employs statistical models to predict the types of messages or content that are most likely to receive a response.
Software:
Paid:
SQL Server Analysis Services (SSAS)
Microsoft Azure
IBM SPSS
Free & open source:
Python
R
Orange (a graphical user interface for running Python workflows; handles only relatively small datasets)
Classification:
Predicting the class (or category) to which a record belongs.
Example: Determining whether a prospect will convert to a customer.
Prediction:
Predicting the value of a continuous variable.
Multiple regression is the most commonly used method.
Cluster analysis:
Separating data into groups such that records are similar within a group and different across groups (a k-means sketch follows this list).
Market basket analysis:
Identifying items that are often purchased together.
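As promised above, a minimal k-means sketch of the cluster analysis task. This assumes scikit-learn; the synthetic "blob" data and the choice of three clusters are assumptions for illustration, not part of the course material.

```python
# A minimal sketch of cluster analysis with k-means.
# The synthetic "blob" data and the choice of k=3 are assumptions.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# 300 records in 2 dimensions, drawn around 3 latent group centers.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Separate the records into 3 groups of similar records.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print(kmeans.labels_[:10])      # cluster assignment of the first 10 records
print(kmeans.cluster_centers_)  # coordinates of the 3 group centroids
```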
Classification involves determining the class to which a record belongs.
It can be used for any number of classes, but examples often focus on two classes.
Examples:
Predicting customer churn (whether a customer will leave in the next 6 months).
Determining whether a person will respond to a promotional offer.
Identifying fraudulent transactions.
Predicting business bankruptcy.
Forecasting supplier disruptions.
Overfitting occurs when a model fits the training data too well and does not generalize well to new data.
The model picks up noise and patterns specific to the training data that do not generalize.
To avoid overfitting, use different data to build and test the model.
Training Data: Used to build the model (typically 70-80% of the original data).
Testing Data: Used to evaluate the model (typically 20-30% of the original data).
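A minimal sketch of this partitioning, assuming scikit-learn and a synthetic dataset rather than any particular course data:

```python
# A minimal sketch of a 70/30 data partition with scikit-learn.
# The synthetic dataset and the exact split ratio are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# 1,000 synthetic records with 5 predictors and a binary class label.
X, y = make_classification(n_samples=1000, n_features=5, random_state=42)

# Hold out 30% of the records for testing; build the model on the other 70%.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print(len(X_train), "training records;", len(X_test), "testing records")
```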
Methods to evaluate the quality of a classification model:
Confusion matrix
Overall accuracy
Sensitivity & specificity
A confusion matrix organizes the counts of records by predicted class and actual class.
It is used to calculate other evaluation measures.
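As a small sketch (assuming scikit-learn; the actual and predicted labels below are invented for illustration):

```python
# A minimal sketch of tabulating a confusion matrix with scikit-learn.
# The actual and predicted labels below are invented for illustration.
from sklearn.metrics import confusion_matrix

y_actual    = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
y_predicted = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

# Rows are actual classes, columns are predicted classes (class 0 first):
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_actual, y_predicted))
# [[5 1]
#  [1 3]]
```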
Overall Accuracy is the number of correct predictions divided by the total number of records.
Accuracy = \frac{\text{Correct Predictions}}{\text{Total Records}} = \frac{\text{True Positives + True Negatives}}{\text{Total Records}}
Problem: Misclassification costs are often asymmetric, so maximizing overall accuracy does not equate to minimizing costs or maximizing profit.
Sensitivity & Specificity
Sensitivity: How well a classifier correctly detects the important class members (also known as the true positive rate).
Sensitivity = \frac{\text{True Positives}}{\text{Actual Positives}} = \frac{\text{True Positives}}{\text{True Positives + False Negatives}}
Specificity: How well a classifier correctly rules out the less important class members (also known as the true negative rate).
Specificity = \frac{\text{True Negatives}}{\text{Actual Negatives}} = \frac{\text{True Negatives}}{\text{True Negatives + False Positives}}
The "important" class is usually the rarer one.
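Continuing the invented labels from the confusion-matrix sketch above, all three measures fall out of the four matrix cells:

```python
# Overall accuracy, sensitivity, and specificity computed from the four
# cells of the confusion matrix (continuing the invented labels above).
from sklearn.metrics import confusion_matrix

y_actual    = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
y_predicted = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

# ravel() flattens the 2x2 matrix in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_actual, y_predicted).ravel()

accuracy    = (tp + tn) / (tp + tn + fp + fn)  # (3 + 5) / 10 = 0.80
sensitivity = tp / (tp + fn)                   # 3 / (3 + 1) = 0.75
specificity = tn / (tn + fp)                   # 5 / (5 + 1) ≈ 0.83
print(accuracy, sensitivity, specificity)
```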
Classification trees (the primary method discussed in detail)
Logistic regression (a brief fitting sketch follows this list)
K-nearest neighbors
Naïve Bayes algorithm
Support vector machines
Neural networks
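As noted above, a minimal sketch of fitting a logistic regression classifier; scikit-learn and the synthetic data are assumptions, and the course's own examples may differ:

```python
# A minimal sketch of fitting a logistic regression classifier.
# The dataset and settings are assumptions, not course specifics.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Fit on the training partition, evaluate on the held-out testing partition.
model = LogisticRegression().fit(X_train, y_train)
print("Testing accuracy:", model.score(X_test, y_test))
```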
Classification trees separate records into subgroups by creating splits on predictor variables, resulting in logical if/then rules.
Splitting can continue until 100% accuracy is achieved on the training data; however, this leads to extreme overfitting and poor performance on testing data.
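To see this concretely, a hedged sketch (again scikit-learn with synthetic data): an unrestricted tree scores 100% on the training data but lower on the testing data, while capping the depth narrows that gap.

```python
# An unrestricted tree memorizes the training data (100% training accuracy)
# but generalizes worse; capping the depth narrows the train/test gap.
# The dataset, depth cap, and split ratio are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

for depth in (None, 3):  # None = keep splitting until leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    print(f"max_depth={depth}: "
          f"train accuracy={tree.score(X_train, y_train):.2f}, "
          f"test accuracy={tree.score(X_test, y_test):.2f}")
```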
A bank aims to predict which customers will accept a loan offer, using historical data on loan acceptance, income, education level, and family size.
A classification tree can be used to classify records as either non-acceptors or acceptors.
Decision (splitting) node: Splits data into subgroups and has successors (nodes below it).
Terminal node (leaf): Contains the total count and count of each class (e.g., [Class 0, Class 1]).
Nodes contain:
Splitting condition (if a splitting node)
Number of records
Number of each class [Class0, Class1]
Arrows:
Left is “TRUE”
Right is “FALSE”
IF (Income ≤ 110.5) THEN Class = 0 (non-acceptor)
IF (Income > 110.5) AND (Education ≤ 0.5) AND (Family ≤ 2.5) THEN Class = 0 (non-acceptor)
IF (Income > 116.5) AND (Education > 0.5) THEN Class = 1 (acceptor)
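Rules like these can be read directly off a fitted tree. A minimal sketch (scikit-learn; the Income/Education/Family names echo the bank example, but the data below is synthetic, so the printed cutoffs will not match the values above):

```python
# Printing the if/then rules of a fitted classification tree.
# Feature names echo the bank example, but the data here is synthetic,
# so the cutoffs printed will not match the 110.5-style values above.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(42)
income    = rng.uniform(20, 200, 500)   # income in $000s
education = rng.integers(0, 2, 500)     # 0/1 indicator
family    = rng.integers(1, 5, 500)     # family size 1-4
X = np.column_stack([income, education, family])

# A made-up acceptance rule used only to generate the synthetic labels.
y = ((income > 110) & ((education == 1) | (family > 2))).astype(int)

tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X, y)
print(export_text(tree, feature_names=["Income", "Education", "Family"]))
```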
Advantages:
Work with quantitative and/or categorical variables
Straightforward interpretation
Disadvantages:
Easy to overfit the data
"Favors" predictors with many possible split points (i.e., those with many distinct values)