Data Mining Study Notes
Chapter 11 Introduction to Data Mining
Learning Objectives (LOs)
LO 11.1: Describe the data mining process.
LO 11.2: Implement similarity measures.
LO 11.3: Assess the predictive performance of data mining models (SKIP).
LO 11.4: Conduct principal component analysis (SKIP).
Introductory Case: Social Media Marketing 1
Company Profile:
Name: FashionTech
Position: Alissa Bridges serves as the marketing director.
Product Focus: Activewear for both men and women.
Target Market: Individuals aged 18–35 with an active and/or outdoors lifestyle.
Marketing Channels:
TV ads
Quarterly catalogs
Product placements
Search engines
Social media
Introductory Case: Social Media Marketing 2
Predictive Models Developed by MarketWiz:
Classification Model: Predicts the probability of potential customers purchasing from FashionTech within 30 days of receiving a promotional message via social media.
Prediction Models:
Predict the one-year purchase amounts of customers acquired through social media channels.
Model Assessment: Alissa’s team wants to use the validation data set for:
Evaluating how accurately the classification model categorizes potential customers into purchase and no-purchase classes.
Comparing the performance of prediction models related to one-year purchase amounts acquired through social media.
11.1: Data Mining Overview 1
Terminology:
Artificial intelligence, machine learning, and data mining are often used interchangeably.
Their definitions overlap without clear boundaries; all refer to applications of computer software that produce solutions unattainable through traditional data analysis.
Artificial Intelligence (AI):
Describes computer systems demonstrating:
Deduction
Pattern recognition
Interpretation of complex data
11.1: Data Mining Overview 2
Machine Learning:
Integrates self-learning algorithms.
Functionality: Allows computers to learn independently without human intervention.
Purpose: Evaluates results and improves performance over time.
Example: Predicting rider demand for optimal driver dispatch at Uber.
11.1: Data Mining Overview 3
Data Mining:
The application of analytical techniques necessary for developing AI and machine learning.
Recognized as a foundational component of machine learning and AI with the goals of uncovering hidden patterns and gaining insights.
Applications include:
Data segmentation
Pattern recognition
Classification
Prediction
Example: Grouping customers into segments for customized promotions.
11.1: Data Mining Overview 4
Nature of Data Mining:
Complex process of analyzing data and applying analytical techniques for insights.
Requires a systematic approach to manage and conduct data mining projects.
Preferred Methodology: CRISP-DM (Cross-Industry Standard Process for Data Mining).
Focuses on business objectives before preparing data and choosing analysis techniques.
11.1: Data Mining Overview 5
CRISP-DM Development:
Developed in the 1990s by five companies:
SPSS
Teradata
Daimler AG
NCR
OHRA
Phases of CRISP-DM:
Business Understanding: Gathers situational context, outlines objectives, schedules, and deliverables.
Data Understanding: Involves collecting raw data and generating preliminary results and potential hypotheses.
Data Preparation: Entails record and variable selection, wrangling, and cleansing.
Modeling: Encompasses selection/execution of data mining techniques, data transformation, documentation of assumptions, and cross-validation.
Evaluation: Evaluates competing models’ performance, selects best models, reviews/interprets results, and develops recommendations.
Deployment: Focuses on developing actionable insights and a strategy for deployment, monitoring, and feedback.
11.1: Data Mining Overview 6
Phases of CRISP-DM:
Business understanding
Data understanding
Data preparation
Modeling
Evaluation
Deployment
11.1: Data Mining Overview 7
Application Flexibility:
Not every phase is necessary for all data mining applications.
Data Preparation Burden: Typically consumes the largest share of project time (approximately 80%) for understanding, cleansing, transforming, and preparing data before modeling.
Other Methodologies:
SEMMA (Sample, Explore, Modify, Model, Assess)
KDD (Knowledge Discovery in Databases)
11.1: Data Mining Overview 8
Data Mining Algorithms: Classified into two techniques based on learning methods:
Supervised Data Mining: Focuses on developing predictive models.
Definition: Target variable is identified.
Example: In regression, the target variable serves as the response variable with historical values in the dataset.
Unsupervised Data Mining: Focuses on exploration, dimension reduction, and pattern recognition without any identified target variable.
11.1: Data Mining Overview 9
Common Supervised Algorithms:
Based on classic statistical techniques like linear regression and logistic regression.
Use predictors to make predictions or describe changes in the target variable ($y$).
Training Supervised Models: Known target variable values build the model.
Model performance evaluated based on predicted vs. actual value deviation.
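As a minimal sketch of evaluating a supervised model by the deviation between predicted and actual target values, the snippet below computes two common deviation summaries (RMSE and MAE); the data values are invented for illustration:

```python
import math

# Hypothetical actual and predicted target values from a trained model
actual = [12.0, 15.5, 9.8, 20.1]
predicted = [11.4, 16.0, 10.5, 18.9]

# Root mean squared error: average squared deviation, then square root
rmse = math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Mean absolute error: average absolute deviation
mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

print(round(rmse, 4), round(mae, 4))
```

Smaller values of either summary indicate predictions closer to the known target values in the training or validation data.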
11.1: Data Mining Overview 10
Applications of Supervised Data Mining:
Classification Models:
Target variable is categorical (e.g., classify stock recommendations: buy, hold, sell).
Prediction Models:
Target variable is numerical (e.g., predict spending of a customer).
Other Machine Learning Algorithms:
k-Nearest Neighbors
Naïve Bayes
Decision Trees
11.1: Data Mining Overview 11
Unsupervised Data Mining:
No knowledge of the target variable is required.
Algorithms identify patterns and relationships without analyst guidance.
Central to exploratory data analysis and descriptive analytics; often used before supervised learning to gain insight and summarize data.
Applications: Dimension reduction and pattern recognition.
11.1: Data Mining Overview 12
Dimension Reduction:
Converts high-dimensional data into fewer dimensions while retaining the most significant information.
Benefits: Reduces information redundancy and enhances model stability, which is especially valuable in big data contexts for identifying patterns and building robust models.
Pattern Recognition:
Identifies recurring sequences and recognizable features from data.
11.2: Similarity Measures 1
Definition of Similarity Measures:
Evaluate the similarity or dissimilarity of observations based on distance metrics computed between pairs of observations across variables.
Interpretation:
Small distances imply high similarity.
Large distances imply low similarity.
Represent all variables as $x_1, x_2, \ldots, x_k$ (for $k$ variables).
11.2: Similarity Measures 2
Example Illustrating Measure:
Consider observations on two variables represented on a plane.
11.2: Similarity Measures 3
Distance Metrics:
Euclidean Distance:
Most widely used measure of the distance between two observations ($i$ and $j$): $d_{ij} = \sqrt{\sum_{m=1}^{k}(x_{im} - x_{jm})^2}$.
Manhattan Distance:
Distance measured along horizontal and vertical paths (city-block distance) between observations ($i$ and $j$): $d_{ij} = \sum_{m=1}^{k}\lvert x_{im} - x_{jm}\rvert$.
Influence of Outliers:
Euclidean distance is more affected by outliers compared to Manhattan distance.
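The two metrics above can be sketched as follows; the example points are invented for illustration:

```python
import math

def euclidean(x, y):
    # Square root of the sum of squared coordinate differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    # Sum of absolute coordinate differences (city-block path)
    return sum(abs(a - b) for a, b in zip(x, y))

p, q = (1, 2), (4, 6)       # hypothetical observations on two variables
print(euclidean(p, q))      # 5.0
print(manhattan(p, q))      # 7
```

Because the Euclidean metric squares each coordinate difference, a single outlying value dominates it more than it dominates the Manhattan metric, consistent with the outlier note above.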
11.2: Similarity Measures 4
Distance Calculation Example:
Calculate and interpret the Euclidean and Manhattan distances between given observations.
11.2: Similarity Measures 5
Example Calculation Results:
Euclidean:
Distance between Observations 1 and 2: (calculate)
Distance between Observations 1 and 3: 7.62
Distance between Observations 2 and 3: 7.21
Manhattan:
Distance between Observations 1 and 2: (calculate)
Distance between Observations 1 and 3: 10
Distance between Observations 2 and 3: 10
11.2: Similarity Measures 6
Influences on Distance Measures:
Scale of each variable can affect calculated distances.
Different scales can distort the true distances, yielding inaccurate results.
To avoid this, standardize variables to be unit-free, ensuring each variable receives equal weight in distance calculations.
Standardization Methods:
Z-scores: Distance from the mean expressed in standard deviations.
Min-max normalization: Rescale each observation to a 0-1 range.
11.2: Similarity Measures 7
Example Using Consumer Data:
Five consumers' annual income and hours spent online are analyzed for distance measures.
Standardized Observations Result:
Data for each individual with standardized and normalized values provided
Jane: Income: 125,678, Hours: 2.5 → Standardized: 1.2473, -1.5071; Normalized: 1.0000, 0.0000
Kevin: Income: 65,901, Hours: 10.1 → Standardized: -1.1892, 1.0382; Normalized: 0.0000, 1.0000
Dolores: Income: 75,550, Hours: 5.8 → Standardized: -0.7959, -0.4019; Normalized: 0.1614, 0.4342
Deshaun: Income: 110,250, Hours: 9.0 → Standardized: 0.6184, 0.6698; Normalized: 0.7419, 0.8553
Mei: Income: 98,005, Hours: 7.6 → Standardized: 0.1194, 0.2010; Normalized: 0.5371, 0.6711
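The standardized and normalized values above can be reproduced with the sketch below, using the incomes and hours from the example; matching the printed z-scores appears to require the sample standard deviation (dividing by $n-1$), which is an inference from the numbers rather than something the notes state:

```python
import math

# Annual income and online hours for the five consumers in the example
income = [125678, 65901, 75550, 110250, 98005]   # Jane, Kevin, Dolores, Deshaun, Mei
hours = [2.5, 10.1, 5.8, 9.0, 7.6]

def z_scores(values):
    # z = (x - mean) / sample standard deviation (divides by n - 1)
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    return [(x - mean) / sd for x in values]

def min_max(values):
    # Rescale each observation to the 0-1 range
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

z_income, z_hours = z_scores(income), z_scores(hours)
print(round(z_income[0], 4), round(z_hours[0], 4))   # Jane: 1.2473 -1.5071
print(min_max(income)[0], min_max(income)[1])        # Jane: 1.0, Kevin: 0.0
```

Note that Jane has the highest income (normalized to 1) but the fewest online hours (normalized to 0), so standardization keeps either variable from dominating a distance calculation.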
11.2: Similarity Measures 8
Applicability of Distance Measures:
Euclidean and Manhattan Distances: Suitable for numerical variables.
Categorical Variables: Require different measures of similarity.
Binary Variable: A categorical variable with two categories.
Common Categorical Measures:
Matching coefficient
Jaccard’s coefficient
11.2: Similarity Measures 9
Matching Coefficient:
Computed as the number of matching variable values divided by the total number of variables.
Higher values signify greater similarity; a value of 1 indicates a perfect match.
11.2: Similarity Measures 10
Example Calculation of Matching Coefficients:
For a group of college students with major, field, gender, and Dean’s List status, compute matching coefficients and analyze student pair similarities.
Sample Student Records:
Student 1: Major: Business, Field: MIS, Sex: Female, Dean’s List: Yes
Student 2: Major: Engineering, Field: Electrical, Sex: Male, Dean’s List: Yes
Student 3: Major: Business, Field: Accounting, Sex: Female, Dean’s List: No
11.2: Similarity Measures 11
Similarity Analysis Results:
Compute matching coefficients among student pairs for:
Student 1 and 2
Student 1 and 3
Student 2 and 3
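A sketch of the computation for the three student records given above (the coefficient is the proportion of attributes on which a pair agrees):

```python
# Attribute records (Major, Field, Sex, Dean's List) for the three students
students = {
    1: ("Business", "MIS", "Female", "Yes"),
    2: ("Engineering", "Electrical", "Male", "Yes"),
    3: ("Business", "Accounting", "Female", "No"),
}

def matching_coefficient(a, b):
    # Proportion of variables on which the two observations agree
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)

print(matching_coefficient(students[1], students[2]))  # 0.25 (Dean's List only)
print(matching_coefficient(students[1], students[3]))  # 0.5  (Major and Sex)
print(matching_coefficient(students[2], students[3]))  # 0.0  (no matches)
```

By this measure Students 1 and 3 are the most similar pair, and Students 2 and 3 share no attribute values at all.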
11.2: Similarity Measures 12
Limitations of the Matching Coefficient:
Treats matches of negative outcomes (mutual absences) the same as matches of positive outcomes, potentially inflating similarity and producing misleading measures.
Jaccard’s Coefficient:
Excludes mutual negative outcomes (cases where both observations lack the feature), providing a purer measure of similarity.
11.2: Similarity Measures 13
Example in Retail Transaction Data:
Each retail transaction records the products purchased, denoted 'yes' if the item was purchased and 'no' otherwise.
Transaction Examples:
Transaction 1: Items - Keyboard (Yes), Memory card (No), Mouse (Yes), USB drive (No), Headphone (Yes)
Transaction 2: Items - Keyboard (Yes), Memory card (Yes), Mouse (Yes), USB drive (No), Headphone (No)
Transaction 3: Items - Keyboard (No), Memory card (No), Mouse (No), USB drive (No), Headphone (Yes)
11.2: Similarity Measures 14
Continuing the Retail Transaction Analysis: Compare Jaccard's coefficients with matching coefficients across the pairwise transaction sets.
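The comparison for the three transactions above can be sketched as follows, encoding 'yes' as 1 and 'no' as 0; Jaccard's coefficient drops the no-no pairs that the matching coefficient counts as agreements:

```python
# Purchase indicators for Keyboard, Memory card, Mouse, USB drive, Headphone
t1 = [1, 0, 1, 0, 1]
t2 = [1, 1, 1, 0, 0]
t3 = [0, 0, 0, 0, 1]

def matching(a, b):
    # Counts both yes-yes and no-no agreements over all items
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)

def jaccard(a, b):
    # Ignores no-no pairs: yes-yes matches / pairs with at least one yes
    both = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    either = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    return both / either

print(matching(t1, t2), jaccard(t1, t2))  # 0.6 0.5
print(matching(t1, t3), jaccard(t1, t3))  # 0.6 0.3333...
print(matching(t2, t3), jaccard(t2, t3))  # 0.2 0.0
```

Transactions 1 and 3 illustrate the limitation: their matching coefficient (0.6) is propped up by items neither transaction contains, while Jaccard's coefficient (1/3) reflects that they share only the headphone purchase.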
Conclusion
All rights reserved. No reproduction or distribution without the prior written consent of McGraw Hill LLC.