Data Mining Study Notes

Chapter 11: Introduction to Data Mining


Learning Objectives (LOs)

  • LO 11.1: Describe the data mining process.

  • LO 11.2: Implement similarity measures.

  • LO 11.3: Assess the predictive performance of data mining models (SKIP).

  • LO 11.4: Conduct principal component analysis (SKIP).


Introductory Case: Social Media Marketing 1

  • Company Profile:

    • Name: FashionTech

    • Position: Alissa Bridges serves as the marketing director.

    • Product Focus: Activewear for both men and women.

    • Target Market: Individuals aged 18–35 with an active and/or outdoors lifestyle.

    • Marketing Channels:

    • TV ads

    • Quarterly catalogs

    • Product placements

    • Search engines

    • Social media


Introductory Case: Social Media Marketing 2

  • Predictive Models Developed by MarketWiz:

    • Classification Model: Predicts the probability of potential customers purchasing from FashionTech within 30 days of receiving a promotional message via social media.

    • Prediction Models:

    • Predict the one-year purchase amounts of customers acquired through social media channels.

  • Model Assessment: Alissa’s team wants to use the validation data set for:

    1. Evaluating how accurately the classification model categorizes potential customers into purchase and no-purchase classes.

    2. Comparing the performance of prediction models for the one-year purchase amounts of customers acquired through social media.


11.1: Data Mining Overview 1

  • Terminology:

    • Artificial intelligence, machine learning, and data mining are often used interchangeably.

    • Their definitions overlap without clear boundaries; all refer to using computer software to solve problems that traditional data analysis cannot.

  • Artificial Intelligence (AI):

    • Describes computer systems that demonstrate:

    • Deduction

    • Pattern recognition

    • Interpretation of complex data


11.1: Data Mining Overview 2

  • Machine Learning:

    • Integrates self-learning algorithms.

    • Functionality: Allows computers to learn from data without human intervention.

    • Purpose: Evaluates results and improves performance over time.

    • Example: Predicting rider demand for optimal driver dispatch at Uber.


11.1: Data Mining Overview 3

  • Data Mining:

    • The application of analytical techniques necessary for developing AI and machine learning.

    • Recognized as a foundational component of machine learning and AI with the goals of uncovering hidden patterns and gaining insights.

    • Applications include:

    • Data segmentation

    • Pattern recognition

    • Classification

    • Prediction

    • Example: Grouping customers into segments for customized promotions.


11.1: Data Mining Overview 4

  • Nature of Data Mining:

    • Complex process of analyzing data and applying analytical techniques for insights.

    • Requires a systematic approach to manage and conduct data mining projects.

    • Preferred Methodology: CRISP-DM (Cross-Industry Standard Process for Data Mining).

    • Focuses on business objectives before preparing data and choosing analysis techniques.


11.1: Data Mining Overview 5

  • CRISP-DM Development:

    • Developed in the 1990s by five companies:

    • SPSS

    • Teradata

    • Daimler AG

    • NCR

    • OHRA

    • Phases of CRISP-DM:

    1. Business Understanding: Gathers situational context, outlines objectives, schedules, and deliverables.

    2. Data Understanding: Involves collecting raw data and generating preliminary results and potential hypotheses.

    3. Data Preparation: Entails record and variable selection, wrangling, and cleansing.

    4. Modeling: Encompasses selection/execution of data mining techniques, data transformation, documentation of assumptions, and cross-validation.

    5. Evaluation: Evaluates competing models’ performance, selects best models, reviews/interprets results, and develops recommendations.

    6. Deployment: Focuses on developing actionable insights and a strategy for deployment, monitoring, and feedback.


11.1: Data Mining Overview 6

  • Phases of CRISP-DM:

    1. Business understanding

    2. Data understanding

    3. Data preparation

    4. Modeling

    5. Evaluation

    6. Deployment


11.1: Data Mining Overview 7

  • Application Flexibility:

    • Not every phase is necessary for all data mining applications.

    • Data Preparation Caveat: This phase often consumes a significant portion of project time (approximately 80%) for understanding, cleansing, transforming, and preparing data before modeling.

  • Other Methodologies:

    • SEMMA (Sample, Explore, Modify, Model, Assess)

    • KDD (Knowledge Discovery in Databases)


11.1: Data Mining Overview 8

  • Data Mining Algorithms: Classified into two techniques based on learning methods:

    • Supervised Data Mining: Focuses on developing predictive models.

    • Definition: Target variable is identified.

    • Example: In regression, the target variable serves as the response variable with historical values in the dataset.

    • Unsupervised Data Mining: Focuses on exploration, dimension reduction, and pattern recognition without any identified target variable.


11.1: Data Mining Overview 9

  • Common Supervised Algorithms:

    • Based on classic statistical techniques like linear regression and logistic regression.

    • Use predictors to make predictions or describe changes in the target variable ($y$).

    • Training Supervised Models: Known target variable values are used to build the model.

    • Model performance is evaluated by the deviation between predicted and actual values.


11.1: Data Mining Overview 10

  • Applications of Supervised Data Mining:

    • Classification Models:

    • Target variable is categorical (e.g., classify stock recommendations: buy, hold, sell).

    • Prediction Models:

    • Target variable is numerical (e.g., predict spending of a customer).

    • Other Machine Learning Algorithms:

    • k-Nearest Neighbors

    • Naïve Bayes

    • Decision Trees
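As a concrete illustration of one supervised algorithm from the list above, here is a minimal k-nearest neighbors classifier sketch in plain Python. The training observations and labels are hypothetical, invented for illustration; they are not data from the text.

```python
from collections import Counter
import math

def knn_classify(train, labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Rank training observations by Euclidean distance to the query
    order = sorted(range(len(train)), key=lambda i: math.dist(train[i], query))
    nearest = [labels[i] for i in order[:k]]
    return Counter(nearest).most_common(1)[0][0]

# Hypothetical labeled data: (income in $1,000s, weekly online hours) -> purchase?
train = [(120, 3), (70, 10), (76, 6), (110, 9), (98, 8)]
labels = ["no", "yes", "yes", "no", "yes"]
print(knn_classify(train, labels, (90, 7), k=3))  # -> "yes"
```

The target variable (`labels`) is known for every training observation, which is what makes this supervised learning.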


11.1: Data Mining Overview 11

  • Unsupervised Data Mining:

    • No knowledge of the target variable is required.

    • Algorithms identify patterns and relationships without analyst guidance.

    • Key in exploratory data analysis and descriptive analytics, used before supervised learning for insight and data summarization.

    • Applications: Dimension reduction and pattern recognition.


11.1: Data Mining Overview 12

  • Dimension Reduction:

    • Converts high-dimensional data into fewer dimensions while retaining significant information.

    • Benefits: Reduces information redundancy and enhances model stability, which is especially valuable in big data contexts for identifying patterns and building robust models.

  • Pattern Recognition:

    • Identifies recurring sequences and recognizable features from data.


11.2: Similarity Measures 1

  • Definition of Similarity Measures:

    • Evaluate how similar or dissimilar observations are, based on distance metrics computed between pairs of observations.

    • Interpretation:

    • Small distances imply high similarity.

    • Large distances imply low similarity.

    • All variables are represented as $x_1, x_2, …, x_k$ (for $k$ variables).


11.2: Similarity Measures 2

  • Example Illustrating Measure:

    • Consider observations on two variables represented on a plane.


11.2: Similarity Measures 3

  • Distance Metrics:

    • Euclidean Distance:

    • Most widely used measure assessing the distance between two observations ($i$ and $j$).

    • Manhattan Distance:

    • Sum of the horizontal and vertical distances (the shortest grid path) between observations ($i$ and $j$).

    • Influence of Outliers:

    • Euclidean distance is more affected by outliers than Manhattan distance, because squaring the differences magnifies large deviations.
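Both metrics can be sketched in a few lines of Python. The two example points below are hypothetical, chosen only to make the arithmetic easy to follow:

```python
import math

def euclidean(p, q):
    """Straight-line distance: square root of summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """Sum of absolute coordinate differences (horizontal + vertical path)."""
    return sum(abs(a - b) for a, b in zip(p, q))

# Hypothetical observations on two variables
obs_i, obs_j = (2, 3), (5, 7)
print(euclidean(obs_i, obs_j))  # 5.0  (sqrt(3**2 + 4**2))
print(manhattan(obs_i, obs_j))  # 7    (3 + 4)
```

The squared terms in the Euclidean formula are why a single outlying coordinate inflates it more than the Manhattan distance.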


11.2: Similarity Measures 4

  • Distance Calculation Example:

    • Calculate and interpret the Euclidean and Manhattan distances between given observations.


11.2: Similarity Measures 5

  • Example Calculation Results:

    • Euclidean:

    • Distance between Observations 1 and 2: (calculate)

    • Distance between Observations 1 and 3: 7.62

    • Distance between Observations 2 and 3: 7.21

    • Manhattan:

    • Distance between Observations 1 and 2: (calculate)

    • Distance between Observations 1 and 3: 10

    • Distance between Observations 2 and 3: 10


11.2: Similarity Measures 6

  • Influences on Distance Measures:

    • Scale of each variable can affect calculated distances.

    • Different scales can distort the true distances, yielding inaccurate results.

    • To prevent this, standardize variables to be unit-free, ensuring equal weight during distance calculations.

    • Standardization Methods:

    • Z-scores: Distance from the mean expressed in standard deviations.

    • Min-max normalization: Rescales each observation to a 0–1 range.


11.2: Similarity Measures 7

  • Example Using Consumer Data:

    • Five consumers' annual income and hours spent online analyzed for distance measures.

    • Standardized Observations Result:

    • Data for each individual with standardized and normalized values provided

      • Jane: Income: 125,678, Hours: 2.5 → Standardized: 1.2473, -1.5071; Normalized: 1.0000, 0.0000

      • Kevin: Income: 65,901, Hours: 10.1 → Standardized: -1.1892, 1.0382; Normalized: 0.0000, 1.0000

      • Dolores: Income: 75,550, Hours: 5.8 → Standardized: -0.7959, -0.4019; Normalized: 0.1614, 0.4342

      • Deshaun: Income: 110,250, Hours: 9.0 → Standardized: 0.6184, 0.6698; Normalized: 0.7419, 0.8553

      • Mei: Income: 98,005, Hours: 7.6 → Standardized: 0.1194, 0.2010; Normalized: 0.5371, 0.6711
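The standardized and normalized values in the table above can be reproduced with Python's `statistics` module. This sketch assumes the table uses the sample standard deviation (n − 1 in the denominator), which is consistent with the reported z-scores:

```python
import statistics

# Consumer data from the example (Jane, Kevin, Dolores, Deshaun, Mei)
incomes = [125678, 65901, 75550, 110250, 98005]
hours = [2.5, 10.1, 5.8, 9.0, 7.6]

def z_scores(xs):
    """Standardize: distance from the mean in sample standard deviations."""
    mean, sd = statistics.mean(xs), statistics.stdev(xs)
    return [(x - mean) / sd for x in xs]

def min_max(xs):
    """Min-max normalization: rescale each observation to the 0-1 range."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

print([round(z, 4) for z in z_scores(incomes)])
# [1.2473, -1.1892, -0.7959, 0.6184, 0.1194]
print([round(v, 4) for v in min_max(hours)])
# [0.0, 1.0, 0.4342, 0.8553, 0.6711]
```

Note that Jane's income normalizes to 1 (the maximum) while her hours normalize to 0 (the minimum), matching the table.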


11.2: Similarity Measures 8

  • Applicability of Distance Measures:

    • Euclidean and Manhattan Distances: Suitable for numerical variables.

    • Categorical Variables: Require different measures of similarity.

    • Binary Variable: A categorical variable with two categories.

    • Common Categorical Measures:

    • Matching coefficient

    • Jaccard’s coefficient


11.2: Similarity Measures 9

  • Matching Coefficient:

    • Based on value matching to determine similarities in categorical data.

    • Higher values signify greater similarity; a value of 1 indicates a perfect match on every attribute.


11.2: Similarity Measures 10

  • Example Calculation of Matching Coefficients:

    • For a group of college students with major, field, gender, and Dean’s List status, compute matching coefficients and analyze student pair similarities.

    • Sample Student Records:

    • Student 1: Major: Business, Field: MIS, Sex: Female, Dean’s List: Yes

    • Student 2: Major: Engineering, Field: Electrical, Sex: Male, Dean’s List: Yes

    • Student 3: Major: Business, Field: Accounting, Sex: Female, Dean’s List: No


11.2: Similarity Measures 11

  • Similarity Analysis Results:

    • Compute matching coefficients among student pairs for:

    • Student 1 and 2

    • Student 1 and 3

    • Student 2 and 3
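Using the three student records given above, the matching coefficients can be computed directly; here is a minimal sketch in plain Python:

```python
def matching_coefficient(a, b):
    """Share of attributes on which two records take the same value."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)

# Student records from the example: (Major, Field, Sex, Dean's List)
s1 = ("Business", "MIS", "Female", "Yes")
s2 = ("Engineering", "Electrical", "Male", "Yes")
s3 = ("Business", "Accounting", "Female", "No")

print(matching_coefficient(s1, s2))  # 0.25 (match on Dean's List only)
print(matching_coefficient(s1, s3))  # 0.5  (match on Major and Sex)
print(matching_coefficient(s2, s3))  # 0.0  (no matching attributes)
```

Students 1 and 3 are the most similar pair, matching on two of the four attributes.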


11.2: Similarity Measures 12

  • Limitations of the Matching Coefficient:

    • Does not differentiate between positive and negative outcomes, which can produce misleading similarity measures when mutual absences dominate.

    • Jaccard’s Coefficient:

    • Excludes negative-to-negative matches (mutual absences), providing a purer measure of similarity.


11.2: Similarity Measures 13

  • Example in Retail Transaction Data:

    • Retail transactions record the products purchased, denoted 'yes' for a purchase and 'no' for no purchase.

    • Transaction Examples:

    • Transaction 1: Items - Keyboard (Yes), Memory card (No), Mouse (Yes), USB drive (No), Headphone (Yes)

    • Transaction 2: Items - Keyboard (Yes), Memory card (Yes), Mouse (Yes), USB drive (No), Headphone (No)

    • Transaction 3: Items - Keyboard (No), Memory card (No), Mouse (No), USB drive (No), Headphone (Yes)


11.2: Similarity Measures 14

  • Continuing Retail Transaction Analysis: Analyze Jaccard’s coefficients alongside matching coefficients for transaction pairwise sets.
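A sketch of both coefficients on the three transactions given above (`True` = purchased). Note how the mutual absences shared by Transactions 1 and 3 inflate their matching coefficient relative to Jaccard's coefficient:

```python
def jaccard(a, b):
    """Both-yes count over count of items where at least one is yes."""
    both = sum(x and y for x, y in zip(a, b))
    either = sum(x or y for x, y in zip(a, b))
    return both / either if either else 0.0

def matching(a, b):
    """Matching coefficient: counts yes-yes AND no-no agreements."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

# Items: (Keyboard, Memory card, Mouse, USB drive, Headphone)
t1 = (True, False, True, False, True)
t2 = (True, True, True, False, False)
t3 = (False, False, False, False, True)

print(jaccard(t1, t2), matching(t1, t2))  # 0.5 vs 0.6
print(jaccard(t1, t3), matching(t1, t3))  # ~0.333 vs 0.6
print(jaccard(t2, t3), matching(t2, t3))  # 0.0 vs 0.2
```

Transactions 1 and 3 look fairly similar by the matching coefficient (0.6) mostly because both skipped the memory card and USB drive, while Jaccard's coefficient (about 0.33) counts only the headphone they actually share.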


Conclusion

  • All rights reserved. No reproduction or distribution without the prior written consent of McGraw Hill LLC.