Chapter 11: Data Mining Vocabulary
11.1 Data Mining Overview
The terms ‘artificial intelligence,’ ‘machine learning,’ and ‘data mining’ are often used interchangeably; their definitions overlap with no clear boundaries. All three describe applications of computer software used to obtain insights that traditional data analysis techniques may not be able to achieve.
In a very broad sense:
Artificial intelligence (AI) describes computer systems that demonstrate human-like intelligence and cognitive abilities, including:
Deduction
Pattern recognition
Interpretation of complex data
Machine learning (ML) describes techniques that integrate self-learning algorithms. It is an application of AI that allows the computer to learn automatically, without human intervention or assistance, and is designed to evaluate results and improve performance over time. ML techniques can uncover hidden patterns and relationships in data.
Example: Predict rider demand to strategically dispatch drivers for Uber.
Data mining describes the process of applying a set of analytical techniques necessary for the development of machine learning and artificial intelligence. Data mining is often recognized as a building block of ML and AI.
Goals: uncover hidden patterns and relationships in data; gain insights and derive relevant information to help make decisions.
Techniques are used for data segmentation, pattern recognition, classification, and prediction.
Example: Group customers into segments for customized promotions.
CRISP-DM (Cross-Industry Standard Process for Data Mining) methodology:
Popular holistic approach to data mining projects that emphasizes business goals and objectives prior to preparing the data and choosing analysis techniques.
CRISP-DM was developed in the 1990s by SPSS, Teradata, Daimler AG, NCR, and OHRA.
Six major phases:
Business understanding: situational context, specific objectives, project schedule, deliverables.
Data understanding: collecting raw data, preliminary results, potential hypotheses.
Data preparation: record and variable selection, wrangling, cleaning.
Modeling: selection and execution of data mining techniques, convert or transform data to formats/types needed for certain analyses, document assumptions, cross-validation.
Evaluation: evaluate performance of competing models, select best models, review and interpret results, develop recommendations.
Deployment: develop a set of actionable insights and a strategy for deployment/monitoring/feedback.
Not every step is needed for all data mining applications. The data preparation phase plays a significant role in the data mining process. An analyst or analytics team tends to spend a sizable portion of project time (often 80%) on understanding, cleansing, transforming, and preparing data leading up to modeling activities.
The methodology is popular because it offers a holistic approach with detailed phases, tasks, and activities.
Other data mining methodologies include SEMMA (Sample, Explore, Modify, Model, and Assess) and KDD (Knowledge Discovery in Databases).
Data mining algorithms are classified into two types based on how they learn about data:
Supervised: developing predictive models where the target variable is identified.
In regression, the target variable is the response variable. Historical values of the target variable exist in the data set. A regression model is trained using known target values. Performance is evaluated by how predicted values deviate from actual values.
Unsupervised: data exploration, dimension reduction, and pattern recognition with no target variable identified.
Supervised data mining algorithms commonly rely on classic statistical techniques, including:
Linear regression
Logistic regression
These use information on predictor variables to predict and/or describe changes in the target variable y.
The model is trained (supervised) because the known values of the target variable are used to build the model.
Model performance is evaluated based on how well the predicted values match actual values.
Common applications in supervised learning:
Classification: target variable is categorical; aim is to predict class membership for new cases (e.g., stock buy/hold/sell).
Prediction: target variable is numerical; aim is to predict a numeric target for a new case (e.g., customer spending).
Other ML algorithms include k-Nearest Neighbors (k-NN), Naïve Bayes, and Decision Trees.
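To make the classification idea concrete, here is a minimal sketch of one of the algorithms named above, k-Nearest Neighbors, written from scratch. The training data, feature values, and the "buy"/"sell" labels are hypothetical, invented for illustration; they are not from the chapter's examples.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric records."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, new_x, k=3):
    """Classify new_x by majority vote among its k nearest training records."""
    neighbors = sorted(zip(train_X, train_y), key=lambda p: euclidean(p[0], new_x))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical labeled training data: two numeric features per record
train_X = [(1.0, 1.2), (0.9, 1.0), (3.0, 3.2), (3.1, 2.9)]
train_y = ["buy", "buy", "sell", "sell"]

print(knn_predict(train_X, train_y, (1.1, 1.1), k=3))  # "buy"
```

Because the known labels in `train_y` drive the prediction, this is supervised learning in the sense described above: the target variable is identified and its historical values train the model.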
Unsupervised learning is an important part of exploratory data analysis and descriptive analytics:
Used prior to supervised learning to understand the data, formulate questions, or summarize data.
Common applications include dimension reduction and pattern recognition.
Dimension reduction and pattern recognition:
Dimension reduction: converts a high-dimensional data set (many variables) into data with fewer dimensions without losing much information.
Deploy before other data mining methods.
Reduces information redundancy and improves model stability.
Particularly relevant for big data to reveal important patterns and build more stable models.
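As a sketch of dimension reduction, the snippet below projects a data set onto its top principal components (PCA via singular value decomposition). It assumes NumPy is available; the data matrix is hypothetical, chosen so the three variables are highly correlated and two dimensions retain nearly all the information.

```python
import numpy as np

def reduce_dimensions(X, n_components):
    """Project centered data onto its top principal components (PCA via SVD)."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)           # center each variable at zero
    # Rows of Vt are the principal directions, ordered by variance explained
    U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T   # scores in the reduced space

# Hypothetical data: 5 records, 3 correlated variables, reduced to 2 dimensions
X = [[1, 2, 1.1], [2, 4, 2.0], [3, 6, 3.1], [4, 8, 3.9], [5, 10, 5.0]]
Z = reduce_dimensions(X, 2)
print(Z.shape)  # (5, 2)
```

The reduced scores `Z` can then be fed into distance-based or supervised methods in place of the original variables, which is the "deploy before other data mining methods" idea above.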
Pattern recognition: identifying patterns using machine learning.
Recurring sequences, frequent combinations, recognizable features, common characteristics.
Introductory Case: Social Media Marketing (FashionTech)
FashionTech is an online apparel retailer focusing on activewear for men and women.
Target market: individuals aged 18–35 with an active and/or outdoors lifestyle.
Marketing channels include TV ads, quarterly catalogs, product placements, search engines, and social media.
MarketWiz was hired to develop predictive models to help FashionTech acquire new customers and increase sales from existing customers.
Two types of predictive models developed:
A classification model that predicts potential customers’ purchase probability within 30 days of receiving a promotional message in their social media account.
Two prediction models that predict the one-year purchase amounts of customers acquired through social media channels.
Validation data set usage:
(1) Evaluate how accurately the classification model classifies potential customers into the purchase and no-purchase classes.
(2) Compare the performance of the prediction models that estimate the one-year purchase amounts of customers acquired through social media channels.
11.2 Similarity Measures
Similarity measures gauge whether observations in a group are similar or dissimilar to one another. They are based on the distance between pairs of observations (records) across the variables.
Small distance implies high similarity; large distance implies low similarity.
Represent observations as points in a k-dimensional space, where k is the number of variables. Example: consider three observations on two variables.
Euclidean distance (d_E):
Widely used.
d_E(i,j) = \sqrt{\sum_{m=1}^{k} (x_{im} - x_{jm})^2}
Manhattan distance (d_M):
Measures distance along grid lines, like the shortest path a vehicle can follow on a city street grid (also called city-block distance).
d_M(i,j) = \sum_{m=1}^{k} |x_{im} - x_{jm}|
The Euclidean distance is more influenced by outliers than the Manhattan distance.
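The two formulas above can be sketched directly in code. The pair of points is hypothetical, chosen as a 3-4-5 right triangle so the Euclidean result is easy to check by hand.

```python
import math

def euclidean(x, y):
    """d_E(i, j) = sqrt of the sum over m of (x_im - x_jm)^2."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """d_M(i, j) = sum over m of |x_im - x_jm|."""
    return sum(abs(a - b) for a, b in zip(x, y))

# Hypothetical observations on two variables
p, q = (1, 2), (4, 6)
print(euclidean(p, q))  # 5.0 (the 3-4-5 right triangle)
print(manhattan(p, q))  # 7   (|3| + |4|)
```

Because each coordinate difference is squared before summing, a single extreme coordinate inflates the Euclidean distance more than the Manhattan distance, which is the outlier sensitivity noted above.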
Example: three observations on two variables. The excerpt shows Euclidean distances of 7.62 (observations 1 and 3) and 7.21 (observations 2 and 3); the remaining Euclidean and Manhattan pairwise distances are not displayed in the provided excerpt.
The scale of each variable can influence the distance measures:
Different scales can distort the true distance between points and lead to inaccurate results.
Solution: make values unit-free so each value receives equal weight when calculating distance measures.
Standardize (z-score): compute a z-score, the distance of an observation from the mean in units of standard deviation.
Min–max normalization: rescale each value to be between 0 and 1.
Formulas:
Standardization (z-score): z = \frac{x - \mu}{\sigma}
Min–max normalization: x' = \frac{x - \min}{\max - \min}
Example: a sample of five consumers with Annual Income (Income, $) and Hours Spent Online per Week (Hours Spent)
Distances on raw data can be distorted because Income values are much larger than Hours Spent.
Standardize using z-scores for Income and Hours Spent.
Also apply min–max normalization for Income and Hours Spent.
Data (formatted from slides):
Jane: Income 125,678; Hours 2.5; Standardized income 1.2473; Standardized hours -1.5071; Normalized income 1.0000; Normalized hours 0.0000
Kevin: Income 65,901; Hours 10.1; Standardized income -1.1892; Standardized hours 1.0382; Normalized income 0.0000; Normalized hours 1.0000
Dolores: Income 75,550; Hours 5.8; Standardized income -0.7959; Standardized hours -0.4019; Normalized income 0.1614; Normalized hours 0.4342
Deshaun: Income 110,250; Hours 9.0; Standardized income 0.6184; Standardized hours 0.6698; Normalized income 0.7419; Normalized hours 0.8553
Mei: Income 98,005; Hours 7.6; Standardized income 0.1194; Standardized hours 0.2010; Normalized income 0.5371; Normalized hours 0.6711
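The Income column of the table above can be reproduced with a short script, shown here as a sketch using Python's standard library. Note that `statistics.stdev` uses the sample standard deviation (dividing by n − 1), which matches the standardized values in the table; the Hours Spent column can be rescaled the same way.

```python
import statistics

incomes = {"Jane": 125678, "Kevin": 65901, "Dolores": 75550,
           "Deshaun": 110250, "Mei": 98005}

mean = statistics.mean(incomes.values())
stdev = statistics.stdev(incomes.values())   # sample standard deviation (n - 1)
lo, hi = min(incomes.values()), max(incomes.values())

# Standardization: z = (x - mean) / stdev
z_scores = {name: (x - mean) / stdev for name, x in incomes.items()}
# Min-max normalization: x' = (x - min) / (max - min)
minmax = {name: (x - lo) / (hi - lo) for name, x in incomes.items()}

print(round(z_scores["Jane"], 4))   # 1.2473, matching the table
print(round(minmax["Dolores"], 4))  # 0.1614, matching the table
```

After rescaling, a $60,000 gap in income no longer swamps a few hours' difference in weekly online time when computing Euclidean or Manhattan distances.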
For numerical variables, Euclidean and Manhattan distance measures are suitable.
For categorical variables, use other measures of similarity. A categorical variable with only two categories is binary.
Two commonly used measures for categorical and binary variables:
Matching coefficient
Jaccard’s coefficient
Matching coefficient (for categorical data):
Based on matching values to determine similarity: the coefficient is the number of matching attribute values divided by the total number of attributes. The higher the value, the more similar the records.
A value of 1 implies a perfect match.
Example: list of college students with Major, Field, Sex, and whether the student is on the Dean’s List. Records:
1: Major = Business, Field = MIS, Sex = Female, Dean’s List = Yes
2: Major = Engineering, Field = Electrical, Sex = Male, Dean’s List = Yes
3: Major = Business, Field = Accounting, Sex = Female, Dean’s List = No
Pairwise similarities (1-2, 1-3, 2-3) are computed as the fraction of the four attributes that match; the slide lists the pairs, but the numeric results are not displayed in the provided excerpt.
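Since the three student records are given in full, the pairwise matching coefficients can be computed directly, as in this sketch:

```python
def matching_coefficient(rec1, rec2):
    """Fraction of attributes with identical values in the two records."""
    matches = sum(1 for a, b in zip(rec1, rec2) if a == b)
    return matches / len(rec1)

# Records from the example: (Major, Field, Sex, Dean's List)
students = {
    1: ("Business", "MIS", "Female", "Yes"),
    2: ("Engineering", "Electrical", "Male", "Yes"),
    3: ("Business", "Accounting", "Female", "No"),
}
print(matching_coefficient(students[1], students[2]))  # 0.25 (only Dean's List matches)
print(matching_coefficient(students[1], students[3]))  # 0.5  (Major and Sex match)
print(matching_coefficient(students[2], students[3]))  # 0.0  (no attributes match)
```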
Limitation: The matching coefficient does not distinguish between positive and negative outcomes, which may mislead similarity interpretation.
Jaccard’s coefficient (for binary/categorical data):
Ignores negative outcomes (mutual ‘No’ values); it considers only attributes present in at least one of the two records.
For binary attributes, defined as: J(A,B) = \frac{|A \cap B|}{|A \cup B|}
Example: Retail transaction data with Yes/No for five products across transactions. Data:
Transaction 1: Keyboard = Yes, Memory Card = No, Mouse = Yes, USB Drive = No, Headphone = Yes
Transaction 2: Keyboard = Yes, Memory Card = Yes, Mouse = Yes, USB Drive = No, Headphone = No
Transaction 3: Keyboard = No, Memory Card = No, Mouse = No, USB Drive = No, Headphone = Yes
Transaction 4: Keyboard = Yes, Memory Card = No, Mouse = No, USB Drive = No, Headphone = No
Task: Compute and compare the matching coefficients and Jaccard’s coefficients for all pairwise transactions.
Note: The slides apply both frameworks to these binary patterns; Jaccard’s coefficient focuses on shared presence (Yes–Yes) patterns, while the matching coefficient counts both Yes–Yes and No–No alignments.
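The contrast between the two coefficients can be sketched on the four transactions above, encoding Yes as 1 and No as 0:

```python
def matching_coefficient(t1, t2):
    """Counts both Yes-Yes and No-No agreements over all attributes."""
    return sum(1 for a, b in zip(t1, t2) if a == b) / len(t1)

def jaccard_coefficient(t1, t2):
    """Counts only Yes-Yes agreements; No-No pairs are ignored entirely."""
    both_yes = sum(1 for a, b in zip(t1, t2) if a == 1 and b == 1)
    either_yes = sum(1 for a, b in zip(t1, t2) if a == 1 or b == 1)
    return both_yes / either_yes if either_yes else 0.0

# 1 = Yes, 0 = No; columns: Keyboard, Memory Card, Mouse, USB Drive, Headphone
transactions = {
    1: (1, 0, 1, 0, 1),
    2: (1, 1, 1, 0, 0),
    3: (0, 0, 0, 0, 1),
    4: (1, 0, 0, 0, 0),
}
print(matching_coefficient(transactions[1], transactions[2]))  # 0.6
print(jaccard_coefficient(transactions[1], transactions[2]))   # 0.5
print(matching_coefficient(transactions[2], transactions[3]))  # 0.2
print(jaccard_coefficient(transactions[2], transactions[3]))   # 0.0
```

Transactions 2 and 3 illustrate the difference: the matching coefficient gives them some similarity (they agree on not buying a USB drive), while Jaccard’s coefficient is 0 because they share no purchased product.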
Introductory Case, continued: Practical implications
The case illustrates how different similarity measures (Euclidean/Manhattan for numeric data; Matching/Jaccard for categorical/binary data) influence clustering, segmentation, and model choice in marketing analytics.
Scaling and normalization are crucial before applying distance-based methods to ensure fair contribution from each feature.
Choosing the right similarity measure depends on data types (numeric vs categorical/binary) and analysis goals (emphasizing presence of attributes vs exact matches).